* [RFC PATCH 1/2] sched/wait: Add add_wait_queue_priority()
@ 2020-10-26 17:53 David Woodhouse
  2020-10-26 17:53 ` [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
  2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
  0 siblings, 2 replies; 29+ messages in thread

From: David Woodhouse @ 2020-10-26 17:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, linux-kernel, kvm

From: David Woodhouse <dwmw@amazon.co.uk>

This allows an exclusive wait_queue_entry to be added at the head of the
queue, instead of the tail as normal. Thus, it gets to consume events
first.

The problem I'm trying to solve here is interrupt remapping invalidation
vs. MSI interrupts from VFIO. I'd really like KVM IRQFD to be able to
consume events before (and indeed instead of) userspace.

When the remapped MSI target in the KVM routing table is invalidated,
the VMM needs to *deassociate* the IRQFD and fall back to handling the
next IRQ in userspace, so it can be retranslated and a fault reported if
appropriate.

It's possible to do that by constantly registering and deregistering the
fd in the userspace poll loop, but it gets ugly, especially because the
fallback handler isn't really local to the core MSI handling. It's much
nicer if the userspace handler can just remain registered all the time,
and simply not receive any events when KVM steals them first.

Which is precisely what happens with posted interrupts, and this makes
it consistent. (Unless I'm missing something that prevents posted
interrupts from working when there's another listener on the eventfd?)
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/wait.h | 12 +++++++++++-
 kernel/sched/wait.c  | 11 +++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..fe10e8570a52 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_BOOKMARK	0x04
 #define WQ_FLAG_CUSTOM		0x08
 #define WQ_FLAG_DONE		0x10
+#define WQ_FLAG_PRIORITY	0x20
 
 /*
  * A single wait-queue entry structure:
@@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
 
 extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 
 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add(&wq_entry->entry, &wq_head->head);
+	struct list_head *head = &wq_head->head;
+	struct wait_queue_entry *wq;
+
+	list_for_each_entry(wq, &wq_head->head, entry) {
+		if (!(wq->flags & WQ_FLAG_PRIORITY))
+			break;
+		head = &wq->entry;
+	}
+	list_add(&wq_entry->entry, head);
 }
 
 /*
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 01f5d3020589..d2a84c8e88bf 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
 
+void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+{
+	unsigned long flags;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_wait_queue_priority);
+
 void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-26 17:53 [RFC PATCH 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
@ 2020-10-26 17:53 ` David Woodhouse
  2020-10-27  8:01   ` Paolo Bonzini
  2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
  1 sibling, 1 reply; 29+ messages in thread

From: David Woodhouse @ 2020-10-26 17:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, linux-kernel, kvm

From: David Woodhouse <dwmw@amazon.co.uk>

As far as I can tell, when we use posted interrupts we silently cut off
the events from userspace, if it's listening on the same eventfd that
feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/eventfd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index d6408bb497dc..39443e2f72bf 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	struct kvm *kvm = irqfd->kvm;
 	unsigned seq;
 	int idx;
+	int ret = 0;
 
 	if (flags & EPOLLIN) {
 		idx = srcu_read_lock(&kvm->irq_srcu);
@@ -204,6 +205,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 					      false) == -EWOULDBLOCK)
 			schedule_work(&irqfd->inject);
 		srcu_read_unlock(&kvm->irq_srcu, idx);
+		ret = 1;
 	}
 
 	if (flags & EPOLLHUP) {
@@ -227,7 +229,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
 	}
 
-	return 0;
+	return ret;
 }
 
 static void
@@ -236,7 +238,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(pt, struct kvm_kernel_irqfd, pt);
-	add_wait_queue(wqh, &irqfd->wait);
+	add_wait_queue_priority(wqh, &irqfd->wait);
 }
 
 /* Must be called under irqfds.lock */
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-26 17:53 ` [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
@ 2020-10-27  8:01   ` Paolo Bonzini
  2020-10-27 10:15     ` David Woodhouse
  2020-10-27 13:55     ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  0 siblings, 2 replies; 29+ messages in thread

From: Paolo Bonzini @ 2020-10-27 8:01 UTC (permalink / raw)
  To: David Woodhouse, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel, kvm

On 26/10/20 18:53, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> As far as I can tell, when we use posted interrupts we silently cut off
> the events from userspace, if it's listening on the same eventfd that
> feeds the irqfd.
>
> I like that behaviour. Let's do it all the time, even without posted
> interrupts. It makes it much easier to handle IRQ remapping invalidation
> without having to constantly add/remove the fd from the userspace poll
> set. We can just leave userspace polling on it, and the bypass will...
> well... bypass it.

This looks good, though of course it depends on the somewhat hackish
patch 1.

However don't you need to read the eventfd as well, since userspace will
never be able to do so?

Paolo

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  virt/kvm/eventfd.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index d6408bb497dc..39443e2f72bf 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  	struct kvm *kvm = irqfd->kvm;
>  	unsigned seq;
>  	int idx;
> +	int ret = 0;
>
>  	if (flags & EPOLLIN) {
>  		idx = srcu_read_lock(&kvm->irq_srcu);
> @@ -204,6 +205,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  					      false) == -EWOULDBLOCK)
>  			schedule_work(&irqfd->inject);
>  		srcu_read_unlock(&kvm->irq_srcu, idx);
> +		ret = 1;
>  	}
>
>  	if (flags & EPOLLHUP) {
> @@ -227,7 +229,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
>  	}
>
> -	return 0;
> +	return ret;
>  }
>
>  static void
> @@ -236,7 +238,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>  {
>  	struct kvm_kernel_irqfd *irqfd =
>  		container_of(pt, struct kvm_kernel_irqfd, pt);
> -	add_wait_queue(wqh, &irqfd->wait);
> +	add_wait_queue_priority(wqh, &irqfd->wait);
>  }
>
>  /* Must be called under irqfds.lock */

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-27  8:01 ` Paolo Bonzini
@ 2020-10-27 10:15   ` David Woodhouse
  2020-10-27 13:55   ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  1 sibling, 0 replies; 29+ messages in thread

From: David Woodhouse @ 2020-10-27 10:15 UTC (permalink / raw)
  To: Paolo Bonzini, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel, kvm

On Tue, 2020-10-27 at 09:01 +0100, Paolo Bonzini wrote:
> This looks good, though of course it depends on the somewhat hackish
> patch 1.

I thought it was quite neat :)

> However don't you need to read the eventfd as well, since
> userspace will never be able to do so?

Yes. Although that's a separate cleanup, as it was already true before
my patch.

Right now, userspace needs to explicitly stop polling on the VFIO
eventfd while it's assigned as a KVM IRQFD (to avoid injecting
duplicate interrupts when the kernel isn't using PI and allows events
to leak). So it isn't going to consume the events in that case either.
Nothing's really changed. The VFIO virqfd is just the same.

The count just builds up when the kernel handles the events, and is
eventually cleared by eventfd_ctx_remove_wait_queue().

In both cases, that actually works fine because in practice the events
are raised by eventfd_signal() in the kernel, and that works even if
the count reaches ULLONG_MAX. It's just that sending further events
from *userspace* would block in that case.

Both of them theoretically want fixing, regardless of the priority
patch. Since the wq lock is held while the wakeup functions
(virqfd_wakeup or irqfd_wakeup for VFIO/KVM respectively) run, all they
really need to do is call eventfd_ctx_do_read() to consume the events.
I'll look at whether I can find a nicer option than just exporting
that.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd
  2020-10-27  8:01 ` Paolo Bonzini
  2020-10-27 10:15   ` David Woodhouse
@ 2020-10-27 13:55   ` David Woodhouse
  2020-10-27 13:55     ` [PATCH 1/3] eventfd: Export eventfd_ctx_do_read() David Woodhouse
  ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread

From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

Paolo pointed out that the KVM eventfd doesn't drain the events from
the irqfd as it handles them, and just lets them accumulate. This is
also true for the VFIO virqfd used for handling acks for
level-triggered IRQs.

Export eventfd_ctx_do_read() and make the wakeup functions call it as
they handle their respective events.

David Woodhouse (3):
  eventfd: Export eventfd_ctx_do_read()
  vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
  kvm/eventfd: Drain events from eventfd in irqfd_wakeup()

 drivers/vfio/virqfd.c   | 3 +++
 fs/eventfd.c            | 5 ++++-
 include/linux/eventfd.h | 6 ++++++
 virt/kvm/eventfd.c      | 3 +++
 4 files changed, 16 insertions(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH 1/3] eventfd: Export eventfd_ctx_do_read()
  2020-10-27 13:55 ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
@ 2020-10-27 13:55   ` David Woodhouse
  2020-10-27 13:55   ` [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup() David Woodhouse
  2020-10-27 13:55   ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
  2 siblings, 0 replies; 29+ messages in thread

From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

From: David Woodhouse <dwmw@amazon.co.uk>

Where events are consumed in the kernel, for example by KVM's
irqfd_wakeup() and VFIO's virqfd_wakeup(), they currently lack a
mechanism to drain the eventfd's counter.

Since the wait queue is already locked while the wakeup functions are
invoked, all they really need to do is call eventfd_ctx_do_read().

Add a check for the lock, and export it for them.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 fs/eventfd.c            | 5 ++++-
 include/linux/eventfd.h | 6 ++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index df466ef81ddd..e265b6dd4f34 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -182,11 +182,14 @@ static __poll_t eventfd_poll(struct file *file, poll_table *wait)
 	return events;
 }
 
-static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 {
+	lockdep_assert_held(&ctx->wqh.lock);
+
 	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
 	ctx->count -= *cnt;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_do_read);
 
 /**
  * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue.
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index dc4fd8a6644d..fa0a524baed0 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -41,6 +41,7 @@ struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
 				  __u64 *cnt);
+void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
 
 DECLARE_PER_CPU(int, eventfd_wake_count);
 
@@ -82,6 +83,11 @@ static inline bool eventfd_signal_count(void)
 	return false;
 }
 
+static inline void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+{
+
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
  2020-10-27 13:55 ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  2020-10-27 13:55   ` [PATCH 1/3] eventfd: Export eventfd_ctx_do_read() David Woodhouse
@ 2020-10-27 13:55   ` David Woodhouse
  2020-11-06 23:29     ` Alex Williamson
  2020-10-27 13:55   ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
  2 siblings, 1 reply; 29+ messages in thread

From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

From: David Woodhouse <dwmw@amazon.co.uk>

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/vfio/virqfd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/virqfd.c b/drivers/vfio/virqfd.c
index 997cb5d0a657..414e98d82b02 100644
--- a/drivers/vfio/virqfd.c
+++ b/drivers/vfio/virqfd.c
@@ -46,6 +46,9 @@ static int virqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void
 	__poll_t flags = key_to_poll(key);
 
 	if (flags & EPOLLIN) {
+		u64 cnt;
+		eventfd_ctx_do_read(virqfd->eventfd, &cnt);
+
 		/* An event has been signaled, call function */
 		if ((!virqfd->handler ||
 		     virqfd->handler(virqfd->opaque, virqfd->data)) &&
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
  2020-10-27 13:55 ` [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup() David Woodhouse
@ 2020-11-06 23:29   ` Alex Williamson
  2020-11-08  9:17     ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread

From: Alex Williamson @ 2020-11-06 23:29 UTC (permalink / raw)
  To: David Woodhouse, Bonzini, Paolo
  Cc: Cornelia Huck, Alexander Viro, Jens Axboe, kvm, linux-kernel,
	linux-fsdevel

On Tue, 27 Oct 2020 13:55:22 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Don't allow the events to accumulate in the eventfd counter, drain them
> as they are handled.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---

Acked-by: Alex Williamson <alex.williamson@redhat.com>

Paolo, I assume you'll add this to your queue.  Thanks,

Alex

> drivers/vfio/virqfd.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vfio/virqfd.c b/drivers/vfio/virqfd.c
> index 997cb5d0a657..414e98d82b02 100644
> --- a/drivers/vfio/virqfd.c
> +++ b/drivers/vfio/virqfd.c
> @@ -46,6 +46,9 @@ static int virqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void
>  	__poll_t flags = key_to_poll(key);
>
>  	if (flags & EPOLLIN) {
> +		u64 cnt;
> +		eventfd_ctx_do_read(virqfd->eventfd, &cnt);
> +
>  		/* An event has been signaled, call function */
>  		if ((!virqfd->handler ||
>  		     virqfd->handler(virqfd->opaque, virqfd->data)) &&

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
  2020-11-06 23:29 ` Alex Williamson
@ 2020-11-08  9:17   ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread

From: Paolo Bonzini @ 2020-11-08 9:17 UTC (permalink / raw)
  To: Alex Williamson, David Woodhouse
  Cc: Cornelia Huck, Alexander Viro, Jens Axboe, kvm, linux-kernel,
	linux-fsdevel

On 07/11/20 00:29, Alex Williamson wrote:
>> From: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Don't allow the events to accumulate in the eventfd counter, drain them
>> as they are handled.
>>
>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>> ---
> Acked-by: Alex Williamson <alex.williamson@redhat.com>
>
> Paolo, I assume you'll add this to your queue.  Thanks,
>
> Alex

Yes, thanks.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup()
  2020-10-27 13:55 ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  2020-10-27 13:55   ` [PATCH 1/3] eventfd: Export eventfd_ctx_do_read() David Woodhouse
  2020-10-27 13:55   ` [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup() David Woodhouse
@ 2020-10-27 13:55   ` David Woodhouse
  2020-10-27 18:41     ` kernel test robot
  ` (2 more replies)
  2 siblings, 3 replies; 29+ messages in thread

From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

From: David Woodhouse <dwmw@amazon.co.uk>

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/eventfd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index d6408bb497dc..98b5cfa1d69f 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -193,6 +193,9 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	int idx;
 
 	if (flags & EPOLLIN) {
+		u64 cnt;
+		eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
+
 		idx = srcu_read_lock(&kvm->irq_srcu);
 		do {
 			seq = read_seqcount_begin(&irqfd->irq_entry_sc);
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup()
  2020-10-27 13:55 ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
@ 2020-10-27 18:41   ` kernel test robot
  2020-10-27 21:42   ` kernel test robot
  2020-10-27 23:13   ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread

From: kernel test robot @ 2020-10-27 18:41 UTC (permalink / raw)
  To: David Woodhouse, bonzini
  Cc: kbuild-all, Alex Williamson, Cornelia Huck, Alexander Viro,
	Jens Axboe, kvm, linux-kernel, linux-fsdevel

Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on vhost/linux-next linus/master kvm/linux-next v5.10-rc1 next-20201027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s021-20201027 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-56-gc09e8239-dirty
        # https://github.com/0day-ci/linux/commit/dc45dd9af28fede8f8dd29b705b90f78cf87538c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
        git checkout dc45dd9af28fede8f8dd29b705b90f78cf87538c
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/x86/kvm/../../../virt/kvm/eventfd.c: In function 'irqfd_wakeup':
>> arch/x86/kvm/../../../virt/kvm/eventfd.c:197:23: error: passing argument 1 of 'eventfd_ctx_do_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
     197 |   eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
         |                       ^~~~~~~~~~~~~~~
         |                       |
         |                       struct eventfd_ctx **
   In file included from arch/x86/kvm/../../../virt/kvm/eventfd.c:21:
   include/linux/eventfd.h:44:46: note: expected 'struct eventfd_ctx *' but argument is of type 'struct eventfd_ctx **'
      44 | void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
         |      ~~~~~~~~~~~~~~~~~~~~^~~
   cc1: some warnings being treated as errors

vim +/eventfd_ctx_do_read +197 arch/x86/kvm/../../../virt/kvm/eventfd.c

   180	
   181	/*
   182	 * Called with wqh->lock held and interrupts disabled
   183	 */
   184	static int
   185	irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
   186	{
   187		struct kvm_kernel_irqfd *irqfd =
   188			container_of(wait, struct kvm_kernel_irqfd, wait);
   189		__poll_t flags = key_to_poll(key);
   190		struct kvm_kernel_irq_routing_entry irq;
   191		struct kvm *kvm = irqfd->kvm;
   192		unsigned seq;
   193		int idx;
   194	
   195		if (flags & EPOLLIN) {
   196			u64 cnt;
 > 197			eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
   198	
   199			idx = srcu_read_lock(&kvm->irq_srcu);
   200			do {
   201				seq = read_seqcount_begin(&irqfd->irq_entry_sc);
   202				irq = irqfd->irq_entry;
   203			} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
   204			/* An event has been signaled, inject an interrupt */
   205			if (kvm_arch_set_irq_inatomic(&irq, kvm,
   206						      KVM_USERSPACE_IRQ_SOURCE_ID, 1,
   207						      false) == -EWOULDBLOCK)
   208				schedule_work(&irqfd->inject);
   209			srcu_read_unlock(&kvm->irq_srcu, idx);
   210		}
   211	
   212		if (flags & EPOLLHUP) {
   213			/* The eventfd is closing, detach from KVM */
   214			unsigned long iflags;
   215	
   216			spin_lock_irqsave(&kvm->irqfds.lock, iflags);
   217	
   218			/*
   219			 * We must check if someone deactivated the irqfd before
   220			 * we could acquire the irqfds.lock since the item is
   221			 * deactivated from the KVM side before it is unhooked from
   222			 * the wait-queue. If it is already deactivated, we can
   223			 * simply return knowing the other side will cleanup for us.
   224			 * We cannot race against the irqfd going away since the
   225			 * other side is required to acquire wqh->lock, which we hold
   226			 */
   227			if (irqfd_is_active(irqfd))
   228				irqfd_deactivate(irqfd);
   229	
   230			spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
   231		}
   232	
   233		return 0;
   234	}
   235	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 42031 bytes --]

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() 2020-10-27 13:55 ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse 2020-10-27 18:41 ` kernel test robot @ 2020-10-27 21:42 ` kernel test robot 2020-10-27 23:13 ` kernel test robot 2 siblings, 0 replies; 29+ messages in thread From: kernel test robot @ 2020-10-27 21:42 UTC (permalink / raw) To: David Woodhouse, bonzini Cc: kbuild-all, clang-built-linux, Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm, linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 11205 bytes --] Hi David, I love your patch! Yet something to improve: [auto build test ERROR on vfio/next] [also build test ERROR on vhost/linux-next linus/master kvm/linux-next v5.10-rc1 next-20201027] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658 base: https://github.com/awilliam/linux-vfio.git next config: s390-randconfig-r023-20201027 (attached as .config) compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project f2c25c70791de95d2466e09b5b58fc37f6ccd7a4) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install s390 cross compiling tool for clang build # apt-get install binutils-s390x-linux-gnu # https://github.com/0day-ci/linux/commit/dc45dd9af28fede8f8dd29b705b90f78cf87538c git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658 git checkout dc45dd9af28fede8f8dd29b705b90f78cf87538c # save the attached .config to linux build tree 
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): In file included from arch/s390/include/asm/kvm_para.h:25: In file included from arch/s390/include/asm/diag.h:12: In file included from include/linux/if_ether.h:19: In file included from include/linux/skbuff.h:31: In file included from include/linux/dma-mapping.h:11: In file included from include/linux/scatterlist.h:9: In file included from arch/s390/include/asm/io.h:72: include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu' #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x)) ^ include/uapi/linux/swab.h:119:21: note: expanded from macro '__swab32' ___constant_swab32(x) : \ ^ include/uapi/linux/swab.h:21:12: note: expanded from macro '___constant_swab32' (((__u32)(x) & (__u32)0x00ff0000UL) >> 8) | \ ^ In file included from arch/s390/kvm/../../../virt/kvm/eventfd.c:12: In file included from include/linux/kvm_host.h:32: In file included from include/linux/kvm_para.h:5: In file included from include/uapi/linux/kvm_para.h:36: In file included from arch/s390/include/asm/kvm_para.h:25: In file included from arch/s390/include/asm/diag.h:12: In file included from include/linux/if_ether.h:19: In file included from include/linux/skbuff.h:31: In file included from include/linux/dma-mapping.h:11: In file included from include/linux/scatterlist.h:9: In file included from arch/s390/include/asm/io.h:72: include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + 
addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu' #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x)) ^ include/uapi/linux/swab.h:119:21: note: expanded from macro '__swab32' ___constant_swab32(x) : \ ^ include/uapi/linux/swab.h:22:12: note: expanded from macro '___constant_swab32' (((__u32)(x) & (__u32)0xff000000UL) >> 24))) ^ In file included from arch/s390/kvm/../../../virt/kvm/eventfd.c:12: In file included from include/linux/kvm_host.h:32: In file included from include/linux/kvm_para.h:5: In file included from include/uapi/linux/kvm_para.h:36: In file included from arch/s390/include/asm/kvm_para.h:25: In file included from arch/s390/include/asm/diag.h:12: In file included from include/linux/if_ether.h:19: In file included from include/linux/skbuff.h:31: In file included from include/linux/dma-mapping.h:11: In file included from include/linux/scatterlist.h:9: In file included from arch/s390/include/asm/io.h:72: include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); ~~~~~~~~~~ ^ include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu' #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x)) ^ include/uapi/linux/swab.h:120:12: note: expanded from macro '__swab32' __fswab32(x)) ^ In file included from arch/s390/kvm/../../../virt/kvm/eventfd.c:12: In file included from include/linux/kvm_host.h:32: In file included from include/linux/kvm_para.h:5: In file included from include/uapi/linux/kvm_para.h:36: In file included from arch/s390/include/asm/kvm_para.h:25: In file included from arch/s390/include/asm/diag.h:12: In file included from include/linux/if_ether.h:19: In file included from include/linux/skbuff.h:31: In file included from include/linux/dma-mapping.h:11: In file included from 
include/linux/scatterlist.h:9: In file included from arch/s390/include/asm/io.h:72: include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writeb(value, PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr); ~~~~~~~~~~ ^ include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsb(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsw(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] readsl(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesb(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesw(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] writesl(PCI_IOBASE + addr, buffer, count); ~~~~~~~~~~ ^ >> arch/s390/kvm/../../../virt/kvm/eventfd.c:197:23: error: incompatible pointer types passing 'struct 
eventfd_ctx **' to parameter of type 'struct eventfd_ctx *'; remove & [-Werror,-Wincompatible-pointer-types] eventfd_ctx_do_read(&irqfd->eventfd, &cnt); ^~~~~~~~~~~~~~~ include/linux/eventfd.h:44:46: note: passing argument to parameter 'ctx' here void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt); ^ 20 warnings and 1 error generated. vim +197 arch/s390/kvm/../../../virt/kvm/eventfd.c 180 181 /* 182 * Called with wqh->lock held and interrupts disabled 183 */ 184 static int 185 irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) 186 { 187 struct kvm_kernel_irqfd *irqfd = 188 container_of(wait, struct kvm_kernel_irqfd, wait); 189 __poll_t flags = key_to_poll(key); 190 struct kvm_kernel_irq_routing_entry irq; 191 struct kvm *kvm = irqfd->kvm; 192 unsigned seq; 193 int idx; 194 195 if (flags & EPOLLIN) { 196 u64 cnt; > 197 eventfd_ctx_do_read(&irqfd->eventfd, &cnt); 198 199 idx = srcu_read_lock(&kvm->irq_srcu); 200 do { 201 seq = read_seqcount_begin(&irqfd->irq_entry_sc); 202 irq = irqfd->irq_entry; 203 } while (read_seqcount_retry(&irqfd->irq_entry_sc, seq)); 204 /* An event has been signaled, inject an interrupt */ 205 if (kvm_arch_set_irq_inatomic(&irq, kvm, 206 KVM_USERSPACE_IRQ_SOURCE_ID, 1, 207 false) == -EWOULDBLOCK) 208 schedule_work(&irqfd->inject); 209 srcu_read_unlock(&kvm->irq_srcu, idx); 210 } 211 212 if (flags & EPOLLHUP) { 213 /* The eventfd is closing, detach from KVM */ 214 unsigned long iflags; 215 216 spin_lock_irqsave(&kvm->irqfds.lock, iflags); 217 218 /* 219 * We must check if someone deactivated the irqfd before 220 * we could acquire the irqfds.lock since the item is 221 * deactivated from the KVM side before it is unhooked from 222 * the wait-queue. If it is already deactivated, we can 223 * simply return knowing the other side will cleanup for us. 
224 * We cannot race against the irqfd going away since the 225 * other side is required to acquire wqh->lock, which we hold 226 */ 227 if (irqfd_is_active(irqfd)) 228 irqfd_deactivate(irqfd); 229 230 spin_unlock_irqrestore(&kvm->irqfds.lock, iflags); 231 } 232 233 return 0; 234 } 235 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 26571 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup()
  2020-10-27 13:55 ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
  2020-10-27 18:41   ` kernel test robot
  2020-10-27 21:42   ` kernel test robot
@ 2020-10-27 23:13   ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-27 23:13 UTC (permalink / raw)
  To: David Woodhouse, bonzini
  Cc: kbuild-all, clang-built-linux, Alex Williamson, Cornelia Huck,
	Alexander Viro, Jens Axboe, kvm, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 4377 bytes --]

Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on vhost/linux-next linus/master kvm/linux-next v5.10-rc1 next-20201027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-a004-20201026 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project f2c25c70791de95d2466e09b5b58fc37f6ccd7a4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/dc45dd9af28fede8f8dd29b705b90f78cf87538c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
        git checkout dc45dd9af28fede8f8dd29b705b90f78cf87538c
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/x86/kvm/../../../virt/kvm/eventfd.c:197:23: error: incompatible pointer types passing 'struct eventfd_ctx **' to parameter of type 'struct eventfd_ctx *'; remove & [-Werror,-Wincompatible-pointer-types]
                   eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
                                       ^~~~~~~~~~~~~~~
   include/linux/eventfd.h:44:46: note: passing argument to parameter 'ctx' here
   void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
                                                ^
   1 error generated.


vim +197 arch/x86/kvm/../../../virt/kvm/eventfd.c

   180	
   181	/*
   182	 * Called with wqh->lock held and interrupts disabled
   183	 */
   184	static int
   185	irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
   186	{
   187		struct kvm_kernel_irqfd *irqfd =
   188			container_of(wait, struct kvm_kernel_irqfd, wait);
   189		__poll_t flags = key_to_poll(key);
   190		struct kvm_kernel_irq_routing_entry irq;
   191		struct kvm *kvm = irqfd->kvm;
   192		unsigned seq;
   193		int idx;
   194	
   195		if (flags & EPOLLIN) {
   196			u64 cnt;
 > 197			eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
   198	
   199			idx = srcu_read_lock(&kvm->irq_srcu);
   200			do {
   201				seq = read_seqcount_begin(&irqfd->irq_entry_sc);
   202				irq = irqfd->irq_entry;
   203			} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
   204			/* An event has been signaled, inject an interrupt */
   205			if (kvm_arch_set_irq_inatomic(&irq, kvm,
   206					KVM_USERSPACE_IRQ_SOURCE_ID, 1,
   207					false) == -EWOULDBLOCK)
   208				schedule_work(&irqfd->inject);
   209			srcu_read_unlock(&kvm->irq_srcu, idx);
   210		}
   211	
   212		if (flags & EPOLLHUP) {
   213			/* The eventfd is closing, detach from KVM */
   214			unsigned long iflags;
   215	
   216			spin_lock_irqsave(&kvm->irqfds.lock, iflags);
   217	
   218			/*
   219			 * We must check if someone deactivated the irqfd before
   220			 * we could acquire the irqfds.lock since the item is
   221			 * deactivated from the KVM side before it is unhooked from
   222			 * the wait-queue. If it is already deactivated, we can
   223			 * simply return knowing the other side will cleanup for us.
   224			 * We cannot race against the irqfd going away since the
   225			 * other side is required to acquire wqh->lock, which we hold
   226			 */
   227			if (irqfd_is_active(irqfd))
   228				irqfd_deactivate(irqfd);
   229	
   230			spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
   231		}
   232	
   233		return 0;
   234	}
   235	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40494 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events 2020-10-26 17:53 [RFC PATCH 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse 2020-10-26 17:53 ` [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse @ 2020-10-27 14:39 ` David Woodhouse 2020-10-27 14:39 ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse 2020-10-27 14:39 ` [PATCH v2 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse 1 sibling, 2 replies; 29+ messages in thread From: David Woodhouse @ 2020-10-27 14:39 UTC (permalink / raw) To: linux-kernel Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Paolo Bonzini, kvm When posted interrupts are in use, KVM fully bypasses the eventfd and delivers events directly to the appropriate vCPU. Without posted interrupts, it still uses the eventfd but it doesn't actually stop userspace from receiving the events too. This leaves userspace having to carefully avoid seeing the same events and injecting duplicate interrupts to the guest. Fix it by adding a 'priority' mode for exclusive waiters which puts them at the head of the list, where they can consume events before the non-exclusive waiters are woken. v2: • Drop [RFC]. This seems to be working nicely, and userspace is a lot cleaner without having to mess around with adding/removing the eventfd to its poll set. And nobody yelled at me. Yet. • Reword commit comments, update comment above __wake_up_common() • Rebase to be applied after the (only vaguely related) fix to make irqfd actually consume the eventfd counter too. 
David Woodhouse (2): sched/wait: Add add_wait_queue_priority() kvm/eventfd: Use priority waitqueue to catch events before userspace include/linux/wait.h | 12 +++++++++++- kernel/sched/wait.c | 17 ++++++++++++++++- virt/kvm/eventfd.c | 6 ++++-- ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() 2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse @ 2020-10-27 14:39 ` David Woodhouse 2020-10-27 19:09 ` Peter Zijlstra 2020-10-28 14:35 ` Peter Zijlstra 2020-10-27 14:39 ` [PATCH v2 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse 1 sibling, 2 replies; 29+ messages in thread From: David Woodhouse @ 2020-10-27 14:39 UTC (permalink / raw) To: linux-kernel Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Paolo Bonzini, kvm From: David Woodhouse <dwmw@amazon.co.uk> This allows an exclusive wait_queue_entry to be added at the head of the queue, instead of the tail as normal. Thus, it gets to consume events first without allowing non-exclusive waiters to be woken at all. The (first) intended use is for KVM IRQFD, which currently has inconsistent behaviour depending on whether posted interrupts are available or not. If they are, KVM will bypass the eventfd completely and deliver interrupts directly to the appropriate vCPU. If not, events are delivered through the eventfd and userspace will receive them when polling on the eventfd. By using add_wait_queue_priority(), KVM will be able to consistently consume events within the kernel without accidentally exposing them to userspace when they're supposed to be bypassed. This, in turn, means that userspace doesn't have to jump through hoops to avoid listening on the erroneously noisy eventfd and injecting duplicate interrupts. 
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/wait.h | 12 +++++++++++-
 kernel/sched/wait.c  | 17 ++++++++++++++++-
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..fe10e8570a52 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_BOOKMARK	0x04
 #define WQ_FLAG_CUSTOM		0x08
 #define WQ_FLAG_DONE		0x10
+#define WQ_FLAG_PRIORITY	0x20
 
 /*
  * A single wait-queue entry structure:
@@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
 
 extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 
 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add(&wq_entry->entry, &wq_head->head);
+	struct list_head *head = &wq_head->head;
+	struct wait_queue_entry *wq;
+
+	list_for_each_entry(wq, &wq_head->head, entry) {
+		if (!(wq->flags & WQ_FLAG_PRIORITY))
+			break;
+		head = &wq->entry;
+	}
+	list_add(&wq_entry->entry, head);
 }
 
 /*
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 01f5d3020589..183cc6ae68a6 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
 
+void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+{
+	unsigned long flags;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_wait_queue_priority);
+
 void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
@@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
- * number) then we wake all the non-exclusive tasks and one exclusive task.
+ * number) then we wake that number of exclusive tasks, and potentially all
+ * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
+ * the list and any non-exclusive tasks will be woken first. A priority task
+ * may be at the head of the list, and can consume the event without any other
+ * tasks being woken.
  *
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() 2020-10-27 14:39 ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse @ 2020-10-27 19:09 ` Peter Zijlstra 2020-10-27 19:27 ` David Woodhouse 2020-10-28 14:35 ` Peter Zijlstra 1 sibling, 1 reply; 29+ messages in thread From: Peter Zijlstra @ 2020-10-27 19:09 UTC (permalink / raw) To: David Woodhouse Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: > From: David Woodhouse <dwmw@amazon.co.uk> > > This allows an exclusive wait_queue_entry to be added at the head of the > queue, instead of the tail as normal. Thus, it gets to consume events > first without allowing non-exclusive waiters to be woken at all. > > The (first) intended use is for KVM IRQFD, which currently has Do you have more? You could easily special case this inside the KVM code. I don't _think_ the other users of __add_wait_queue() will mind the extra branch, but what do I know. > inconsistent behaviour depending on whether posted interrupts are > available or not. If they are, KVM will bypass the eventfd completely > and deliver interrupts directly to the appropriate vCPU. If not, events > are delivered through the eventfd and userspace will receive them when > polling on the eventfd. > > By using add_wait_queue_priority(), KVM will be able to consistently > consume events within the kernel without accidentally exposing them > to userspace when they're supposed to be bypassed. This, in turn, means > that userspace doesn't have to jump through hoops to avoid listening > on the erroneously noisy eventfd and injecting duplicate interrupts. 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> > --- > include/linux/wait.h | 12 +++++++++++- > kernel/sched/wait.c | 17 ++++++++++++++++- > 2 files changed, 27 insertions(+), 2 deletions(-) > > diff --git a/include/linux/wait.h b/include/linux/wait.h > index 27fb99cfeb02..fe10e8570a52 100644 > --- a/include/linux/wait.h > +++ b/include/linux/wait.h > @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int > #define WQ_FLAG_BOOKMARK 0x04 > #define WQ_FLAG_CUSTOM 0x08 > #define WQ_FLAG_DONE 0x10 > +#define WQ_FLAG_PRIORITY 0x20 > > /* > * A single wait-queue entry structure: > @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head) > > extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); > > static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry) > { > - list_add(&wq_entry->entry, &wq_head->head); > + struct list_head *head = &wq_head->head; > + struct wait_queue_entry *wq; > + > + list_for_each_entry(wq, &wq_head->head, entry) { > + if (!(wq->flags & WQ_FLAG_PRIORITY)) > + break; > + head = &wq->entry; > + } > + list_add(&wq_entry->entry, head); > } So you're adding the PRIORITY things to the head of the list and need the PRIORITY flag to keep them in FIFO order there, right? While looking at this I found that weird __add_wait_queue_exclusive() which is used by fs/eventpoll.c and does something similar, except it doesn't keep the FIFO order. The Changelog doesn't state how important this property is to you. 
> /* > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c > index 01f5d3020589..183cc6ae68a6 100644 > --- a/kernel/sched/wait.c > +++ b/kernel/sched/wait.c > @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue > } > EXPORT_SYMBOL(add_wait_queue_exclusive); > > +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry) > +{ > + unsigned long flags; > + > + wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY; > + spin_lock_irqsave(&wq_head->lock, flags); > + __add_wait_queue(wq_head, wq_entry); > + spin_unlock_irqrestore(&wq_head->lock, flags); > +} > +EXPORT_SYMBOL_GPL(add_wait_queue_priority); > + > void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry) > { > unsigned long flags; > @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue); > /* > * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just > * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve > - * number) then we wake all the non-exclusive tasks and one exclusive task. > + * number) then we wake that number of exclusive tasks, and potentially all > + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of > + * the list and any non-exclusive tasks will be woken first. A priority task > + * may be at the head of the list, and can consume the event without any other > + * tasks being woken. > * > * There are circumstances in which we can try to wake a task which has already > * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns > -- > 2.26.2 > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() 2020-10-27 19:09 ` Peter Zijlstra @ 2020-10-27 19:27 ` David Woodhouse 2020-10-27 20:30 ` Peter Zijlstra 0 siblings, 1 reply; 29+ messages in thread From: David Woodhouse @ 2020-10-27 19:27 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov [-- Attachment #1: Type: text/plain, Size: 2950 bytes --] On Tue, 2020-10-27 at 20:09 +0100, Peter Zijlstra wrote: > On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote: > > From: David Woodhouse <dwmw@amazon.co.uk> > > > > This allows an exclusive wait_queue_entry to be added at the head of the > > queue, instead of the tail as normal. Thus, it gets to consume events > > first without allowing non-exclusive waiters to be woken at all. > > > > The (first) intended use is for KVM IRQFD, which currently has > > Do you have more? You could easily special case this inside the KVM > code. I don't have more right now. What is the easy special case that you see? > I don't _think_ the other users of __add_wait_queue() will mind the > extra branch, but what do I know. I suppose we could add an unlikely() in there. It seemed like premature optimisation. > > static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry) > > { > > - list_add(&wq_entry->entry, &wq_head->head); > > + struct list_head *head = &wq_head->head; > > + struct wait_queue_entry *wq; > > + > > + list_for_each_entry(wq, &wq_head->head, entry) { > > + if (!(wq->flags & WQ_FLAG_PRIORITY)) > > + break; > > + head = &wq->entry; > > + } > > + list_add(&wq_entry->entry, head); > > } > > So you're adding the PRIORITY things to the head of the list and need > the PRIORITY flag to keep them in FIFO order there, right? 
No, I don't care about the order of priority entries; there will
typically be only one of them; that's the point. (I'd have used the
word 'exclusive' if that wasn't already in use for something that...
well... isn't.)

I only care that the priority entries come *before* the bog-standard
non-exclusive entries (like ep_poll_callback).

The priority items end up getting added in FIFO order purely by chance,
because it was simpler to use the same insertion flow for both priority
and normal non-exclusive entries instead of making a new case. So they
all get inserted behind any existing priority entries.

> While looking at this I found that weird __add_wait_queue_exclusive()
> which is used by fs/eventpoll.c and does something similar, except it
> doesn't keep the FIFO order.

It does, doesn't it? Except those so-called "exclusive" entries end up
in FIFO order amongst themselves at the *tail* of the queue, to be
woken up only after all the other entries before them *haven't* been
excluded.

> The Changelog doesn't state how important this property is to you.

Because it isn't :)

The ordering is:

  { PRIORITY }* { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*

I care that PRIORITY comes before the others, because I want to
actually exclude the others. Especially the "non-exclusive" ones, which
the 'exclusive' ones don't actually exclude.

I absolutely don't care about ordering *within* the set of PRIORITY
entries, since as I said I expect there to be only one.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() 2020-10-27 19:27 ` David Woodhouse @ 2020-10-27 20:30 ` Peter Zijlstra 2020-10-27 20:49 ` David Woodhouse 2020-10-27 21:32 ` David Woodhouse 0 siblings, 2 replies; 29+ messages in thread From: Peter Zijlstra @ 2020-10-27 20:30 UTC (permalink / raw) To: David Woodhouse Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote: > > While looking at this I found that weird __add_wait_queue_exclusive() > > which is used by fs/eventpoll.c and does something similar, except it > > doesn't keep the FIFO order. > > It does, doesn't it? Except those so-called "exclusive" entries end up > in FIFO order amongst themselves at the *tail* of the queue, to be > woken up only after all the other entries before them *haven't* been > excluded. __add_wait_queue_exclusive() uses __add_wait_queue() which does list_add(). It does _not_ add at the tail like normal exclusive users, and there is exactly _1_ user in tree that does this. I'm not exactly sure how this happened, but: add_wait_queue_exclusive() and __add_wait_queue_exclusive() are not related :-( > > The Changelog doesn't state how important this property is to you. > > Because it isn't :) > > The ordering is: > > { PRIORITY }* { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }* > > I care that PRIORITY comes before the others, because I want to > actually exclude the others. Especially the "non-exclusive" ones, which > the 'exclusive' ones don't actually exclude. > > I absolutely don't care about ordering *within* the set of PRIORITY > entries, since as I said I expect there to be only one. Then you could arguably do something like: spin_lock_irqsave(&wq_head->lock, flags); __add_wait_queue_exclusive(wq_head, wq_entry); spin_unlock_irqrestore(&wq_head->lock, flags); and leave it at that. 
But now I'm itching to fix that horrible naming... tomorrow perhaps. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() 2020-10-27 20:30 ` Peter Zijlstra @ 2020-10-27 20:49 ` David Woodhouse 2020-10-27 21:32 ` David Woodhouse 1 sibling, 0 replies; 29+ messages in thread From: David Woodhouse @ 2020-10-27 20:49 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov [-- Attachment #1: Type: text/plain, Size: 2123 bytes --] On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote: > On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote: > > > > While looking at this I found that weird __add_wait_queue_exclusive() > > > which is used by fs/eventpoll.c and does something similar, except it > > > doesn't keep the FIFO order. > > > > It does, doesn't it? Except those so-called "exclusive" entries end up > > in FIFO order amongst themselves at the *tail* of the queue, to be > > woken up only after all the other entries before them *haven't* been > > excluded. > > __add_wait_queue_exclusive() uses __add_wait_queue() which does > list_add(). It does _not_ add at the tail like normal exclusive users, > and there is exactly _1_ user in tree that does this. > > I'm not exactly sure how this happened, but: > > add_wait_queue_exclusive() > > and > > __add_wait_queue_exclusive() > > are not related :-( Oh, that is *particularly* special. It sounds like the __add_wait_queue_exclusive() version is a half-baked attempt at doing what I'm doing here, except.... > > > The Changelog doesn't state how important this property is to you. > > > > Because it isn't :) > > > > The ordering is: > > > > { PRIORITY }* { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }* > > > > I care that PRIORITY comes before the others, because I want to > > actually exclude the others. Especially the "non-exclusive" ones, which > > the 'exclusive' ones don't actually exclude. 
> > > > I absolutely don't care about ordering *within* the set of PRIORITY > > entries, since as I said I expect there to be only one. > > Then you could arguably do something like: > > spin_lock_irqsave(&wq_head->lock, flags); > __add_wait_queue_exclusive(wq_head, wq_entry); > spin_unlock_irqrestore(&wq_head->lock, flags); > > and leave it at that. .. the problem with that is that other waiters *can* end up on the queue before it, if they are added later. I don't know if the existing user (ep_poll) cares, but I do. > But now I'm itching to fix that horrible naming... tomorrow perhaps. :) [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5174 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 20:30         ` Peter Zijlstra
  2020-10-27 20:49           ` David Woodhouse
@ 2020-10-27 21:32           ` David Woodhouse
  2020-10-28 14:20             ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 21:32 UTC (permalink / raw)
  To: Peter Zijlstra, Davide Libenzi, Davi E. M. Arnaut, davi
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 1526 bytes --]

On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > which is used by fs/eventpoll.c and does something similar, except it
> > > doesn't keep the FIFO order.
> > 
> > It does, doesn't it? Except those so-called "exclusive" entries end up
> > in FIFO order amongst themselves at the *tail* of the queue, to be
> > woken up only after all the other entries before them *haven't* been
> > excluded.
> 
> __add_wait_queue_exclusive() uses __add_wait_queue() which does
> list_add(). It does _not_ add at the tail like normal exclusive users,
> and there is exactly _1_ user in tree that does this.
> 
> I'm not exactly sure how this happened, but:
> 
>   add_wait_queue_exclusive()
> 
> and
> 
>   __add_wait_queue_exclusive()
> 
> are not related :-(

I think that goes all the way back to here:

https://lkml.org/lkml/2007/5/4/530

It was rounded up in commit d47de16c72 and subsequently "cleaned up"
into an inline in wait.h, but I don't think there was ever a reason for
it to be added to the head of the list instead of the tail.

So I think we can reasonably make __add_wait_queue_exclusive() do
precisely the same thing as add_wait_queue_exclusive() does (modulo
locking).
And then potentially rename them both to something that isn't quite such a lie. And give me the one I want that *does* actually exclude other waiters :) [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5174 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() 2020-10-27 21:32 ` David Woodhouse @ 2020-10-28 14:20 ` Peter Zijlstra 2020-10-28 14:44 ` Paolo Bonzini 0 siblings, 1 reply; 29+ messages in thread From: Peter Zijlstra @ 2020-10-28 14:20 UTC (permalink / raw) To: David Woodhouse Cc: Davide Libenzi, Davi E. M. Arnaut, davi, linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov On Tue, Oct 27, 2020 at 09:32:11PM +0000, David Woodhouse wrote: > On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote: > > On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote: > > > > > > While looking at this I found that weird __add_wait_queue_exclusive() > > > > which is used by fs/eventpoll.c and does something similar, except it > > > > doesn't keep the FIFO order. > > > > > > It does, doesn't it? Except those so-called "exclusive" entries end up > > > in FIFO order amongst themselves at the *tail* of the queue, to be > > > woken up only after all the other entries before them *haven't* been > > > excluded. > > > > __add_wait_queue_exclusive() uses __add_wait_queue() which does > > list_add(). It does _not_ add at the tail like normal exclusive users, > > and there is exactly _1_ user in tree that does this. > > > > I'm not exactly sure how this happened, but: > > > > add_wait_queue_exclusive() > > > > and > > > > __add_wait_queue_exclusive() > > > > are not related :-( > > I think that goes all the way back to here: > > https://lkml.org/lkml/2007/5/4/530 > > It was rounded up in commit d47de16c72and subsequently "cleaned up" > into an inline in wait.h, but I don't think there was ever a reason for > it to be added to the head of the list instead of the tail. Maybe, I'm not sure I can tell in a hurry. 
I've opted to undo the above 'cleanups'

> So I think we can reasonably make __add_wait_queue_exclusive() do
> precisely the same thing as add_wait_queue_exclusive() does (modulo
> locking).

Aye, see below.

> And then potentially rename them both to something that isn't quite
> such a lie. And give me the one I want that *does* actually exclude
> other waiters :)

I don't think we want to do that; people are very much used to the
current semantics.

I also very much want to do:
s/__add_wait_queue_entry_tail/__add_wait_queue_tail/ on top of all this.

Anyway, I'll agree to your patch. How do we route this? Shall I take
the waitqueue thing and stick it in a topic branch for Paolo so he can
then merge that and the kvm bits on top into the KVM tree?

---
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4df61129566d..a2a7e1e339f6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1895,10 +1895,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	 */
 	eavail = ep_events_available(ep);
 	if (!eavail) {
-		if (signal_pending(current))
+		if (signal_pending(current)) {
 			res = -EINTR;
-		else
-			__add_wait_queue_exclusive(&ep->wq, &wait);
+		} else {
+			wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue(wq_head, wq_entry);
+		}
 	}
 
 	write_unlock_irq(&ep->lock);
diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 538e839590ef..8cac3589f365 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -86,7 +86,7 @@ static int wait_for_free(struct slot_map *m)
 	do {
 		long n = left, t;
 		if (likely(list_empty(&wait.entry)))
-			__add_wait_queue_entry_tail_exclusive(&m->q, &wait);
+			__add_wait_queue_exclusive(&m->q, &wait);
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		if (m->c > 0)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..4b8c4ece13f7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -171,23 +171,13 @@ static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait
 	list_add(&wq_entry->entry, &wq_head->head);
 }
 
-/*
- * Used for wake-one threads:
- */
-static inline void
-__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
-{
-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue(wq_head, wq_entry);
-}
-
 static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	list_add_tail(&wq_entry->entry, &wq_head->head);
 }
 
 static inline void
-__add_wait_queue_entry_tail_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
 	__add_wait_queue_entry_tail(wq_head, wq_entry);

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-28 14:20 ` Peter Zijlstra
@ 2020-10-28 14:44 ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2020-10-28 14:44 UTC (permalink / raw)
To: Peter Zijlstra, David Woodhouse
Cc: Davide Libenzi, Davi E. M. Arnaut, davi, linux-kernel, Ingo Molnar,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, kvm, Oleg Nesterov

On 28/10/20 15:20, Peter Zijlstra wrote:
> Shall I take the waitqueue thing and stick it in a topic branch for
> Paolo so he can then merge that and the kvm bits on top into the KVM
> tree?

Topic branches are always the best solution. :)

Paolo
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 14:39 ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
  2020-10-27 19:09 ` Peter Zijlstra
@ 2020-10-28 14:35 ` Peter Zijlstra
  2020-11-04  9:35 ` David Woodhouse
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2020-10-28 14:35 UTC (permalink / raw)
To: David Woodhouse
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Paolo Bonzini, kvm

On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This allows an exclusive wait_queue_entry to be added at the head of the
> queue, instead of the tail as normal. Thus, it gets to consume events
> first without allowing non-exclusive waiters to be woken at all.
>
> The (first) intended use is for KVM IRQFD, which currently has
> inconsistent behaviour depending on whether posted interrupts are
> available or not. If they are, KVM will bypass the eventfd completely
> and deliver interrupts directly to the appropriate vCPU. If not, events
> are delivered through the eventfd and userspace will receive them when
> polling on the eventfd.
>
> By using add_wait_queue_priority(), KVM will be able to consistently
> consume events within the kernel without accidentally exposing them
> to userspace when they're supposed to be bypassed. This, in turn, means
> that userspace doesn't have to jump through hoops to avoid listening
> on the erroneously noisy eventfd and injecting duplicate interrupts.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  include/linux/wait.h | 12 +++++++++++-
>  kernel/sched/wait.c  | 17 ++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27fb99cfeb02..fe10e8570a52 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
>  #define WQ_FLAG_BOOKMARK	0x04
>  #define WQ_FLAG_CUSTOM		0x08
>  #define WQ_FLAG_DONE		0x10
> +#define WQ_FLAG_PRIORITY	0x20
>
>  /*
>   * A single wait-queue entry structure:
> @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
>
>  extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
> +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>
>  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
> -	list_add(&wq_entry->entry, &wq_head->head);
> +	struct list_head *head = &wq_head->head;
> +	struct wait_queue_entry *wq;
> +
> +	list_for_each_entry(wq, &wq_head->head, entry) {
> +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> +			break;
> +		head = &wq->entry;
> +	}
> +	list_add(&wq_entry->entry, head);
>  }
>
>  /*
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 01f5d3020589..183cc6ae68a6 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
>
> +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> +{
> +	unsigned long flags;
> +
> +	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
> +	spin_lock_irqsave(&wq_head->lock, flags);
> +	__add_wait_queue(wq_head, wq_entry);
> +	spin_unlock_irqrestore(&wq_head->lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(add_wait_queue_priority);
> +
>  void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
>  	unsigned long flags;
> @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
>  /*
>   * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
>   * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
> - * number) then we wake all the non-exclusive tasks and one exclusive task.
> + * number) then we wake that number of exclusive tasks, and potentially all
> + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
> + * the list and any non-exclusive tasks will be woken first. A priority task
> + * may be at the head of the list, and can consume the event without any other
> + * tasks being woken.
>   *
>   * There are circumstances in which we can try to wake a task which has already
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
> --
> 2.26.2
>
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-28 14:35 ` Peter Zijlstra
@ 2020-11-04  9:35 ` David Woodhouse
  2020-11-04 11:25 ` Paolo Bonzini
  2020-11-06 10:17 ` Paolo Bonzini
  0 siblings, 2 replies; 29+ messages in thread
From: David Woodhouse @ 2020-11-04 9:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Paolo Bonzini, kvm

On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > This allows an exclusive wait_queue_entry to be added at the head of the
> > queue, instead of the tail as normal. Thus, it gets to consume events
> > first without allowing non-exclusive waiters to be woken at all.
> >
> > The (first) intended use is for KVM IRQFD, which currently has
> > inconsistent behaviour depending on whether posted interrupts are
> > available or not. If they are, KVM will bypass the eventfd completely
> > and deliver interrupts directly to the appropriate vCPU. If not, events
> > are delivered through the eventfd and userspace will receive them when
> > polling on the eventfd.
> >
> > By using add_wait_queue_priority(), KVM will be able to consistently
> > consume events within the kernel without accidentally exposing them
> > to userspace when they're supposed to be bypassed. This, in turn, means
> > that userspace doesn't have to jump through hoops to avoid listening
> > on the erroneously noisy eventfd and injecting duplicate interrupts.
> >
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks. Paolo, the conclusion was that you were going to take this set
through the KVM tree, wasn't it?
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-04  9:35 ` David Woodhouse
@ 2020-11-04 11:25 ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2020-11-04 11:25 UTC (permalink / raw)
To: David Woodhouse, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, kvm

On 04/11/20 10:35, David Woodhouse wrote:
> On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
>>> [...]
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> Thanks. Paolo, the conclusion was that you were going to take this set
> through the KVM tree, wasn't it?

Yes.

Paolo
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-04  9:35 ` David Woodhouse
  2020-11-04 11:25 ` Paolo Bonzini
@ 2020-11-06 10:17 ` Paolo Bonzini
  2020-11-06 16:32 ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2020-11-06 10:17 UTC (permalink / raw)
To: David Woodhouse, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, kvm, Alex Williamson

On 04/11/20 10:35, David Woodhouse wrote:
> On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
>>> [...]
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> Thanks. Paolo, the conclusion was that you were going to take this set
> through the KVM tree, wasn't it?

Queued, except for patch 2/3 in the eventfd series which Alex hasn't
reviewed/acked yet.

Paolo
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-06 10:17 ` Paolo Bonzini
@ 2020-11-06 16:32 ` Alex Williamson
  2020-11-06 17:18 ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2020-11-06 16:32 UTC (permalink / raw)
To: Paolo Bonzini
Cc: David Woodhouse, Peter Zijlstra, linux-kernel, Ingo Molnar,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, kvm

On Fri, 6 Nov 2020 11:17:21 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 04/11/20 10:35, David Woodhouse wrote:
> > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
> > > On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> > > > [...]
> > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > >
> > > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >
> > Thanks. Paolo, the conclusion was that you were going to take this set
> > through the KVM tree, wasn't it?
>
> Queued, except for patch 2/3 in the eventfd series which Alex hasn't
> reviewed/acked yet.

There was no vfio patch here, nor mention why it got dropped in v2
afaict. Thanks,

Alex
* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-06 16:32 ` Alex Williamson
@ 2020-11-06 17:18 ` David Woodhouse
  0 siblings, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2020-11-06 17:18 UTC (permalink / raw)
To: Alex Williamson, Paolo Bonzini
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Juri Lelli,
    Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, kvm

On 6 November 2020 16:32:00 GMT, Alex Williamson <alex.williamson@redhat.com> wrote:
>On Fri, 6 Nov 2020 11:17:21 +0100
>Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 04/11/20 10:35, David Woodhouse wrote:
>>> [...]
>>
>> Queued, except for patch 2/3 in the eventfd series which Alex hasn't
>> reviewed/acked yet.
>
>There was no vfio patch here, nor mention why it got dropped in v2
>afaict. Thanks,

That was a different (but related) series. The VFIO one is
https://patchwork.kernel.org/project/kvm/patch/20201027135523.646811-3-dwmw2@infradead.org/

Thanks.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
* [PATCH v2 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
  2020-10-27 14:39 ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
@ 2020-10-27 14:39 ` David Woodhouse
  1 sibling, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 14:39 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Paolo Bonzini, kvm

From: David Woodhouse <dwmw@amazon.co.uk>

When posted interrupts are available, the IRTE is modified to deliver
interrupts directly to the vCPU and nothing ever reaches userspace, if
it's listening on the same eventfd that feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/eventfd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 87fe94355350..09cbdf2ded70 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	struct kvm *kvm = irqfd->kvm;
 	unsigned seq;
 	int idx;
+	int ret = 0;

 	if (flags & EPOLLIN) {
 		u64 cnt;
@@ -207,6 +208,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 			    false) == -EWOULDBLOCK)
 			schedule_work(&irqfd->inject);
 		srcu_read_unlock(&kvm->irq_srcu, idx);
+		ret = 1;
 	}

 	if (flags & EPOLLHUP) {
@@ -230,7 +232,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
 	}

-	return 0;
+	return ret;
 }

 static void
@@ -239,7 +241,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(pt, struct kvm_kernel_irqfd, pt);

-	add_wait_queue(wqh, &irqfd->wait);
+	add_wait_queue_priority(wqh, &irqfd->wait);
 }

 /* Must be called under irqfds.lock */
--
2.26.2