* Fundamental race condition in wait_event_interruptible_exclusive() ?
@ 2019-12-08 21:12 Linus Torvalds
  2019-12-09  9:18 ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Ingo Molnar
  2019-12-09 17:38 ` Fundamental race condition in wait_event_interruptible_exclusive() ? Oleg Nesterov
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2019-12-08 21:12 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi

So I'm looking at the thundering herd issue with pipes, and I have a
truly monumentally silly benchmark for it:

        #include <unistd.h>

        int main(int argc, char **argv)
        {
                int fd[2], counters[2];

                pipe(fd);
                counters[0] = 0;
                counters[1] = -1;
                write(fd[1], counters, sizeof(counters));

                /* 64 processes */
                fork(); fork(); fork(); fork(); fork(); fork();

                do {
                        int i;
                        read(fd[0], &i, sizeof(i));
                        if (i < 0)
                                continue;
                        counters[0] = i+1;
                        write(fd[1], counters, (1 + (i & 1)) * sizeof(int));
                } while (counters[0] < 1000000);
                return 0;
        }

which basically passes an integer token around (with "-1" being
"ignore" and being passed around occasionally as a test for partial
reads).

It passes that counter token around a million times, and every other
time it also writes that "-1" case, so in a perfect world, you'd get
1.5 million scheduling events, because that's how many write() ->
read() pairings there are.

Even in a perfect world there will be a few extra scheduling events
because the whole "finish the pipeline" waits for _everybody_ to see
that one million number, so the counter token gets passed around a bit
more than exactly a million times, but that's in the noise.

In reality, I get many more than that, because the write will wake up
all the readers, even if only one (or two) get any values. You can see
it easily with "perf stat" or whatever.
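
For example (with the benchmark above built as "./pipe-bench" - the name
is just for illustration), something like

        perf stat -e context-switches ./pipe-bench

should report far more than the roughly 1.5 million switches you'd expect
in the perfect world, since perf stat counts the forked children too.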

Whatever, that's not the important part. The benchmark is obviously
entirely artificial and not that interesting; it's just there to show a
point.

I do have a tentative patch to fix it - use separate reader and writer
wake queues, and exclusive waits.

But when I was looking at using exclusive waits, I noticed an issue.
It looks like wait_event_interruptible_exclusive() is fundamentally
untrustworthy.

What I'd _like_ to do is basically something like this in pipe_read():

        .. read the data, pipe is now empty ..
        __pipe_unlock();
        .. wake writers if the pipe used to be full ..

        // Wait for more data
        err = wait_event_interruptible_exclusive(pipe->rd_wait,
                                                 pipe_readable());
        if (err)
                return err;

        // Remember that we need to wake up the next reader
        // if we don't consume everything
        wake_remaining_readers = true;

        __pipe_lock();
        .. loop back and read the data ..

but it looks like the above pattern is actually buggy (the above
doesn't have the parts at the end where we actually wake the next
reader if we didn't empty the pipe, I'm just describing the waiting
part).

The reason it is buggy is that wait_event_interruptible_exclusive()
does this (inside the __wait_event() macro that it expands to):

                long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);

                if (condition)
                        break;

                if (___wait_is_interruptible(state) && __int) {
                        __ret = __int;
                        goto __out;

and the thing is, if it does that "__ret = __int" case and returns
-ERESTARTSYS, it's possible that the wakeup event has already been
consumed, because we've added ourselves as an exclusive waiter to the
queue. So it _says_ it was interrupted, not woken up, and the wait got
cancelled, but because we were an exclusive waiter, we might be the
_only_ thing that got woken up, and the wakeup basically got forgotten
- all the other exclusive waiters will remain waiting.
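
In timeline form, the interleaving I'm worried about is roughly this:

        exclusive waiter                        waker
        ----------------                        -----
                                                condition = true;
                                                wake_up();
                                                  // picks us as the one
                                                  // exclusive waiter and
                                                  // removes us from the queue
        (signal arrives)
        prepare_to_wait_event()
          // sees signal_pending(), returns -ERESTARTSYS
        __ret = -ERESTARTSYS
          // we consumed the wakeup, nobody else gets woken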

Yes, yes, I can make the pipe_read() logic ignore the return value
entirely, and always take the pipe lock, and always loop back,
and then the reading code will check for signal_pending() there, and
do the wake_remaining_readers if there is still data to be read, and
just do this instead:

        wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable());

        // Remember that we need to wake up the next reader
        // if we don't consume everything
        wake_remaining_readers = true;

        __pipe_lock();
        .. loop back and read the data ..

but that's kind of sad. And the basic point is that the return value
from wait_event_interruptible_exclusive() seems to not really be
reliable. You can't really use it - even if it says you got
interrupted, you still have to go back and check the condition and do
the work, and only do interruptibility handling after that.

I dunno. Maybe this is fundamental to the interface? We do not have a
lot of users that mix "interruptible" and "exclusive". In fact, I see
only two cases that care about the return value, but at least the fuse
use does seem to have exactly this problem with potentially lost
wakeups because the ERESTARTSYS case ends up eating a wakeup without
doing anything about it.

Looks like the other user - the USB gadget HID thing - also has this
buggy pattern of believing the return value, and losing a wakeup
event.
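
In both cases the pattern boils down to something like this (a sketch of
the shape of the code, not the actual fuse/gadget source):

        err = wait_event_interruptible_exclusive(wq, condition);
        if (err)
                return err;     /* if a wakeup was consumed here, it's lost */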

Adding Miklos and Felipe to the cc just because of the fuse and USB
gadget potential issues, but this is mainly a scheduler maintainer
question.

It's possible that I've misread the wait-event code. PeterZ?

                 Linus


* [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-08 21:12 Fundamental race condition in wait_event_interruptible_exclusive() ? Linus Torvalds
@ 2019-12-09  9:18 ` Ingo Molnar
  2019-12-09 10:27   ` Ingo Molnar
  2019-12-09 12:08   ` Oleg Nesterov
  2019-12-09 17:38 ` Fundamental race condition in wait_event_interruptible_exclusive() ? Oleg Nesterov
  1 sibling, 2 replies; 13+ messages in thread
From: Ingo Molnar @ 2019-12-09  9:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Miklos Szeredi,
	Linux Kernel Mailing List, Felipe Balbi, Oleg Nesterov


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The reason it is buggy is that wait_event_interruptible_exclusive()
> does this (inside the __wait_event() macro that it expands to):
> 
>                 long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);
> 
>                 if (condition)
>                         break;
>                 if (___wait_is_interruptible(state) && __int) {
>                         __ret = __int;
>                         goto __out;
> 
> and the thing is, if it does that "__ret = __int" case and returns
> -ERESTARTSYS, it's possible that the wakeup event has already been
> consumed, because we've added ourselves as an exclusive waiter to the
> queue. So it _says_ it was interrupted, not woken up, and the wait got
> cancelled, but because we were an exclusive waiter, we might be the
> _only_ thing that got woken up, and the wakeup basically got forgotten
> - all the other exclusive waiters will remain waiting.

So the place that detects interruption is prepare_to_wait_event():

long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
        unsigned long flags;
        long ret = 0;

        spin_lock_irqsave(&wq_head->lock, flags);
        if (signal_pending_state(state, current)) {
                /*
                 * Exclusive waiter must not fail if it was selected by wakeup,
                 * it should "consume" the condition we were waiting for.
                 *
                 * The caller will recheck the condition and return success if
                 * we were already woken up, we can not miss the event because
                 * wakeup locks/unlocks the same wq_head->lock.
                 *
                 * But we need to ensure that set-condition + wakeup after that
                 * can't see us, it should wake up another exclusive waiter if
                 * we fail.
                 */
                list_del_init(&wq_entry->entry);
                ret = -ERESTARTSYS;
        } else {
                if (list_empty(&wq_entry->entry)) {
                        if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
                                __add_wait_queue_entry_tail(wq_head, wq_entry);
                        else
                                __add_wait_queue(wq_head, wq_entry);
                }
                set_current_state(state);
        }
        spin_unlock_irqrestore(&wq_head->lock, flags);

        return ret;
}

I think we can indeed lose an exclusive event here, despite the comment 
that argues that we shouldn't: if we were already removed from the list 
then list_del_init() does nothing and loses the exclusive event AFAICS.

This logic was introduced in these commits 3 years ago:

  b1ea06a90f52: ("sched/wait: Avoid abort_exclusive_wait() in ___wait_event()")
  eaf9ef52241b: ("sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()")

Before that commit we simply did this:

 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
        unsigned long flags;
-
-       if (signal_pending_state(state, current))
-               return -ERESTARTSYS;

That was safe in the sense that it didn't touch the waitqueue in case of 
interruption. Then it used abort_exclusive_wait(), which got removed in 
eaf9ef52241b, to pass on any leftover wakeups:

-void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, void *key)
-{
-       unsigned long flags;
-
-       __set_current_state(TASK_RUNNING);
-       spin_lock_irqsave(&q->lock, flags);
-       if (!list_empty(&wait->task_list))
-               list_del_init(&wait->task_list);
-       else if (waitqueue_active(q))
-               __wake_up_locked_key(q, TASK_NORMAL, key);
-       spin_unlock_irqrestore(&q->lock, flags);
-}
-EXPORT_SYMBOL(abort_exclusive_wait);

Note how the wakeup is passed along to another exclusive waiter (if any) 
via the __wake_up_locked_key() call.

Oleg?

> I dunno. Maybe this is fundamental to the interface? We do not have a 
> lot of users that mix "interruptible" and "exclusive". In fact, I see 
> only two cases that care about the return value, but at least the fuse 
> use does seem to have exactly this problem with potentially lost 
> wakeups because the ERESTARTSYS case ends up eating a wakeup without 
> doing anything about it.

I don't think it's fundamental to the interface.

In fact I'd argue that it's fundamental to the interface to *not* lose 
exclusive events.

> Looks like the other user - the USB gadget HID thing - also has this
> buggy pattern of believing the return value, and losing a wakeup
> event.
> 
> Adding Miklos and Felipe to the cc just because of the fuse and USB
> gadget potential issues, but this is mainly a scheduler maintainer
> question.
> 
> It's possible that I've misread the wait-event code. PeterZ?

I think your analysis is correct, and I think the statistical evidence is 
also overwhelming that the interface is buggy: if we include your new 
usecase then 3 out of 3 attempts got it wrong. :-)

So I'd argue we should fix the interface and allow the 'simple' use of 
reliable interruptible-exclusive waitqueues - i.e. reintroduce the 
abort_exclusive_wait() logic?

Thanks,

	Ingo


* Re: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-09  9:18 ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Ingo Molnar
@ 2019-12-09 10:27   ` Ingo Molnar
  2019-12-09 13:00     ` Oleg Nesterov
  2019-12-09 18:06     ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Linus Torvalds
  2019-12-09 12:08   ` Oleg Nesterov
  1 sibling, 2 replies; 13+ messages in thread
From: Ingo Molnar @ 2019-12-09 10:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Miklos Szeredi,
	Linux Kernel Mailing List, Felipe Balbi, Oleg Nesterov


* Ingo Molnar <mingo@kernel.org> wrote:

> I think your analysis is correct, and I think the statistical evidence 
> is also overwhelming that the interface is buggy: if we include your 
> new usecase then 3 out of 3 attempts got it wrong. :-)
> 
> So I'd argue we should fix the interface and allow the 'simple' use of 
> reliable interruptible-exclusive waitqueues - i.e. reintroduce the 
> abort_exclusive_wait() logic?

Forgot to attach the patch...

Totally untested, but shows the principle: I believe interrupted 
exclusive waits should not auto-cleanup in prepare_to_wait_event().

This slightly complicates the error path of open-coded 
prepare_to_wait_event() users, but there's only one in BTRFS, so ...

Also note that there's now a split of the cleanup interface, 
finish_wait() vs. finish_wait_exclusive().

In principle this could be the same API, but then we'd have to pass in a 
new flag, which is a runtime overhead - by splitting the API we can do 
this at build time.

Any consumed exclusive event is handled in finish_wait_exclusive() now:

+               } else {
+                       /* We got removed from the waitqueue already, wake up the next exclusive waiter (if any): */
+                       if (interrupted && waitqueue_active(wq_head))
+                               __wake_up_locked_key(wq_head, TASK_NORMAL, NULL);

I tried to maintain the lockless fastpath in the usual 
finish_wait_exclusive() path.

I pushed the cleanup into finish_wait() to reduce the inlining footprint 
of wait_event*() - the most readable variant of this event loop would be 
to open code the signal check in the wait.h macro itself.

I also added a WARN_ON() to check an assumption that I think is 
always true, and which could be used to remove the 
___wait_is_interruptible(state) check.

BTW., I actually like it that we now always go through finish_wait(), 
even in the interrupted case, makes the main loop easier to read IMHO.

Completely, utterly untested though, and might be terminally broken.

Thanks,

	Ingo

Subject: [PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable

---
 fs/btrfs/space-info.c    |  1 +
 include/linux/wait.h     | 16 +++++++----
 include/linux/wait_bit.h | 13 +++++----
 kernel/sched/wait.c      | 72 +++++++++++++++++++++++++++++++-----------------
 4 files changed, 67 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index f09aa6ee9113..39b8d044b6c5 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -899,6 +899,7 @@ static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 			 * despite getting an error, resulting in a space leak
 			 * (bytes_may_use counter of our space_info).
 			 */
+			finish_wait(&ticket->wait, &wait);
 			list_del_init(&ticket->list);
 			ticket->error = -EINTR;
 			break;
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 3283c8d02137..a86a6e0148b1 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -261,26 +261,31 @@ extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 
 #define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)		\
 ({										\
-	__label__ __out;							\
 	struct wait_queue_entry __wq_entry;					\
 	long __ret = ret;	/* explicit shadow */				\
+	long __int = 0;								\
 										\
 	init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);	\
 	for (;;) {								\
-		long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
+		__int = prepare_to_wait_event(&wq_head, &__wq_entry, state);	\
 										\
 		if (condition)							\
 			break;							\
 										\
+		WARN_ON_ONCE(__int && !___wait_is_interruptible(state));	\
+										\
 		if (___wait_is_interruptible(state) && __int) {			\
 			__ret = __int;						\
-			goto __out;						\
+			break;							\
 		}								\
 										\
 		cmd;								\
 	}									\
-	finish_wait(&wq_head, &__wq_entry);					\
-__out:	__ret;									\
+	if (exclusive)								\
+		finish_wait_exclusive(&wq_head, &__wq_entry, __int);		\
+	else									\
+		finish_wait(&wq_head, &__wq_entry);				\
+	__ret;									\
 })
 
 #define __wait_event(wq_head, condition)					\
@@ -1127,6 +1132,7 @@ void prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *w
 void prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
 long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
 void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+void finish_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, long interrupted);
 long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout);
 int woken_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
 int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 7dec36aecbd9..a431fc3349a2 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -241,15 +241,15 @@ extern wait_queue_head_t *__var_waitqueue(void *p);
 
 #define ___wait_var_event(var, condition, state, exclusive, ret, cmd)	\
 ({									\
-	__label__ __out;						\
 	struct wait_queue_head *__wq_head = __var_waitqueue(var);	\
 	struct wait_bit_queue_entry __wbq_entry;			\
 	long __ret = ret; /* explicit shadow */				\
+	long __int = 0;							\
 									\
 	init_wait_var_entry(&__wbq_entry, var,				\
 			    exclusive ? WQ_FLAG_EXCLUSIVE : 0);		\
 	for (;;) {							\
-		long __int = prepare_to_wait_event(__wq_head,		\
+		__int = prepare_to_wait_event(__wq_head,		\
 						   &__wbq_entry.wq_entry, \
 						   state);		\
 		if (condition)						\
@@ -257,13 +257,16 @@ extern wait_queue_head_t *__var_waitqueue(void *p);
 									\
 		if (___wait_is_interruptible(state) && __int) {		\
 			__ret = __int;					\
-			goto __out;					\
+			break;						\
 		}							\
 									\
 		cmd;							\
 	}								\
-	finish_wait(__wq_head, &__wbq_entry.wq_entry);			\
-__out:	__ret;								\
+	if (exclusive)							\
+		finish_wait_exclusive(__wq_head, &__wbq_entry.wq_entry, __int); \
+	else								\
+		finish_wait(__wq_head, &__wbq_entry.wq_entry);		\
+	__ret;								\
 })
 
 #define __wait_var_event(var, condition)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index ba059fbfc53a..ed4318fe4e32 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -275,36 +275,20 @@ EXPORT_SYMBOL(init_wait_entry);
 long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
 {
 	unsigned long flags;
-	long ret = 0;
+
+	if (signal_pending_state(state, current))
+		return -ERESTARTSYS;
 
 	spin_lock_irqsave(&wq_head->lock, flags);
-	if (signal_pending_state(state, current)) {
-		/*
-		 * Exclusive waiter must not fail if it was selected by wakeup,
-		 * it should "consume" the condition we were waiting for.
-		 *
-		 * The caller will recheck the condition and return success if
-		 * we were already woken up, we can not miss the event because
-		 * wakeup locks/unlocks the same wq_head->lock.
-		 *
-		 * But we need to ensure that set-condition + wakeup after that
-		 * can't see us, it should wake up another exclusive waiter if
-		 * we fail.
-		 */
-		list_del_init(&wq_entry->entry);
-		ret = -ERESTARTSYS;
-	} else {
-		if (list_empty(&wq_entry->entry)) {
-			if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
-				__add_wait_queue_entry_tail(wq_head, wq_entry);
-			else
-				__add_wait_queue(wq_head, wq_entry);
-		}
-		set_current_state(state);
+	if (list_empty(&wq_entry->entry)) {
+		if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
+			__add_wait_queue_entry_tail(wq_head, wq_entry);
+		else
+			__add_wait_queue(wq_head, wq_entry);
 	}
 	spin_unlock_irqrestore(&wq_head->lock, flags);
 
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL(prepare_to_wait_event);
 
@@ -384,6 +368,44 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
 }
 EXPORT_SYMBOL(finish_wait);
 
+/**
+ * finish_wait_exclusive - clean up after waiting in a queue as an exclusive waiter
+ * @wq_head: waitqueue waited on
+ * @wq_entry: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ *
+ * It also makes sure that if an exclusive wait was interrupted, no events are lost.
+ */
+void finish_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, long interrupted)
+{
+	unsigned long flags;
+
+	__set_current_state(TASK_RUNNING);
+
+	if (interrupted) {
+		spin_lock_irqsave(&wq_head->lock, flags);
+		if (!list_empty(&wq_entry->entry)) {
+			list_del_init(&wq_entry->entry);
+		} else {
+			/* We got removed from the waitqueue already, wake up the next exclusive waiter (if any): */
+			if (interrupted && waitqueue_active(wq_head))
+				__wake_up_locked_key(wq_head, TASK_NORMAL, NULL);
+		}
+		spin_unlock_irqrestore(&wq_head->lock, flags);
+	} else {
+		/* See finish_wait() about why this lockless check is safe: */
+		if (!list_empty_careful(&wq_entry->entry)) {
+			spin_lock_irqsave(&wq_head->lock, flags);
+			list_del_init(&wq_entry->entry);
+			spin_unlock_irqrestore(&wq_head->lock, flags);
+		}
+	}
+}
+EXPORT_SYMBOL(finish_wait_exclusive);
+
 int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
 {
 	int ret = default_wake_function(wq_entry, mode, sync, key);


* Re: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-09  9:18 ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Ingo Molnar
  2019-12-09 10:27   ` Ingo Molnar
@ 2019-12-09 12:08   ` Oleg Nesterov
  2019-12-10  7:29     ` Ingo Molnar
  1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2019-12-09 12:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi

On 12/09, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > The reason it is buggy is that wait_event_interruptible_exclusive()
> > does this (inside the __wait_event() macro that it expands to):
> >
> >                 long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);
> >
> >                 if (condition)
> >                         break;
> >                 if (___wait_is_interruptible(state) && __int) {
> >                         __ret = __int;
> >                         goto __out;
> >
> > and the thing is, if it does that "__ret = __int" case and returns
> > -ERESTARTSYS,

But note that it checks "condition" after prepare_to_wait_event(); if it is
true, then ___wait_is_interruptible() won't even be called.

> > it's possible that the wakeup event has already been
> > consumed, because we've added ourselves as an exclusive waiter to the
> > queue. So it _says_ it was interrupted, not woken up, and the wait got
> > cancelled, but because we were an exclusive waiter, we might be the
> > _only_ thing that got woken up, and the wakeup basically got forgotten
> > - all the other exclusive waiters will remain waiting.
>
> So the place that detects interruption is prepare_to_wait_event():

Yes,

> long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
> {
>         unsigned long flags;
>         long ret = 0;
>
>         spin_lock_irqsave(&wq_head->lock, flags);
>         if (signal_pending_state(state, current)) {
>                 /*
>                  * Exclusive waiter must not fail if it was selected by wakeup,
>                  * it should "consume" the condition we were waiting for.
>                  *
>                  * The caller will recheck the condition and return success if
>                  * we were already woken up, we can not miss the event because
>                  * wakeup locks/unlocks the same wq_head->lock.
>                  *
>                  * But we need to ensure that set-condition + wakeup after that
>                  * can't see us, it should wake up another exclusive waiter if
>                  * we fail.
>                  */
>                 list_del_init(&wq_entry->entry);
>                 ret = -ERESTARTSYS;

...

> I think we can indeed lose an exclusive event here, despite the comment
> that argues that we shouldn't: if we were already removed from the list

If we were already removed from the list and the condition is true, we
can't miss it; ret = -ERESTARTSYS won't be used. This is what this part
of the comment above

	 * The caller will recheck the condition and return success if
	 * we were already woken up, we can not miss the event because
	 * wakeup locks/unlocks the same wq_head->lock.

tries to explain.

> then list_del_init() does nothing and loses the exclusive event AFAICS.

list_del_init() ensures that wake_up() can't pick this task after
prepare_to_wait_event() returns.

IOW. Suppose that ___wait_event() races with

	condition = true;
	wake_up();

if wake_up() happens before prepare_to_wait_event(), __wait_event() will
see condition == true; the -ERESTARTSYS returned by prepare_to_wait_event()
has no effect.

If wake_up() comes after prepare_to_wait_event(), the task was already
removed from the list, so another exclusive waiter (if any) will be woken
up. In this case __wait_event() can return success or -ERESTARTSYS; both
are correct.
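
Restating the two orderings above as timelines:

	wake_up() first:

		waker:  condition = true; wake_up();	// removes us from the list
		waiter: prepare_to_wait_event();	// -> -ERESTARTSYS
		waiter: if (condition) break;		// true -> return success

	prepare_to_wait_event() first:

		waiter: prepare_to_wait_event();	// list_del_init() + -ERESTARTSYS
		waker:  condition = true; wake_up();	// we are off the list, so this
							// wakes another exclusive waiter
		waiter: returns success or -ERESTARTSYS, both correct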

No?

Oleg.



* Re: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-09 10:27   ` Ingo Molnar
@ 2019-12-09 13:00     ` Oleg Nesterov
  2019-12-10  7:21       ` Ingo Molnar
  2019-12-09 18:06     ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2019-12-09 13:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi

On 12/09, Ingo Molnar wrote:
>
> Any consumed exclusive event is handled in finish_wait_exclusive() now:
>
> +               } else {
> +                       /* We got removed from the waitqueue already, wake up the next exclusive waiter (if any): */
> +                       if (interrupted && waitqueue_active(wq_head))
> +                               __wake_up_locked_key(wq_head, TASK_NORMAL, NULL);

See my previous email, I don't think we need this...

But if we do this, then __wake_up_locked_key(key => NULL) doesn't look right.
It should use the same "key" which was passed to the __wake_up(key) that
removed us from the list.

Currently this doesn't really matter, the only user of prepare_to_wait_event()
which relies on the "keyed" wakeup is ___wait_var_event() and it doesn't have
"exclusive" waiters, but still.

Hmm, and it seems that init_wait_var_entry() is buggy? Again, currently this
doesn't matter, but don't we need the trivial fix below?
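
We don't have wait_var_event_exclusive() today, but if we ever grow one it
would presumably look something like this (hypothetical, just to show why
the missing .flags matters):

	#define wait_var_event_exclusive(var, condition)		\
		___wait_var_event(var, condition, TASK_UNINTERRUPTIBLE,	\
				  1, 0, schedule())

and without the fix below, the WQ_FLAG_EXCLUSIVE that ___wait_var_event()
passes to init_wait_var_entry() would be silently dropped.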

Oleg.

--- x/kernel/sched/wait_bit.c
+++ x/kernel/sched/wait_bit.c
@@ -179,6 +179,7 @@ void init_wait_var_entry(struct wait_bit
 			.bit_nr = -1,
 		},
 		.wq_entry = {
+			.flags	 = flags,
 			.private = current,
 			.func	 = var_wake_function,
 			.entry	 = LIST_HEAD_INIT(wbq_entry->wq_entry.entry),



* Re: Fundamental race condition in wait_event_interruptible_exclusive() ?
  2019-12-08 21:12 Fundamental race condition in wait_event_interruptible_exclusive() ? Linus Torvalds
  2019-12-09  9:18 ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Ingo Molnar
@ 2019-12-09 17:38 ` Oleg Nesterov
  2019-12-09 18:03   ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2019-12-09 17:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi

I have already replied to Ingo, but in case I was not clear...

On 12/08, Linus Torvalds wrote:
>
> The reason it is buggy is that wait_event_interruptible_exclusive()
> does this (inside the __wait_event() macro that it expands to):
>
>                 long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);
>
>                 if (condition)
>                         break;
>
>                 if (___wait_is_interruptible(state) && __int) {
>                         __ret = __int;
>                         goto __out;
>
> and the thing is, if it does that "__ret = __int" case and returns
> -ERESTARTSYS, it's possible that the wakeup event has already been
> consumed,

Afaics, no.

> because we've added ourselves as an exclusive waiter to the
> queue. So it _says_ it was interrupted, not woken up, and the wait got
> cancelled, but because we were an exclusive waiter, we might be the
> _only_ thing that got woken up,

And that is why ___wait_event() always checks the condition after
prepare_to_wait_event(), whatever it returns.

And if it actually does "__ret = __int" and returns -ERESTARTSYS, then
this task was already removed from the list, so we should not worry about
the case when wake_up() comes after prepare_to_wait_event().

> And the basic point is that the return value
> from wait_event_interruptible_exclusive() seems to not really be
> reliable. You can't really use it -

see above ...

> even if it says you got
> interrupted, you still have to go back and check the condition and do
> the work, and only do interruptibility handling after that.

This is exactly what ___wait_event() does. Even if prepare_to_wait_event()
says you got interrupted, it still checks the condition and returns success
if it is true.

Oleg.



* Re: Fundamental race condition in wait_event_interruptible_exclusive() ?
  2019-12-09 17:38 ` Fundamental race condition in wait_event_interruptible_exclusive() ? Oleg Nesterov
@ 2019-12-09 18:03   ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2019-12-09 18:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi

On Mon, Dec 9, 2019 at 9:38 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> > because we've added ourselves as an exclusive writer to the
> > queue. So it _says_ it was interrupted, not woken up, and the wait got
> > cancelled, but because we were an exclusive waiter, we might be the
> > _only_ thing that got woken up,
>
> And that is why ___wait_event() always checks the condition after
> prepare_to_wait_event(), whatever it returns.

Ack. I misread the code, and you're right - if we've been woken up,
the condition is supposed to be true, and we never actually return
-ERESTARTSYS.

So the situation of "got woken but returned error, and lost wakeup"
won't actually ever happen.

Good.

So yeah, I can do what I wanted to do, and it should all work.

I never even tested it, because I was getting fairly anal about
possible races in the pipe code by then.

               Linus


* Re: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-09 10:27   ` Ingo Molnar
  2019-12-09 13:00     ` Oleg Nesterov
@ 2019-12-09 18:06     ` Linus Torvalds
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2019-12-09 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Miklos Szeredi,
	Linux Kernel Mailing List, Felipe Balbi, Oleg Nesterov

On Mon, Dec 9, 2019 at 2:28 AM Ingo Molnar <mingo@kernel.org> wrote:
>
> Totally untested, but shows the principle: I believe interrupted
> exclusive waits should not auto-cleanup in prepare_to_wait_event().

Oleg convinced me I was wrong, and that there is no bug, so I don't
think we need anything after all.

Yes, there's a "race" between wakeup and returning the error, but
since any correct code already relies on the condition having become
true before the wakeup (and it needs to be a stable condition -
otherwise the exclusive waiters wouldn't work in the first place), the
code that returns the error will never get reached because of the

>                 if (condition)
>                         break;

part of the ___wait_event() macro.

            Linus


* Re: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-09 13:00     ` Oleg Nesterov
@ 2019-12-10  7:21       ` Ingo Molnar
  2019-12-10 19:19         ` [PATCH] sched/wait: fix ___wait_var_event(exclusive) Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2019-12-10  7:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi


* Oleg Nesterov <oleg@redhat.com> wrote:

> On 12/09, Ingo Molnar wrote:
> >
> > Any consumed exclusive event is handled in finish_wait_exclusive() now:
> >
> > +               } else {
> > +                       /* We got removed from the waitqueue already, wake up the next exclusive waiter (if any): */
> > +                       if (interrupted && waitqueue_active(wq_head))
> > +                               __wake_up_locked_key(wq_head, TASK_NORMAL, NULL);
> 
> See my previous email, I don't think we need this...
> 
> But if we do this, then __wake_up_locked_key(key => NULL) doesn't look right.
> It should use the same "key" which was passed to __wake_up(key) which removed
> us from list.
> 
> Currently this doesn't really matter, the only user of prepare_to_wait_event()
> which relies on the "keyed" wakeup is ___wait_var_event() and it doesn't have
> "exclusive" waiters, but still.
> 
> Hmm. and it seems that init_wait_var_entry() is buggy? Again, currently this
> doesn't matter, but don't we need the trivial fix below?
> 
> Oleg.
> 
> --- x/kernel/sched/wait_bit.c
> +++ x/kernel/sched/wait_bit.c
> @@ -179,6 +179,7 @@ void init_wait_var_entry(struct wait_bit
>  			.bit_nr = -1,
>  		},
>  		.wq_entry = {
> +			.flags	 = flags,
>  			.private = current,
>  			.func	 = var_wake_function,
>  			.entry	 = LIST_HEAD_INIT(wbq_entry->wq_entry.entry),

Yeah, agreed.

Thanks,

	Ingo


* Re: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-09 12:08   ` Oleg Nesterov
@ 2019-12-10  7:29     ` Ingo Molnar
  2019-12-10 17:30       ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2019-12-10  7:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi


* Oleg Nesterov <oleg@redhat.com> wrote:

> > long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
> > {
> >         unsigned long flags;
> >         long ret = 0;
> >
> >         spin_lock_irqsave(&wq_head->lock, flags);
> >         if (signal_pending_state(state, current)) {
> >                 /*
> >                  * Exclusive waiter must not fail if it was selected by wakeup,
> >                  * it should "consume" the condition we were waiting for.
> >                  *
> >                  * The caller will recheck the condition and return success if
> >                  * we were already woken up, we can not miss the event because
> >                  * wakeup locks/unlocks the same wq_head->lock.
> >                  *
> >                  * But we need to ensure that set-condition + wakeup after that
> >                  * can't see us, it should wake up another exclusive waiter if
> >                  * we fail.
> >                  */
> >                 list_del_init(&wq_entry->entry);
> >                 ret = -ERESTARTSYS;
> 
> ...
> 
> > I think we can indeed lose an exclusive event here, despite the comment
> > that argues that we shouldn't: if we were already removed from the list
> 
> If we were already removed from the list and condition is true, we can't
> miss it, ret = -ERESTARTSYS won't be used. This is what this part of the
> comment above
> 
> 	 * The caller will recheck the condition and return success if
> 	 * we were already woken up, we can not miss the event because
> 	 * wakeup locks/unlocks the same wq_head->lock.
> 
> tries to explain.

Yeah, indeed - it assumes that the condition is stable from wakeup to 
wakee running - which as Linus said it must be, because otherwise 
exclusive waiters couldn't reliably exit the wait loop.

So there's no bug. How about the clarifying comment below?

Thanks,

	Ingo

 kernel/sched/wait.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index ba059fbfc53a..6783bac00b5c 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -290,6 +290,11 @@ long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_en
 		 * But we need to ensure that set-condition + wakeup after that
 		 * can't see us, it should wake up another exclusive waiter if
 		 * we fail.
+		 *
+		 * In other words, if an exclusive waiter got here, then the
+		 * waitqueue condition is and stays true and we are guaranteed
+		 * to exit the waitqueue loop and will ignore the -ERESTARTSYS
+		 * and return success.
 		 */
 		list_del_init(&wq_entry->entry);
 		ret = -ERESTARTSYS;


* Re: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
  2019-12-10  7:29     ` Ingo Molnar
@ 2019-12-10 17:30       ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2019-12-10 17:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi

On 12/10, Ingo Molnar wrote:
>
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -290,6 +290,11 @@ long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_en
>  		 * But we need to ensure that set-condition + wakeup after that
>  		 * can't see us, it should wake up another exclusive waiter if
>  		 * we fail.
> +		 *
> +		 * In other words, if an exclusive waiter got here, then the
> +		 * waitqueue condition is and stays true and we are guaranteed
> +		 * to exit the waitqueue loop and will ignore the -ERESTARTSYS
> +		 * and return success.
>  		 */
>  		list_del_init(&wq_entry->entry);
>  		ret = -ERESTARTSYS;

Agreed, this makes it clearer... but at the same time technically this is
not 100% correct, or perhaps I misread this comment.

We are not guaranteed to return success even if condition == T and we were
woken up as an exclusive waiter; another waiter can consume the condition.
But this is fine. Say,

	long LOCK;
	struct wait_queue_head WQ;

	int lock(void)
	{
		/* wait until this task flips LOCK from 0 to 1 */
		return wait_event_interruptible_exclusive(WQ, xchg(&LOCK, 1) == 0);
	}

	void unlock(void)
	{
		xchg(&LOCK, 0);
		wake_up(&WQ);	/* wakes (at most) one exclusive waiter */
	}

A woken exclusive waiter can return -ERESTARTSYS if it races with another
lock(), or with another sleeping waiter woken up by a signal; this is
fine.
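
For example, a possible timeline with the lock()/unlock() above:

        waiter A (sleeping in lock())           task B
        -----------------------------           ------
                                                unlock(): LOCK = 0; wake_up();
        woken as the exclusive waiter,
        removed from WQ; a signal arrives
                                                lock(): xchg(&LOCK, 1) == 0,
                                                        B takes the lock
        prepare_to_wait_event()
          // signal pending -> -ERESTARTSYS
        condition check: xchg(&LOCK, 1) == 0
          fails, B holds the lock
        lock() returns -ERESTARTSYS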

So maybe

		 * In other words, if an exclusive waiter got here and the
		 * waitqueue condition is and stays true, then we are guaranteed
		 * to exit the waitqueue loop and will ignore the -ERESTARTSYS
		 * and return success.

is more accurate?

Oleg.



* [PATCH] sched/wait: fix ___wait_var_event(exclusive)
  2019-12-10  7:21       ` Ingo Molnar
@ 2019-12-10 19:19         ` Oleg Nesterov
  2019-12-17 12:39           ` [tip: sched/core] " tip-bot2 for Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2019-12-10 19:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Miklos Szeredi, Linux Kernel Mailing List, Felipe Balbi

init_wait_var_entry() forgets to initialize wq_entry->flags.

Currently not a problem, we don't have wait_var_event_exclusive().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/wait_bit.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 45eba18..02ce292 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -179,6 +179,7 @@ void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int
 			.bit_nr = -1,
 		},
 		.wq_entry = {
+			.flags	 = flags,
 			.private = current,
 			.func	 = var_wake_function,
 			.entry	 = LIST_HEAD_INIT(wbq_entry->wq_entry.entry),
-- 
2.5.0




* [tip: sched/core] sched/wait: fix ___wait_var_event(exclusive)
  2019-12-10 19:19         ` [PATCH] sched/wait: fix ___wait_var_event(exclusive) Oleg Nesterov
@ 2019-12-17 12:39           ` tip-bot2 for Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: tip-bot2 for Oleg Nesterov @ 2019-12-17 12:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Oleg Nesterov, Peter Zijlstra (Intel),
	Vincent Guittot, Ingo Molnar, Felipe Balbi, Linus Torvalds,
	Miklos Szeredi, Juri Lelli, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     cde65194502778665c1b52afc5722cf7dbfaa399
Gitweb:        https://git.kernel.org/tip/cde65194502778665c1b52afc5722cf7dbfaa399
Author:        Oleg Nesterov <oleg@redhat.com>
AuthorDate:    Tue, 10 Dec 2019 20:19:03 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 17 Dec 2019 13:32:50 +01:00

sched/wait: fix ___wait_var_event(exclusive)

init_wait_var_entry() forgets to initialize wq_entry->flags.

Currently not a problem, we don't have wait_var_event_exclusive().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20191210191902.GB14449@redhat.com
---
 kernel/sched/wait_bit.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 45eba18..02ce292 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -179,6 +179,7 @@ void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int 
 			.bit_nr = -1,
 		},
 		.wq_entry = {
+			.flags	 = flags,
 			.private = current,
 			.func	 = var_wake_function,
 			.entry	 = LIST_HEAD_INIT(wbq_entry->wq_entry.entry),

