* [PATCH 0/2] sched/wait: abort_exclusive_wait() should pass TASK_NORMAL to wake_up()
From: Oleg Nesterov @ 2016-08-26 12:44 UTC
  To: Ingo Molnar, Peter Zijlstra
  Cc: Al Viro, Bart Van Assche, Johannes Weiner, Linus Torvalds,
	Neil Brown, linux-kernel

To clarify: this can't fix the lock_page() hang reported by Bart;
__wait_on_bit_lock() is fine even if abort_exclusive_wait() is buggy.

Still, it should be fixed, I think. And perhaps we should simply kill
abort_exclusive_wait()? See 2/2.


And this reminds me... I still fail to understand why/how commit
68985633bccb60 ("sched/wait: Fix signal handling in bit wait helpers")
can help, apart from the s/return 1/return -EINTR/ change. Peter, do
you have any theory?

Oleg.

* [PATCH 1/2] sched/wait: abort_exclusive_wait() should pass TASK_NORMAL to wake_up()
From: Oleg Nesterov @ 2016-08-26 12:45 UTC
  To: Ingo Molnar, Peter Zijlstra
  Cc: Al Viro, Bart Van Assche, Johannes Weiner, Linus Torvalds,
	Neil Brown, linux-kernel

Otherwise this logic only works if mode is "compatible" with another
exclusive waiter.

If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
abort_exclusive_wait() won't wake an uninterruptible waiter.

The main user is __wait_on_bit_lock() and currently it is fine but only
because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
lock_page_interruptible() yet.
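
For reference, a sketch of why the wake-up mode matters; the state bit
values are as in include/linux/sched.h, and the mask check mirrors the
one in try_to_wake_up():

	#define TASK_INTERRUPTIBLE	1
	#define TASK_UNINTERRUPTIBLE	2
	#define TASK_WAKEKILL		128
	#define TASK_KILLABLE	(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
	#define TASK_NORMAL	(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

	/* try_to_wake_up() skips a task whose state shares no bit
	 * with the wake-up mode: */
	if (!(p->state & mode))
		goto out;	/* not woken */

So a TASK_KILLABLE sleeper is woken by a TASK_UNINTERRUPTIBLE wakeup
only because TASK_KILLABLE happens to include that bit, while an
exclusive TASK_UNINTERRUPTIBLE waiter is silently skipped by a
TASK_INTERRUPTIBLE wakeup.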

Just use TASK_NORMAL and remove the "mode" arg from abort_exclusive_wait().
Yes, this means that (say) wake_up_interruptible() can wake up the non-
interruptible waiter(s), but I think this is fine. And in fact I think
that abort_exclusive_wait() must die; see the next change.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/wait.h | 6 +++---
 kernel/sched/wait.c  | 8 +++-----
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27d7a0a..329f796 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -281,8 +281,8 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 		if (___wait_is_interruptible(state) && __int) {		\
 			__ret = __int;					\
 			if (exclusive) {				\
-				abort_exclusive_wait(&wq, &__wait,	\
-						     state, NULL);	\
+				abort_exclusive_wait(&wq, &__wait, 	\
+						     NULL);		\
 				goto __out;				\
 			}						\
 			break;						\
@@ -976,7 +976,7 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
-void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
+void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, void *key);
 long wait_woken(wait_queue_t *wait, unsigned mode, long timeout);
 int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6..2bbba01 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -259,7 +259,6 @@ EXPORT_SYMBOL(finish_wait);
  * abort_exclusive_wait - abort exclusive waiting in a queue
  * @q: waitqueue waited on
  * @wait: wait descriptor
- * @mode: runstate of the waiter to be woken
  * @key: key to identify a wait bit queue or %NULL
  *
  * Sets current thread back to running state and removes
@@ -273,8 +272,7 @@ EXPORT_SYMBOL(finish_wait);
  * aborts and is woken up concurrently and no one wakes up
  * the next waiter.
  */
-void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
-			unsigned int mode, void *key)
+void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, void *key)
 {
 	unsigned long flags;
 
@@ -283,7 +281,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, TASK_NORMAL, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
@@ -434,7 +432,7 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		ret = action(&q->key, mode);
 		if (!ret)
 			continue;
-		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
+		abort_exclusive_wait(wq, &q->wait, &q->key);
 		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
 	finish_wait(wq, &q->wait);
-- 
2.5.0

* [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Oleg Nesterov @ 2016-08-26 12:45 UTC
  To: Ingo Molnar, Peter Zijlstra
  Cc: Al Viro, Bart Van Assche, Johannes Weiner, Linus Torvalds,
	Neil Brown, linux-kernel

I think it would be nice to kill abort_exclusive_wait(). This patch
changes __wait_on_bit_lock(), but we can do a similar change to remove
it from ___wait_event().

We do not need anything tricky to avoid the race; we can just call
finish_wait() if action() fails. test_and_set_bit() implies a full mb(),
so the lockless list_empty_careful() case is fine: we cannot miss the
condition if we race with unlock_page().
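
For context, finish_wait() itself looks roughly like this; the lockless
check below is the list_empty_careful() case mentioned above:

	void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
	{
		unsigned long flags;

		__set_current_state(TASK_RUNNING);
		/*
		 * Lockless check: if the waker already removed us with
		 * list_del_init() we can return without taking q->lock.
		 * The mb() implied by the caller's test_and_set_bit()
		 * is what makes this safe against unlock_page().
		 */
		if (!list_empty_careful(&wait->task_list)) {
			spin_lock_irqsave(&q->lock, flags);
			list_del_init(&wait->task_list);
			spin_unlock_irqrestore(&q->lock, flags);
		}
	}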

I am not sure we even want to conditionalize both finish_wait()'s;
we could simply call it unconditionally, once before test_and_set():
a spurious wakeup is an unlikely case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/sched/wait.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 2bbba01..c10e904 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -423,20 +423,28 @@ int __sched
 __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 			wait_bit_action_f *action, unsigned mode)
 {
-	do {
-		int ret;
+	int ret = 0;
 
+	for (;;) {
 		prepare_to_wait_exclusive(wq, &q->wait, mode);
-		if (!test_bit(q->key.bit_nr, q->key.flags))
-			continue;
-		ret = action(&q->key, mode);
-		if (!ret)
-			continue;
-		abort_exclusive_wait(wq, &q->wait, &q->key);
-		return ret;
-	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
-	finish_wait(wq, &q->wait);
-	return 0;
+		if (test_bit(q->key.bit_nr, q->key.flags)) {
+			ret = action(&q->key, mode);
+			/*
+			 * Ensure that clear_bit() + wake_up() right after
+			 * test_and_set_bit() below can't see us; it should
+			 * wake up another exclusive waiter if we fail.
+			 */
+			if (ret)
+				finish_wait(wq, &q->wait);
+		}
+		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
+			if (!ret)
+				finish_wait(wq, &q->wait);
+			return 0;
+		} else if (ret) {
+			return ret;
+		}
+	}
 }
 EXPORT_SYMBOL(__wait_on_bit_lock);
 
-- 
2.5.0

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Oleg Nesterov @ 2016-08-26 12:47 UTC
  To: Ingo Molnar, Peter Zijlstra
  Cc: Al Viro, Bart Van Assche, Johannes Weiner, Linus Torvalds,
	Neil Brown, linux-kernel

On 08/26, Oleg Nesterov wrote:
>
> We do not need anything tricky to avoid the race; we can just call
> finish_wait() if action() fails. test_and_set_bit() implies a full mb(),
> so the lockless list_empty_careful() case is fine: we cannot miss the
> condition if we race with unlock_page().

To simplify the review, here is the code with the patch applied:


int __sched
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
			wait_bit_action_f *action, unsigned mode)
{
	int ret = 0;

	for (;;) {
		prepare_to_wait_exclusive(wq, &q->wait, mode);
		if (test_bit(q->key.bit_nr, q->key.flags)) {
			ret = action(&q->key, mode);
			/*
			 * Ensure that clear_bit() + wake_up() right after
			 * test_and_set_bit() below can't see us; it should
			 * wake up another exclusive waiter if we fail.
			 */
			if (ret)
				finish_wait(wq, &q->wait);
		}
		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
			if (!ret)
				finish_wait(wq, &q->wait);
			return 0;
		} else if (ret) {
			return ret;
		}
	}
}

* Re: [PATCH 0/2] sched/wait: abort_exclusive_wait() should pass TASK_NORMAL to wake_up()
From: Peter Zijlstra @ 2016-09-01 11:03 UTC
  To: Oleg Nesterov
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On Fri, Aug 26, 2016 at 02:44:53PM +0200, Oleg Nesterov wrote:
> And this reminds me... I still fail to understand why/how commit
> 68985633bccb60 ("sched/wait: Fix signal handling in bit wait helpers")
> can help, apart from the s/return 1/return -EINTR/ change. Peter, do
> you have any theory?

No, I was (and still am) completely puzzled by that one.

* Re: [PATCH 1/2] sched/wait: abort_exclusive_wait() should pass TASK_NORMAL to wake_up()
From: Peter Zijlstra @ 2016-09-01 11:39 UTC
  To: Oleg Nesterov
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On Fri, Aug 26, 2016 at 02:45:28PM +0200, Oleg Nesterov wrote:
> Otherwise this logic only works if mode is "compatible" with another
> exclusive waiter.
> 
> If some wq has both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE waiters,
> abort_exclusive_wait() won't wake an uninterruptible waiter.
> 
> The main user is __wait_on_bit_lock() and currently it is fine but only
> because TASK_KILLABLE includes TASK_UNINTERRUPTIBLE and we do not have
> lock_page_interruptible() yet.

So mixing INTERRUPTIBLE and UNINTERRUPTIBLE and then not using
TASK_NORMAL for wakeups is a mis-feature/abuse of waitqueues IMO.

That said, people do 'creative' things, so maybe we should add some
debug infra to detect this mis-match.

Something like the below perhaps? It will miss people using the (old)
add_wait_queue() (of which there are plenty :/) but there's nothing quick
we can do about those.

Completely untested..

---
 include/linux/wait.h | 13 ++++++++++++-
 kernel/sched/wait.c  | 27 +++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index c3ff74d764fa..e99ea720c5f9 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -39,6 +39,9 @@ struct wait_bit_queue {
 struct __wait_queue_head {
 	spinlock_t		lock;
 	struct list_head	task_list;
+#ifdef CONFIG_DEBUG_WAITQUEUE
+	unsigned int		state;
+#endif
 };
 typedef struct __wait_queue_head wait_queue_head_t;
 
@@ -48,6 +51,13 @@ struct task_struct;
  * Macros for declaration and initialisaton of the datatypes
  */
 
+#ifdef CONFIG_DEBUG_WAITQUEUE
+#define __DEBUG_WAIT_QUEUE_HEAD_INIT(name)				\
+	.state = -1,
+#else
+#define __DEBUG_WAIT_QUEUE_HEAD_INIT(name)
+#endif
+
 #define __WAITQUEUE_INITIALIZER(name, tsk) {				\
 	.private	= tsk,						\
 	.func		= default_wake_function,			\
@@ -58,7 +68,8 @@ struct task_struct;
 
 #define __WAIT_QUEUE_HEAD_INITIALIZER(name) {				\
 	.lock		= __SPIN_LOCK_UNLOCKED(name.lock),		\
-	.task_list	= { &(name).task_list, &(name).task_list } }
+	.task_list	= { &(name).task_list, &(name).task_list },	\
+	__DEBUG_WAIT_QUEUE_HEAD_INIT(name) }
 
 #define DECLARE_WAIT_QUEUE_HEAD(name) \
 	wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6a538a..cb71c56c5e76 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -16,6 +16,9 @@ void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct lock_c
 	spin_lock_init(&q->lock);
 	lockdep_set_class_and_name(&q->lock, key, name);
 	INIT_LIST_HEAD(&q->task_list);
+#ifdef CONFIG_DEBUG_WAITQUEUE
+	q->state = -1;
+#endif
 }
 
 EXPORT_SYMBOL(__init_waitqueue_head);
@@ -67,6 +70,16 @@ static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
 {
 	wait_queue_t *curr, *next;
 
+#ifdef CONFIG_DEBUG_WAITQUEUE
+	if (q->state != -1) {
+		/*
+		 * WARN if some waiter's state (e.g. UNINTERRUPTIBLE)
+		 * is not covered by the wake-up mode.
+		 */
+		WARN_ON_ONCE(q->state & ~(mode & TASK_NORMAL));
+	}
+#endif
+
 	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
 		unsigned flags = curr->flags;
 
@@ -156,6 +169,17 @@ void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
 
+static inline void prepare_debug(wait_queue_head_t *q, int state)
+{
+#ifdef CONFIG_DEBUG_WAITQUEUE
+	if (q->state == -1) {
+		q->state = state & TASK_NORMAL;
+	} else {
+		q->state |= state & TASK_NORMAL;
+	}
+#endif
+}
+
 /*
  * Note: we use "set_current_state()" _after_ the wait-queue add,
  * because we need a memory barrier there on SMP, so that any
@@ -178,6 +202,7 @@ prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	if (list_empty(&wait->task_list))
 		__add_wait_queue(q, wait);
 	set_current_state(state);
+	prepare_debug(q, state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait);
@@ -192,6 +217,7 @@ prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 	if (list_empty(&wait->task_list))
 		__add_wait_queue_tail(q, wait);
 	set_current_state(state);
+	prepare_debug(q, state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
@@ -214,6 +240,7 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 			__add_wait_queue(q, wait);
 	}
 	set_current_state(state);
+	prepare_debug(q, state);
 	spin_unlock_irqrestore(&q->lock, flags);
 
 	return 0;

* Re: [PATCH 1/2] sched/wait: abort_exclusive_wait() should pass TASK_NORMAL to wake_up()
From: Oleg Nesterov @ 2016-09-01 17:26 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On 09/01, Peter Zijlstra wrote:
>
> So mixing INTERRUPTIBLE and UNINTERRUPTIBLE and then not using
> TASK_NORMAL for wakeups is a mis-feature/abuse of waitqueues IMO.

Heh, agreed. When I was doing this fix I suddenly realized that I do
not understand why we have, say, wake_up_interruptible().

I mean, I can't imagine the "real" use-case where you actually want
to wake up only the INTERRUPTIBLE tasks and leave the UNINTERRUPTIBLE
ones sleeping. Exclusive or not.

It seems that wake_up_interruptible() is mostly used simply because
the caller knows that UNINTERRUPTIBLE waiters are not possible, which
is often the case.
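
A minimal sketch of that common pattern (hypothetical driver code, not
taken from the tree):

	static DECLARE_WAIT_QUEUE_HEAD(read_wq);
	static int data_ready;

	/* reader: the only sleepers on read_wq are TASK_INTERRUPTIBLE */
	if (wait_event_interruptible(read_wq, data_ready))
		return -ERESTARTSYS;

	/* producer: _interruptible is "safe" here only because no
	 * UNINTERRUPTIBLE waiter can exist on this queue; plain
	 * wake_up(), i.e. TASK_NORMAL, would work just as well */
	data_ready = 1;
	wake_up_interruptible(&read_wq);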

> @@ -67,6 +70,16 @@ static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>  {
>  	wait_queue_t *curr, *next;
>
> +#ifdef CONFIG_DEBUG_WAITQUEUE
> +	if (q->state != -1) {
> +		/*
> +		 * WARN if some waiter's state (e.g. UNINTERRUPTIBLE)
> +		 * is not covered by the wake-up mode.
> +		 */
> +		WARN_ON_ONCE(q->state & ~(mode & TASK_NORMAL));
> +	}
> +#endif

Yes, perhaps...

Actually, I think that TASK_NORMAL should be used even if wq mixes
UNINTERRUPTIBLE and KILLABLE waiters. The fact that TASK_KILLABLE
includes TASK_UNINTERRUPTIBLE is just an "implementation detail", even
if I do not think this will ever be changed.

Oleg.

* Re: [PATCH 1/2] sched/wait: abort_exclusive_wait() should pass TASK_NORMAL to wake_up()
From: Peter Zijlstra @ 2016-09-01 18:09 UTC
  To: Oleg Nesterov
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On Thu, Sep 01, 2016 at 07:26:58PM +0200, Oleg Nesterov wrote:
> On 09/01, Peter Zijlstra wrote:
> >
> > So mixing INTERRUPTIBLE and UNINTERRUPTIBLE and then not using
> > TASK_NORMAL for wakeups is a mis-feature/abuse of waitqueues IMO.
> 
> Heh, agreed. When I was doing this fix I suddenly realized that I do
> not understand why we have, say, wake_up_interruptible().
> 
> I mean, I can't imagine the "real" use-case where you actually want
> to wake up only the INTERRUPTIBLE tasks and leave the UNINTERRUPTIBLE
> ones sleeping. Exclusive or not.
> 
> It seems that wake_up_interruptible() is mostly used simply because
> the caller knows that UNINTERRUPTIBLE waiters are not possible, which
> is often the case.

I suspect the same.

> Actually, I think that TASK_NORMAL should be used even if wq mixes
> UNINTERRUPTIBLE and KILLABLE waiters. The fact that TASK_KILLABLE
> includes TASK_UNINTERRUPTIBLE is just an "implementation detail", even
> if I do not think this will ever be changed.

I think it's a fairly fundamental thing; it's part of the semantics of
TASK_KILLABLE. Namely, it's UNINTERRUPTIBLE, except you can interrupt it
with a fatal signal. A TASK_NORMAL wake should very much wake a
TASK_KILLABLE sleep, and for that to happen they need to share a bit,
namely UNINTERRUPTIBLE.

But I agree with you, all waitqueue wakeups _should_ simply be
TASK_NORMAL. Like I said, I don't see that it ever makes sense to play
games with it.

Now, let me try and get back to making sense of your abort abortion ;-)

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Peter Zijlstra @ 2016-09-01 19:01 UTC
  To: Oleg Nesterov
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On Fri, Aug 26, 2016 at 02:45:52PM +0200, Oleg Nesterov wrote:

> We do not need anything tricky to avoid the race,

The race being:

CPU0			CPU1			CPU2
			
			__wait_on_bit_lock()
			  bit_wait_io()
			    io_schedule()

clear_bit_unlock()
__wake_up_common(.nr_exclusive=1)
  list_for_each_entry()
    if (curr->func() && !--nr_exclusive)
      break

						signal()

			    if (signal_pending_state()) == TRUE
			      return -EINTR

And no progress because CPU1 exits without acquiring the lock and CPU0
thinks it's done because it woke someone.
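
For reference, the loop in __wake_up_common() that makes the single
wakeup "consumed" (kernel/sched/wait.c at the time):

	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
		unsigned flags = curr->flags;

		if (curr->func(curr, mode, wake_flags, key) &&
				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}

curr->func() -- autoremove_wake_function() here -- returns nonzero once
the sleeper has been woken, so nr_exclusive is spent even if the wakee
then bails out with -EINTR instead of taking the lock.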

> we can just call finish_wait() if action() fails.

That would be bit_wait*() returning -EINTR because sigpending.

Sure, you can always call that; the first thing through the loop does
prepare again, so no harm. That however does not connect to your
condition... /me puzzled

> test_and_set_bit() implies a full mb(),
> so the lockless list_empty_careful() case is fine: we cannot miss the
> condition if we race with unlock_page().

You're talking about this ordering?:

	finish_wait()			clear_bit_unlock();
	  list_empty_careful()

	/* MB implied */		smp_mb__after_atomic();
	test_and_set_bit()		wake_up_page()
					  ...
					    autoremove_wake_function()
					      list_del_init();


That could do with spelling out I feel.. :-)

>  __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
>  			wait_bit_action_f *action, unsigned mode)
>  {
> +	int ret = 0;
>  
> +	for (;;) {
>  		prepare_to_wait_exclusive(wq, &q->wait, mode);
> +		if (test_bit(q->key.bit_nr, q->key.flags)) {
> +			ret = action(&q->key, mode);
> +			/*
> +			 * Ensure that clear_bit() + wake_up() right after
> +			 * test_and_set_bit() below can't see us; it should
> +			 * wake up another exclusive waiter if we fail.
> +			 */
> +			if (ret)
> +				finish_wait(wq, &q->wait);
> +		}
> +		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {

So this is the actual difference: instead of failing the lock and
aborting on signal, we acquire the lock if possible. If it's not
possible, someone else has it, which guarantees that someone else will
do an unlock, which implies another wakeup, and life goes on.

> +			if (!ret)
> +				finish_wait(wq, &q->wait);
> +			return 0;
> +		} else if (ret) {
> +			return ret;
> +		}
> +	}
>  }

> I am not sure we even want to conditionalize both finish_wait()'s;
> we could simply call it unconditionally, once before test_and_set():
> a spurious wakeup is an unlikely case.


	ret = 0;

	for (;;) {
		prepare_to_wait_exclusive(wq, &q->wait, mode);

		if (test_bit(q->key.bit_nr, q->key.flags))
			ret = action(&q->key, mode);

		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
			/* we got the lock anyway, ignore the signal */
			ret = 0;
			break;
		}

		if (ret)
			break;
	}
	finish_wait(wq, &q->wait);

	return ret;


Would not that work too?

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Peter Zijlstra @ 2016-09-01 19:08 UTC
  To: Oleg Nesterov
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On Thu, Sep 01, 2016 at 09:01:41PM +0200, Peter Zijlstra wrote:
> > test_and_set_bit() implies a full mb(),
> > so the lockless list_empty_careful() case is fine: we cannot miss the
> > condition if we race with unlock_page().
> 
> You're talking about this ordering?:
> 
> 	finish_wait()			clear_bit_unlock();
> 	  list_empty_careful()
> 
> 	/* MB implied */		smp_mb__after_atomic();
> 	test_and_set_bit()		wake_up_page()
> 					  ...
> 					    autoremove_wake_function()
> 					      list_del_init();
> 
> 
> That could do with spelling out I feel.. :-)

This ^^^

> > I am not sure we even want to conditionalize both finish_wait()'s;
> > we could simply call it unconditionally, once before test_and_set():
> > a spurious wakeup is an unlikely case.
> 
> 
> 	ret = 0;
> 
> 	for (;;) {
> 		prepare_to_wait_exclusive(wq, &q->wait, mode);
> 
> 		if (test_bit(q->key.bit_nr, q->key.flags))
> 			ret = action(&q->key, mode);
> 
> 		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
> 			/* we got the lock anyway, ignore the signal */
> 			ret = 0;
> 			break;
> 		}
> 
> 		if (ret)
> 			break;
> 	}
> 	finish_wait(wq, &q->wait);
> 
> 	return ret;
> 
> 
> Would not that work too?

Nope, because we need to do that finish_wait() before
test_and_set_bit()..

Also the problem with doing finish_wait() unconditionally would be
destroying the FIFO order. With a bit of bad luck you'd get starvation
cases :/
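
Spelled out: prepare_to_wait_exclusive() only enqueues the waiter if it
is not already on the list,

	if (list_empty(&wait->task_list))
		__add_wait_queue_tail(q, wait);

so a waiter that stays queued across iterations keeps its position,
while one that did finish_wait() re-enters at the tail, behind any
waiters that arrived in the meantime.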

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Peter Zijlstra @ 2016-09-01 22:17 UTC
  To: Oleg Nesterov
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On Thu, Sep 01, 2016 at 09:01:41PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 02:45:52PM +0200, Oleg Nesterov wrote:
> 
> > We do not need anything tricky to avoid the race,
> 
> The race being:
> 
> CPU0			CPU1			CPU2
> 			
> 			__wait_on_bit_lock()
> 			  bit_wait_io()
> 			    io_schedule()
> 
> clear_bit_unlock()
> __wake_up_common(.nr_exclusive=1)
>   list_for_each_entry()
>     if (curr->func() && !--nr_exclusive)
>       break
> 
> 						signal()
> 
> 			    if (signal_pending_state()) == TRUE
> 			      return -EINTR
> 
> And no progress because CPU1 exits without acquiring the lock and CPU0
> thinks it's done because it woke someone.

FWIW, the way the mutex code avoids this issue is by doing the
signal_pending test while holding the q->lock; that way it's exclusive
with the wakeup.
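
A condensed sketch of that pattern (not the exact mutex code; the
try_acquire() and waiter names are placeholders):

	spin_lock(&q->lock);
	for (;;) {
		if (try_acquire(lock))		/* placeholder */
			break;
		if (signal_pending_state(state, current)) {
			/* tested under q->lock, so it cannot race with
			 * a waker that is concurrently picking us as
			 * the one exclusive task to wake */
			ret = -EINTR;
			break;
		}
		__set_current_state(state);
		spin_unlock(&q->lock);
		schedule();
		spin_lock(&q->lock);
	}
	list_del(&waiter.list);	/* leave the list while still locked */
	spin_unlock(&q->lock);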

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Oleg Nesterov @ 2016-09-02 12:06 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On 09/01, Peter Zijlstra wrote:
>
> On Fri, Aug 26, 2016 at 02:45:52PM +0200, Oleg Nesterov wrote:
>
> > We do not need anything tricky to avoid the race,
>
> The race being:
>
> CPU0			CPU1			CPU2
>
> 			__wait_on_bit_lock()
> 			  bit_wait_io()
> 			    io_schedule()
>
> clear_bit_unlock()
> __wake_up_common(.nr_exclusive=1)
>   list_for_each_entry()
>     if (curr->func() && !--nr_exclusive)
>       break
>
> 						signal()
>
> 			    if (signal_pending_state()) == TRUE
> 			      return -EINTR
>
> And no progress because CPU1 exits without acquiring the lock and CPU0
> thinks it's done because it woke someone.

Yes,

> > we can just call finish_wait() if action() fails.
>
> That would be bit_wait*() returning -EINTR because sigpending.

Hmm. Not sure I understand... Let me reply just in case, even if
I am sure you got it right.

Yes, in the likely case we are going to fail with -EINTR, but only
if the test-and-set after that fails.

> Sure, you can always call that; the first thing through the loop does
> prepare again, so no harm. That however does not connect to your
> condition... /me puzzled

If ->action() fails we will abort the loop in any case; prepare
won't be called. So in this case finish_wait() does the right thing.

> > test_and_set_bit() implies a full mb(),
> > so the lockless list_empty_careful() case is fine: we cannot miss the
> > condition if we race with unlock_page().
>
> You're talking about this ordering?:
>
> 	finish_wait()			clear_bit_unlock();
> 	  list_empty_careful()
>
> 	/* MB implied */		smp_mb__after_atomic();
> 	test_and_set_bit()		wake_up_page()
> 					  ...
> 					    autoremove_wake_function()
> 					      list_del_init();
>
>
> That could do with spelling out I feel.. :-)

Yes, yes.

> >  __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> >  			wait_bit_action_f *action, unsigned mode)
> >  {
> > +	int ret = 0;
> >
> > +	for (;;) {
> >  		prepare_to_wait_exclusive(wq, &q->wait, mode);
> > +		if (test_bit(q->key.bit_nr, q->key.flags)) {
> > +			ret = action(&q->key, mode);
> > +			/*
> > +			 * Ensure that clear_bit() + wake_up() right after
> > +			 * test_and_set_bit() below can't see us; it should
> > +			 * wake up another exclusive waiter if we fail.
> > +			 */
> > +			if (ret)
> > +				finish_wait(wq, &q->wait);
> > +		}
> > +		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
>
> So this is the actual difference: instead of failing the lock and
> aborting on signal, we acquire the lock if possible. If it's not
> possible, someone else has it, which guarantees that someone else will
> do an unlock, which implies another wakeup, and life goes on.

Yes. This way we eliminate the need for the additional wake_up.

Oleg.

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Oleg Nesterov @ 2016-09-02 12:06 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On 09/01, Peter Zijlstra wrote:
>
> > 	ret = 0;
> >
> > 	for (;;) {
> > 		prepare_to_wait_exclusive(wq, &q->wait, mode);
> >
> > 		if (test_bit(q->key.bit_nr, q->key.flags))
> > 			ret = action(&q->key, mode);
> >
> > 		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
> > 			/* we got the lock anyway, ignore the signal */
> > 			ret = 0;
> > 			break;
> > 		}
> >
> > 		if (ret)
> > 			break;
> > 	}
> > 	finish_wait(wq, &q->wait);
> >
> > 	return ret;
> >
> >
> > Would not that work too?
>
> Nope, because we need to do that finish_wait() before
> test_and_set_bit()..

Yes, I meant

	int __sched
	__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
				wait_bit_action_f *action, unsigned mode)
	{
		int ret = 0;

		for (;;) {
			prepare_to_wait_exclusive(wq, &q->wait, mode);
			if (test_bit(q->key.bit_nr, q->key.flags))
				ret = action(&q->key, mode);

			finish_wait(wq, &q->wait);

			if (!test_and_set_bit(q->key.bit_nr, q->key.flags))
				return 0;
			else if (ret)
				return ret;

		}
	}

> Also the problem with doing finish_wait() unconditionally would be
> destroying the FIFO order. With a bit of bad luck you'd get starvation
> cases :/

OK, I didn't think about that, thanks.

Oleg.

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Oleg Nesterov @ 2016-09-02 12:06 UTC
  To: Peter Zijlstra
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On 09/02, Peter Zijlstra wrote:
>
> FWIW, the way the mutex code avoids this issue is by doing the
> signal_pending test while holding the q->lock; that way it's exclusive
> with the wakeup.

And __wait_event_interruptible_locked() too.

BTW it is buggy anyway; it needs the

	-	__add_wait_queue_tail(&(wq), &__wait);
	+	if (exclusive)
	+		__add_wait_queue_tail(&(wq), &__wait);
	+	else
	+		__add_wait_queue(&(wq), &__wait);

and in fact it should use __add_wait_queue_exclusive() so that we
can remove another "if (exclusive)" but this is off-topic.

Yes, I considered this option, but to me the additional finish_wait()
looks simpler.

And, if you agree with this change, I will try to change ___wait_event()
as well and kill abort_exclusive_wait().

And in this case we certainly do not want to check the "condition" with
q->lock held, because this would mean that "condition" won't be able to
take this lock.

Oleg.

* Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in __wait_on_bit_lock()
From: Peter Zijlstra @ 2016-09-02 13:20 UTC
  To: Oleg Nesterov
  Cc: Ingo Molnar, Al Viro, Bart Van Assche, Johannes Weiner,
	Linus Torvalds, Neil Brown, linux-kernel

On Fri, Sep 02, 2016 at 02:06:43PM +0200, Oleg Nesterov wrote:

> Yes, I considered this option, but to me the additional finish_wait()
> looks simpler.

it's all relative; this stuff always makes my head hurt one way or the
other ;-)

> And, if you agree with this change, I will try to change ___wait_event()
> as well and kill abort_exclusive_wait().

Yeah, I think this'll work. Please send a new series with an 'enhanced'
changelog so that when we have to look at this again in a few
weeks/months we won't be cursing at ourselves trying to figure out how
the heck it was supposed to work.
