* wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
@ 2020-06-24 16:11 Oleg Nesterov
  2020-06-24 16:20 ` Oleg Nesterov
  2020-06-24 16:22 ` Linus Torvalds
  0 siblings, 2 replies; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-24 16:11 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, Peter Zijlstra, Mel Gorman,
	Jan Kara, Davidlohr Bueso, Andi Kleen
  Cc: Lukas Czerner, linux-kernel

Suppose that 2 threads T1 and T2 call __lock_page_killable() and sleep in
wait_on_page_bit_common() -> io_schedule().

T1 is killed, it does test_and_set_bit_lock() but the page is still locked.

unlock_page() calls __wake_up_common(nr_exclusive = 1), this wakes T1 up.
T2 is not woken.

T1 checks signal_pending_state() and returns EINTR.

T2 will sleep until another thread does lock/unlock?
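
For context, the wait loop in question currently looks roughly like this
(trimmed from mm/filemap.c):

	for (;;) {
		spin_lock_irq(&q->lock);
		if (likely(list_empty(&wait->entry))) {
			__add_wait_queue_entry_tail(q, wait);
			SetPageWaiters(page);
		}
		set_current_state(state);
		spin_unlock_irq(&q->lock);

		bit_is_set = test_bit(bit_nr, &page->flags);
		if (behavior == DROP)
			put_page(page);

		if (likely(bit_is_set))
			io_schedule();

		if (behavior == EXCLUSIVE) {
			if (!test_and_set_bit_lock(bit_nr, &page->flags))
				break;
		} else if (behavior == SHARED) {
			if (!test_bit(bit_nr, &page->flags))
				break;
		}

		if (signal_pending_state(state, current)) {
			ret = -EINTR;
			break;
		}
		...
	}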

----------------------------------------------------------------------------
I noticed this by accident, I am hunting for another / unrelated bug. I did
git-blame and iiuc the commit a8b169afbf06a ("Avoid page waitqueue race leaving
possible page locker waiting") tried to fix the problem, but see above, I don't
understand how it can help.

Don't we need something like below, or am I totally confused?

Oleg.

--- x/mm/filemap.c
+++ x/mm/filemap.c
@@ -1131,14 +1131,23 @@ static inline int wait_on_page_bit_commo
 	wait_page.bit_nr = bit_nr;
 
 	for (;;) {
+		int intr = 0;
+
 		spin_lock_irq(&q->lock);
 
-		if (likely(list_empty(&wait->entry))) {
-			__add_wait_queue_entry_tail(q, wait);
-			SetPageWaiters(page);
-		}
+		// see the comment prepare_to_wait_event()
+		if (signal_pending_state(state, current)) {
+			list_del_init(&wait->entry);
+			intr = 1;
+		} else {
+			if (likely(list_empty(&wait->entry))) {
+				// HMM. head/tail depending on EXCLUSIVE ???
+				__add_wait_queue_entry_tail(q, wait);
+				SetPageWaiters(page);
+			}
 
-		set_current_state(state);
+			set_current_state(state);
+		}
 
 		spin_unlock_irq(&q->lock);
 
@@ -1146,7 +1155,7 @@ static inline int wait_on_page_bit_commo
 		if (behavior == DROP)
 			put_page(page);
 
-		if (likely(bit_is_set))
+		if (!intr && likely(bit_is_set))
 			io_schedule();
 
 		if (behavior == EXCLUSIVE) {
@@ -1157,7 +1166,7 @@ static inline int wait_on_page_bit_commo
 				break;
 		}
 
-		if (signal_pending_state(state, current)) {
+		if (intr) {
 			ret = -EINTR;
 			break;
 		}


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-24 16:11 wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup? Oleg Nesterov
@ 2020-06-24 16:20 ` Oleg Nesterov
  2020-06-24 16:36   ` Linus Torvalds
  2020-06-24 16:22 ` Linus Torvalds
  1 sibling, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-24 16:20 UTC (permalink / raw)
  To: Linus Torvalds, Nick Piggin, Peter Zijlstra, Mel Gorman,
	Jan Kara, Davidlohr Bueso, Andi Kleen
  Cc: Lukas Czerner, linux-kernel

On 06/24, Oleg Nesterov wrote:
> Suppose that 2 threads T1 and T2 call __lock_page_killable() and sleep in
> wait_on_page_bit_common() -> io_schedule().
>
> T1 is killed, it does test_and_set_bit_lock() but the page is still locked.
>
> unlock_page() calls __wake_up_common(nr_exclusive = 1), this wakes T1 up.
> T2 is not woken.

Ah, please ignore me, sorry for noise.

If T1 is killed it is TASK_RUNNING, try_to_wake_up() should return 0.

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-24 16:11 wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup? Oleg Nesterov
  2020-06-24 16:20 ` Oleg Nesterov
@ 2020-06-24 16:22 ` Linus Torvalds
  2020-06-24 16:43   ` Oleg Nesterov
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2020-06-24 16:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Nick Piggin, Peter Zijlstra, Mel Gorman, Jan Kara,
	Davidlohr Bueso, Andi Kleen, Lukas Czerner,
	Linux Kernel Mailing List

On Wed, Jun 24, 2020 at 9:11 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> T1 checks signal_pending_state() and returns EINTR.
>
> T2 will sleep until another thread does lock/unlock ?

Yeah, this is a nasty pattern with any exclusive wait, we've had this
bug before where an exclusive wait exits without taking the event or
waking up the next waiter.

That said, I'm not entirely happy with your patch.

The real problem, I feel, is that

                if (likely(bit_is_set))
                        io_schedule();

anti-pattern. Without that, we wouldn't have the bug.

Normally, we'd be TASK_RUNNING in this sequence, but because we might
skip io_schedule(), we can still be in a "sleeping" state here and be
"woken up" between that bit setting and the signal check.

                 Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-24 16:20 ` Oleg Nesterov
@ 2020-06-24 16:36   ` Linus Torvalds
  2020-06-26 15:43     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2020-06-24 16:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Nick Piggin, Peter Zijlstra, Mel Gorman, Jan Kara,
	Davidlohr Bueso, Andi Kleen, Lukas Czerner,
	Linux Kernel Mailing List

On Wed, Jun 24, 2020 at 9:20 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> If T1 is killed it is TASK_RUNNING, try_to_wake_up() should return 0.

Hmm. I already acknowledged your bug, but yeah, this is subtle.

But I think the bug still exists.

So the requirement is:

 - bit_is_set returns false so we don't call io_schedule: we're still
TASK_KILLABLE

 - somebody else gets the lock, so the test_and_set_bit_lock() fails

 - that somebody else releases the lock almost immediately, and wakes
us up because we're still on the wait-queue (and still TASK_KILLABLE,
not TASK_RUNNING)

 - another party sends us a SIGKILL

 - we see the signal_pending_state() and exit

 - we've now been woken up, but didn't wake anybody else up, and the
lock is released but there may be waiters who came in at the same time
and never saw the wakeup.
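
As a thread interleaving (my sketch of the above; T1 is us, T2 the other
locker, T3 another exclusive waiter that also queued up):

	T1: bit_is_set is false           -> skips io_schedule(), still TASK_KILLABLE
	T2: takes the lock                -> T1's test_and_set_bit_lock() fails
	T2: unlock_page()                 -> the single exclusive wakeup is spent on T1
	  : SIGKILL is sent to T1
	T1: signal_pending_state() true   -> exits with -EINTR, wakes nobody
	T3: queued as exclusive waiter    -> the lock is free, but T3 never gets woken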

I think this is basically impossible to hit in practice, but it does
look like a real bug.

Maybe I'm missing something. This code is subtle.

But as mentioned, I _think_ the real problem is that "don't do
io_schedule()" optimization, because that's the thing that means that
we can miss the wakeup.

If we only had a single "test_and_set_bit_lock()" thing, we'd be ok.
Either we got the lock or we didn't, and if we didn't we'd schedule
and re-try, and we'd never have the race between testing signals and
bits while we're "sleeping" and can be woken up and try_to_wake_up()
thought we took it.

                Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-24 16:22 ` Linus Torvalds
@ 2020-06-24 16:43   ` Oleg Nesterov
  0 siblings, 0 replies; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-24 16:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Peter Zijlstra, Mel Gorman, Jan Kara,
	Davidlohr Bueso, Andi Kleen, Lukas Czerner,
	Linux Kernel Mailing List

On 06/24, Linus Torvalds wrote:
>
> That said, I'm not entirely happy with your patch.

Neither am I,

> The real problem, I feel, is that
>
>                 if (likely(bit_is_set))
>                         io_schedule();
>
> anti-pattern. Without that, we wouldn't have the bug.
>
> Normally, we'd be TASK_RUNNING in this sequence, but because we might
> skip io_schedule(), we can still be in a "sleeping" state here and be
> "woken up" between that bit setting and the signal check.

Ah.

And now it _seems_ to me that even if io_schedule() is called,
try_to_wake_up() can "falsely" succeed if signal_pending_state() is true,
even if __schedule() won't block in this case.

But I am sure I missed something else. I spent too much time reading the
random code paths today, I'll return tomorrow.

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-24 16:36   ` Linus Torvalds
@ 2020-06-26 15:43     ` Peter Zijlstra
  2020-06-28  5:39       ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2020-06-26 15:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Nick Piggin, Mel Gorman, Jan Kara,
	Davidlohr Bueso, Andi Kleen, Lukas Czerner,
	Linux Kernel Mailing List

On Wed, Jun 24, 2020 at 09:36:54AM -0700, Linus Torvalds wrote:
> But I think the bug still exists.
> 
> So the requirement is:
> 
>  - bit_is_set returns false so we don't call io_schedule: we're still
> TASK_KILLABLE
> 
>  - somebody else gets the lock, so the test_and_set_bit_lock() fails
> 
>  - that somebody else releases the lock almost immediately, and wakes
> us up because we're still on the wait-queue (and still TASK_KILLABLE,
> not TASK_RUNNING)
> 
>  - another party sends us a SIGKILL
> 
>  - we see the signal_pending_state() and exit
> 
>  - we've now been woken up, but didn't wake anybody else up, and the
> lock is released but there may be waiters who came in at the same time
> and never saw the wakeup.
> 
> I think this is basically impossible to hit in practice, but it does
> look like a real bug.
> 
> Maybe I'm missing something. This code is subtle.

I ended up with something like the below.. but it is too warm to think
properly.

I don't particularly like WQ_FLAG_PAGEWAITERS, but I liked open-coding
all that even less.

---
 include/linux/wait.h | 16 +++++++++++--
 kernel/sched/wait.c  | 12 ++++++++++
 mm/filemap.c         | 68 ++++++++++++++++++----------------------------------
 3 files changed, 49 insertions(+), 47 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 898c890fc153..760d0a6e10e8 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -11,7 +11,7 @@
 #include <asm/current.h>
 #include <uapi/linux/wait.h>
 
-typedef struct wait_queue_entry wait_queue_entry_t;
+struct wait_queue_entry;
 
 typedef int (*wait_queue_func_t)(struct wait_queue_entry *wq_entry, unsigned mode, int flags, void *key);
 int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int flags, void *key);
@@ -20,7 +20,8 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_EXCLUSIVE	0x01
 #define WQ_FLAG_WOKEN		0x02
 #define WQ_FLAG_BOOKMARK	0x04
-#define WQ_FLAG_CUSTOM		0x08
+#define WQ_FLAG_PAGEWAITERS	0x08
+#define WQ_FLAG_CUSTOM		0x10
 
 /*
  * A single wait-queue entry structure:
@@ -31,6 +32,17 @@ struct wait_queue_entry {
 	wait_queue_func_t	func;
 	struct list_head	entry;
 };
+typedef struct wait_queue_entry wait_queue_entry_t;
+
+
+/*
+ * See mm/filemap.c:wait_on_page_bit_common()
+ */
+struct wait_page_queue {
+	wait_queue_entry_t	wait;
+	struct page		*page;
+	int			bit_nr;
+};
 
 struct wait_queue_head {
 	spinlock_t		lock;
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index ba059fbfc53a..9e814cdf704f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -299,6 +299,12 @@ long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_en
 				__add_wait_queue_entry_tail(wq_head, wq_entry);
 			else
 				__add_wait_queue(wq_head, wq_entry);
+
+			if (wq_entry->flags & WQ_FLAG_PAGEWAITERS) {
+				struct wait_page_queue *wpq =
+					container_of(wq_entry, struct wait_page_queue, wait);
+				SetPageWaiters(wpq->page);
+			}
 		}
 		set_current_state(state);
 	}
@@ -379,6 +385,12 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
 	if (!list_empty_careful(&wq_entry->entry)) {
 		spin_lock_irqsave(&wq_head->lock, flags);
 		list_del_init(&wq_entry->entry);
+
+		if (wq_entry->flags & WQ_FLAG_PAGEWAITERS) {
+			struct wait_page_queue *wpq =
+				container_of(wq_entry, struct wait_page_queue, wait);
+			ClearPageWaiters(wpq->page);
+		}
 		spin_unlock_irqrestore(&wq_head->lock, flags);
 	}
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index f0ae9a6308cb..6d03d51f2d92 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -994,12 +994,6 @@ struct wait_page_key {
 	int page_match;
 };
 
-struct wait_page_queue {
-	struct page *page;
-	int bit_nr;
-	wait_queue_entry_t wait;
-};
-
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
 	struct wait_page_key *key = arg;
@@ -1089,9 +1083,9 @@ static void wake_up_page(struct page *page, int bit)
 }
 
 /*
- * A choice of three behaviors for wait_on_page_bit_common():
+ * A choice of three behaviours for wait_on_page_bit_common():
  */
-enum behavior {
+enum behaviour {
 	EXCLUSIVE,	/* Hold ref to page and take the bit when woken, like
 			 * __lock_page() waiting on then setting PG_locked.
 			 */
@@ -1103,12 +1097,12 @@ enum behavior {
 			 */
 };
 
-static inline int wait_on_page_bit_common(wait_queue_head_t *q,
-	struct page *page, int bit_nr, int state, enum behavior behavior)
+static __always_inline int
+wait_on_page_bit_common(wait_queue_head_t *q, struct page *page, int bit_nr,
+			int state, enum behaviour behaviour)
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
-	bool bit_is_set;
 	bool thrashing = false;
 	bool delayacct = false;
 	unsigned long pflags;
@@ -1125,71 +1119,55 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	}
 
 	init_wait(wait);
-	wait->flags = behavior == EXCLUSIVE ? WQ_FLAG_EXCLUSIVE : 0;
+	wait->flags = WQ_FLAG_PAGEWAITERS |
+		      WQ_FLAG_EXCLUSIVE * (behaviour == EXCLUSIVE);
 	wait->func = wake_page_function;
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
 
+	/* ___wait_event() */
 	for (;;) {
-		spin_lock_irq(&q->lock);
-
-		if (likely(list_empty(&wait->entry))) {
-			__add_wait_queue_entry_tail(q, wait);
-			SetPageWaiters(page);
-		}
-
-		set_current_state(state);
-
-		spin_unlock_irq(&q->lock);
-
-		bit_is_set = test_bit(bit_nr, &page->flags);
-		if (behavior == DROP)
-			put_page(page);
+		int intr = prepare_to_wait_event(q, wait, state);
 
-		if (likely(bit_is_set))
-			io_schedule();
-
-		if (behavior == EXCLUSIVE) {
+		/* @cond */
+		if (behaviour == EXCLUSIVE) {
 			if (!test_and_set_bit_lock(bit_nr, &page->flags))
 				break;
-		} else if (behavior == SHARED) {
+		} else if (behaviour == SHARED) {
 			if (!test_bit(bit_nr, &page->flags))
 				break;
+		} else if (behaviour == DROP) {
+			put_page(page);
 		}
 
-		if (signal_pending_state(state, current)) {
-			ret = -EINTR;
-			break;
+		if (___wait_is_interruptible(state) && intr) {
+			ret = intr;
+			goto out;
 		}
 
-		if (behavior == DROP) {
+		/* @cmd */
+		io_schedule();
+
+		if (behaviour == DROP) {
 			/*
 			 * We can no longer safely access page->flags:
 			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
 			 * there is a risk of waiting forever on a page reused
 			 * for something that keeps it locked indefinitely.
-			 * But best check for -EINTR above before breaking.
 			 */
+			wait->flags &= ~WQ_FLAG_PAGEWAITERS;
 			break;
 		}
 	}
-
 	finish_wait(q, wait);
 
+out:
 	if (thrashing) {
 		if (delayacct)
 			delayacct_thrashing_end();
 		psi_memstall_leave(&pflags);
 	}
 
-	/*
-	 * A signal could leave PageWaiters set. Clearing it here if
-	 * !waitqueue_active would be possible (by open-coding finish_wait),
-	 * but still fail to catch it in the case of wait hash collision. We
-	 * already can fail to clear wait hash collision cases, so don't
-	 * bother with signals either.
-	 */
-
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-26 15:43     ` Peter Zijlstra
@ 2020-06-28  5:39       ` Linus Torvalds
  2020-06-28 13:18         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Linus Torvalds @ 2020-06-28  5:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Nick Piggin, Mel Gorman, Jan Kara,
	Davidlohr Bueso, Andi Kleen, Lukas Czerner,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2441 bytes --]

On Fri, Jun 26, 2020 at 8:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> I ended up with something like the below.. but it is too warm to think
> properly.
>
> I don't particularly like WQ_FLAG_PAGEWAITERS, but I liked open-coding
> all that even less.

Ugh.

I think I have a much simpler approach, actually.

So the *problem* is purely around that

                if (behavior == EXCLUSIVE) {
                        if (!test_and_set_bit_lock(bit_nr, &page->flags))
                                break;
                } else ..

and in particular it is purely *after* that test_and_set_bit_lock()
case. We have two cases:

 (a) *If* we get the lock there, we're good, and all done, and we have
the lock. We don't care about any other wakeups, because they'll be
stale anyway (the thing that released the lock that we just took) and
because we got the lock, no other exclusive waiters should be woken up
anyway (and we will in turn wake up any waiters when we release it)

 (b) we did *not* get the lock, because somebody else got it and is
about to immediately unlock again. And that _future_ wakeup that they
send might get lost because it might end up targeting us (but we might
just exit and not care).

Agreed?

So the only case that really matters is that we didn't get the lock,
but we must *not* be woken up afterwards.

So how about the attached trivial two-liner? We solve the problem by
simply marking ourselves TASK_RUNNING, which means that we won't be
counted as an exclusive wakeup.

Ok, so the "one" line to do that is that is actually two lines:

        __set_current_state(TASK_RUNNING);
        smp_mb__before_atomic();

and there's four lines of comments to go with it, but it really is
very simple: if we do that before we do the test_and_set_bit_lock(),
no wakeups will be lost, because we won't be sleeping for that wakeup.

I'm not entirely happy about that "smp_mb__before_atomic()". I think
it's right in practice that test_and_set_bit_lock() (when it actually
does a write) has at LEAST atomic semantics, so I think it's good.
But it's not pretty.

But I don't want to use a heavy

        set_current_state(TASK_RUNNING);
        if (!test_and_set_bit_lock(bit_nr, &page->flags)) ..

sequence, because at least on x86, that test_and_set_bit_lock()
already has a memory barrier in it, so the extra memory barrier from
set_current_state() is all kinds of pointless.

Hmm?

                    Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 580 bytes --]

 mm/filemap.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index f0ae9a6308cb..ee73d33465b6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1150,6 +1150,12 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 			io_schedule();
 
 		if (behavior == EXCLUSIVE) {
+			/*
+			 * Make sure we don't get any exclusive wakeups
+			 * after this point!
+			 */
+			__set_current_state(TASK_RUNNING);
+			smp_mb__before_atomic();
 			if (!test_and_set_bit_lock(bit_nr, &page->flags))
 				break;
 		} else if (behavior == SHARED) {

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-28  5:39       ` Linus Torvalds
@ 2020-06-28 13:18         ` Peter Zijlstra
  2020-06-29  3:28         ` Nicholas Piggin
  2020-06-29 15:13         ` Oleg Nesterov
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2020-06-28 13:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Nick Piggin, Mel Gorman, Jan Kara,
	Davidlohr Bueso, Andi Kleen, Lukas Czerner,
	Linux Kernel Mailing List

On Sat, Jun 27, 2020 at 10:39:20PM -0700, Linus Torvalds wrote:

> Ugh.

Heh, I was afraid of that..

> So how about the attached trivial two-liner? We solve the problem by
> simply marking ourselves TASK_RUNNING, which means that we won't be
> counted as an exclusive wakeup.
> 
> Ok, so the "one" line to do that is actually two lines:
> 
>         __set_current_state(TASK_RUNNING);
>         smp_mb__before_atomic();
> 
> and there's four lines of comments to go with it, but it really is
> very simple: if we do that before we do the test_and_set_bit_lock(),
> no wakeups will be lost, because we won't be sleeping for that wakeup.
> 
> I'm not entirely happy about that "smp_mb__before_atomic()". I think
> it's right in practice that test_and_set_bit_lock() (when it actually
> does a write) has at LEAST atomic semantics, so I think it's good.
> But it's not pretty.

Hurm... yes. I think I agree this solves it. However... the wait loop is
'weird'. It isn't shaped like our other loops.

On the one hand, who cares, that's just my OCD, on the other hand, it
does mean you have to think harder every time you look at this thing.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-28  5:39       ` Linus Torvalds
  2020-06-28 13:18         ` Peter Zijlstra
@ 2020-06-29  3:28         ` Nicholas Piggin
  2020-06-29 13:16           ` Nicholas Piggin
  2020-06-29 14:02           ` Oleg Nesterov
  2020-06-29 15:13         ` Oleg Nesterov
  2 siblings, 2 replies; 23+ messages in thread
From: Nicholas Piggin @ 2020-06-29  3:28 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Oleg Nesterov

Excerpts from Linus Torvalds's message of June 28, 2020 3:39 pm:
> On Fri, Jun 26, 2020 at 8:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> I ended up with something like the below.. but it is too warm to think
>> properly.
>>
>> I don't particularly like WQ_FLAG_PAGEWAITERS, but I liked open-coding
>> all that even less.
> 
> Ugh.
> 
> I think I have a much simpler approach, actually.
> 
> So the *problem* is purely around that
> 
>                 if (behavior == EXCLUSIVE) {
>                         if (!test_and_set_bit_lock(bit_nr, &page->flags))
>                                 break;
>                 } else ..
> 
> and in particular it is purely *after* that test_and_set_bit_lock()
> case. We have two cases:
> 
>  (a) *If* we get the lock there, we're good, and all done, and we have
> the lock. We don't care about any other wakeups, because they'll be
> stale anyway (the thing that released the lock that we just took) and
> because we got the lock, no other exclusive waiters should be woken up
> anyway (and we will in turn wake up any waiters when we release it)
> 
>  (b) we did *not* get the lock, because somebody else got it and is
> about to immediately unlock again. And that _future_ wakeup that they
> send might get lost because it might end up targeting us (but we might
> just exit and not care).
> 
> Agreed?
> 
> So the only case that really matters is that we didn't get the lock,
> but we must *not* be woken up afterwards.
> 
> So how about the attached trivial two-liner? We solve the problem by
> simply marking ourselves TASK_RUNNING, which means that we won't be
> counted as an exclusive wakeup.
> 
> Ok, so the "one" line to do that is actually two lines:
> 
>         __set_current_state(TASK_RUNNING);
>         smp_mb__before_atomic();
> 
> and there's four lines of comments to go with it, but it really is
> very simple: if we do that before we do the test_and_set_bit_lock(),
> no wakeups will be lost, because we won't be sleeping for that wakeup.
> 
> I'm not entirely happy about that "smp_mb__before_atomic()". I think
> it's right in practice that test_and_set_bit_lock() (when it actually
> does a write) has at LEAST atomic semantics, so I think it's good.
> But it's not pretty.
> 
> But I don't want to use a heavy
> 
>         set_current_state(TASK_RUNNING);
>         if (!test_and_set_bit_lock(bit_nr, &page->flags)) ..
> 
> sequence, because at least on x86, that test_and_set_bit_lock()
> already has a memory barrier in it, so the extra memory barrier from
> set_current_state() is all kinds of pointless.
> 
> Hmm?

Wow good catch. Does bit_is_set even have to be true? If it's woken due
to a signal, it may still be on the waitqueue, right? I think your patch
works, but it looks like a pretty standard variant of the "don't lose
wakeups" bug.

prepare_to_wait_event() has a pretty good pattern (and comment), I would
favour using that (test the signal when inserting on the waitqueue).

@@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
        for (;;) {
                spin_lock_irq(&q->lock);
 
+               if (signal_pending_state(state, current)) {
+                       /* Must not lose an exclusive wake up, see
+                        * prepare_to_wait_event comment */
+                       list_del_init(&wait->entry);
+                       spin_unlock_irq(&q->lock);
+                       ret = -EINTR;
+                       break;
+               }
+
                if (likely(list_empty(&wait->entry))) {
                        __add_wait_queue_entry_tail(q, wait);
                        SetPageWaiters(page);
@@ -1157,11 +1166,6 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
                                break;
                }
 
-               if (signal_pending_state(state, current)) {
-                       ret = -EINTR;
-                       break;
-               }
-
                if (behavior == DROP) {
                        /*
                         * We can no longer safely access page->flags:

- mutex_lock_common does the signal check under its wait queue lock.

- rwsem looks like it does it backward and checks if it was woken if it
got a signal and tries to handle it that way (hopefully okay, I prefer
the standard pattern).

- futex unqueues and tests for wakeup before testing signal.

Etc. And it's not even exclusive to signals of course, those are just 
the typical asynchronous thing that would wake us without removing from
the wait queue. Bit of a shame there is no standard pattern to follow
though.

I wonder how you could improve that. finish_wait could WARN_ON an
exclusive waiter being found on the queue?

@@ -377,6 +377,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
         *    the list).
         */
        if (!list_empty_careful(&wq_entry->entry)) {
+               WARN_ON(wq_entry->flags & WQ_FLAG_EXCLUSIVE);
                spin_lock_irqsave(&wq_head->lock, flags);
                list_del_init(&wq_entry->entry);
                spin_unlock_irqrestore(&wq_head->lock, flags);

That doesn't catch a limited count of wake ups; maybe if you passed a
success value to finish_wait, it could warn in case a failure has
WQ_FLAG_WOKEN. That doesn't help things that invent their own waitqueues
mind you. I wonder if we could do some kind of generic annotations for
anyone implementing wait queues to call, which could have debug checks
implemented?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-29  3:28         ` Nicholas Piggin
@ 2020-06-29 13:16           ` Nicholas Piggin
  2020-06-29 16:36             ` Linus Torvalds
  2020-06-29 14:02           ` Oleg Nesterov
  1 sibling, 1 reply; 23+ messages in thread
From: Nicholas Piggin @ 2020-06-29 13:16 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Oleg Nesterov

Excerpts from Nicholas Piggin's message of June 29, 2020 1:28 pm:
> Excerpts from Linus Torvalds's message of June 28, 2020 3:39 pm:
>> On Fri, Jun 26, 2020 at 8:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> I ended up with something like the below.. but it is too warm to think
>>> properly.
>>>
>>> I don't particularly like WQ_FLAG_PAGEWAITERS, but I liked open-coding
>>> all that even less.
>> 
>> Ugh.
>> 
>> I think I have a much simpler approach, actually.
>> 
>> So the *problem* is purely around that
>> 
>>                 if (behavior == EXCLUSIVE) {
>>                         if (!test_and_set_bit_lock(bit_nr, &page->flags))
>>                                 break;
>>                 } else ..
>> 
>> and in particular it is purely *after* that test_and_set_bit_lock()
>> case. We have two cases:
>> 
>>  (a) *If* we get the lock there, we're good, and all done, and we have
>> the lock. We don't care about any other wakeups, because they'll be
>> stale anyway (the thing that released the lock that we just took) and
>> because we got the lock, no other exclusive waiters should be woken up
>> anyway (and we will in turn wake up any waiters when we release it)
>> 
>>  (b) we did *not* get the lock, because somebody else got it and is
>> about to immediately unlock again. And that _future_ wakeup that they
>> send might get lost because it might end up targeting us (but we might
>> just exit and not care).
>> 
>> Agreed?
>> 
>> So the only case that really matters is that we didn't get the lock,
>> but we must *not* be woken up afterwards.
>> 
>> So how about the attached trivial two-liner? We solve the problem by
>> simply marking ourselves TASK_RUNNING, which means that we won't be
>> counted as an exclusive wakeup.
>> 
>> Ok, so the "one" line to do that is actually two lines:
>> 
>>         __set_current_state(TASK_RUNNING);
>>         smp_mb__before_atomic();
>> 
>> and there's four lines of comments to go with it, but it really is
>> very simple: if we do that before we do the test_and_set_bit_lock(),
>> no wakeups will be lost, because we won't be sleeping for that wakeup.
>> 
>> I'm not entirely happy about that "smp_mb__before_atomic()". I think
>> it's right in practice that test_and_set_bit_lock() (when it actually
>> does a write) has at LEAST atomic semantics, so I think it's good.
>> But it's not pretty.
>> 
>> But I don't want to use a heavy
>> 
>>         set_current_state(TASK_RUNNING);
>>         if (!test_and_set_bit_lock(bit_nr, &page->flags)) ..
>> 
>> sequence, because at least on x86, that test_and_set_bit_lock()
>> already has a memory barrier in it, so the extra memory barrier from
>> set_current_state() is all kinds of pointless.
>> 
>> Hmm?
> 
> Wow good catch. Does bit_is_set even have to be true? If it's woken due 
> to a signal, it may still be on the waitqueue right?

No, ignore this part (which, as you explained well, was just a thinko,
and your patch of course would not have worked if this was the case):
the exclusive wake up doesn't get lost if schedule() was called, because
the state goes back to running regardless of what woke it.

I would still prefer it to be changed to the below fix, though.

> works, but it looks like a pretty standard variant of "don't lose
> wakeups" bug.
> 
> prepare_to_wait_event() has a pretty good pattern (and comment), I would
> favour using that (test the signal when inserting on the waitqueue).
> 
> @@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>         for (;;) {
>                 spin_lock_irq(&q->lock);
>  
> +               if (signal_pending_state(state, current)) {
> +                       /* Must not lose an exclusive wake up, see
> +                        * prepare_to_wait_event comment */
> +                       list_del_init(&wait->entry);
> +                       spin_unlock_irq(&q->lock);
> +                       ret = -EINTR;
> +                       break;
> +               }
> +
>                 if (likely(list_empty(&wait->entry))) {
>                         __add_wait_queue_entry_tail(q, wait);
>                         SetPageWaiters(page);
> @@ -1157,11 +1166,6 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>                                 break;
>                 }
>  
> -               if (signal_pending_state(state, current)) {
> -                       ret = -EINTR;
> -                       break;
> -               }
> -
>                 if (behavior == DROP) {
>                         /*
>                          * We can no longer safely access page->flags:
> 
> - mutex_lock_common does the signal check under its wait queue lock.
> 
> - rwsem looks like it does it backward and checks if it was woken if it
> got a signal and tries to handle it that way (hopefully okay, I prefer
> the standard pattern).
> 
> - futex unqueues and tests for wakeup before testing signal.
> 
> Etc. And it's not even exclusive to signals of course, those are just 
> the typical asynchronous thing that would wake us without removing from
> the wait queue. Bit of a shame there is no standard pattern to follow
> though.
> 
> I wonder how you could improve that. finish_wait could WARN_ON an
> exclusive waiter being found on the queue?
> 
> @@ -377,6 +377,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
>          *    the list).
>          */
>         if (!list_empty_careful(&wq_entry->entry)) {
> +               WARN_ON(wq_entry->flags & WQ_FLAG_EXCLUSIVE);
>                 spin_lock_irqsave(&wq_head->lock, flags);
>                 list_del_init(&wq_entry->entry);
>                 spin_unlock_irqrestore(&wq_head->lock, flags);
> 
> That doesn't catch a limited count of wake ups, maybe if you passed in 
> success value to finish_wait, it could warn in case a failure has 
> WQ_FLAG_WOKEN. That doesn't help things that invent their own waitqueues
> mind you. I wonder if we could do some kind of generic annotations for
> anyone implementing wait queues to call, which could have debug checks
> implemented?
> 
> Thanks,
> Nick
> 
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-29  3:28         ` Nicholas Piggin
  2020-06-29 13:16           ` Nicholas Piggin
@ 2020-06-29 14:02           ` Oleg Nesterov
  2020-06-30  2:08             ` Nicholas Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-29 14:02 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, Linus Torvalds, Andi Kleen, Davidlohr Bueso,
	Jan Kara, Lukas Czerner, Linux Kernel Mailing List, Mel Gorman

On 06/29, Nicholas Piggin wrote:
>
> prepare_to_wait_event() has a pretty good pattern (and comment), I would
> favour using that (test the signal when inserting on the waitqueue).
>
> @@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>         for (;;) {
>                 spin_lock_irq(&q->lock);
>
> +               if (signal_pending_state(state, current)) {
> +                       /* Must not lose an exclusive wake up, see
> +                        * prepare_to_wait_event comment */
> +                       list_del_init(&wait->entry);
> +                       spin_unlock_irq(&q->lock);
> +                       ret = -EINTR;

Basically this is what my patch in the 1st email does. But note that we can't
just set "ret = -EINTR" here, we will need to clear "ret" if test_and_set_bit()
below succeeds. That is why I used another "int intr" variable.

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-28  5:39       ` Linus Torvalds
  2020-06-28 13:18         ` Peter Zijlstra
  2020-06-29  3:28         ` Nicholas Piggin
@ 2020-06-29 15:13         ` Oleg Nesterov
  2 siblings, 0 replies; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-29 15:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Nick Piggin, Mel Gorman, Jan Kara,
	Davidlohr Bueso, Andi Kleen, Lukas Czerner,
	Linux Kernel Mailing List

On 06/27, Linus Torvalds wrote:
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1150,6 +1150,12 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>  			io_schedule();
>  
>  		if (behavior == EXCLUSIVE) {
> +			/*
> +			 * Make sure we don't get any exclusive wakeups
> +			 * after this point!
> +			 */
> +			__set_current_state(TASK_RUNNING);
> +			smp_mb__before_atomic();
>  			if (!test_and_set_bit_lock(bit_nr, &page->flags))
>  				break;
>  		} else if (behavior == SHARED) {

FWIW,

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-29 13:16           ` Nicholas Piggin
@ 2020-06-29 16:36             ` Linus Torvalds
  2020-06-30  2:12               ` Nicholas Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2020-06-29 16:36 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Peter Zijlstra, Andi Kleen, Davidlohr Bueso, Jan Kara,
	Lukas Czerner, Linux Kernel Mailing List, Mel Gorman,
	Oleg Nesterov

On Mon, Jun 29, 2020 at 6:16 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> No, ignore this part (which you explained well it was just a thinko,
> and your patch of course would not have worked if this was the case):
> the exclusive wake up doesn't get lost if schedule() was called because
> state goes back to running regardless of what woke it.

Right.

The normal pattern for a wait-loop is

 - add yourself to the wait-queue and set yourself "sleeping" with a
memory barrier.

 - test for the condition, exit if ok

 - go to sleep

and that pattern doesn't have the race.
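
In code, the usual shape is something like (generic sketch, not the
filemap code):

	for (;;) {
		prepare_to_wait(&wq_head, &wait, TASK_KILLABLE);	/* queue + set "sleeping", with barrier */
		if (condition)						/* test the condition */
			break;
		schedule();						/* only then go to sleep */
	}
	finish_wait(&wq_head, &wait);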

The other common pattern is to check for the "do I need to sleep at
all" at the *top* of the function, long before you bother with any
wait-queues at all. This code does that odd "let's check in the middle
if we need to sleep" instead, which then caused the bug.

So we had an odd setup because of three different wait conditions that
had different rules for what they could/should do before sleeping, and
then not sleeping reliably at all.

We could also fix it by just splitting out the three cases into their
own wait routines that match the normal pattern. The bug really
happened because that wait-loop is doing things such an odd way due to
all the different cases..

I actually think it would be a lot more readable if it was three
different cases instead of trying to be one "common" routine.

The *common* parts are the special PG_locked logic at the top, and the
thrashing/delayacct code at the bottom.

And *that* could be a true common helper, but the wait loops (the DROP
case isn't even a loop) are fundamentally different and probably should
be separate functions.

So I think my "one-liner" fixes the problem, but I'd certainly be open
to somebody cleaning this up properly.

Anybody?

             Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-29 14:02           ` Oleg Nesterov
@ 2020-06-30  2:08             ` Nicholas Piggin
  2020-06-30  6:17               ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Nicholas Piggin @ 2020-06-30  2:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Peter Zijlstra,
	Linus Torvalds

Excerpts from Oleg Nesterov's message of June 30, 2020 12:02 am:
> On 06/29, Nicholas Piggin wrote:
>>
>> prepare_to_wait_event() has a pretty good pattern (and comment), I would
>> favour using that (test the signal when inserting on the waitqueue).
>>
>> @@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>>         for (;;) {
>>                 spin_lock_irq(&q->lock);
>>
>> +               if (signal_pending_state(state, current)) {
>> +                       /* Must not lose an exclusive wake up, see
>> +                        * prepare_to_wait_event comment */
>> +                       list_del_init(&wait->entry);
>> +                       spin_unlock_irq(&q->lock);
>> +                       ret = -EINTR;
> 
> Basically this is what my patch in the 1st email does. But note that we can't
> just set "ret = -EINTR" here, we will need to clear "ret" if test_and_set_bit()
> below succeeds. That is why I used another "int intr" variable.

You snipped off one more important line of context. No such games are 
required AFAIKS.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-29 16:36             ` Linus Torvalds
@ 2020-06-30  2:12               ` Nicholas Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nicholas Piggin @ 2020-06-30  2:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Oleg Nesterov,
	Peter Zijlstra

Excerpts from Linus Torvalds's message of June 30, 2020 2:36 am:
> On Mon, Jun 29, 2020 at 6:16 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> No, ignore this part (which you explained well it was just a thinko,
>> and your patch of course would not have worked if this was the case):
>> the exclusive wake up doesn't get lost if schedule() was called because
>> state goes back to running regardless of what woke it.
> 
> Right.
> 
> The normal pattern for a wait-loop is
> 
>  - add yourself to the wait-queue and set yourself "sleeping" with a
> memory barrier.
> 
>  - test for the condition, exit if ok
> 
>  - go to sleep
> 
> and that pattern doesn't have the race.
> 
> The other common pattern is to check for the "do I need to sleep at
> all" at the *top* of the function, long before you bother with any
> wait-queues at all. This code does that odd "let's check in the middle
> if we need to sleep" instead, which then caused the bug.
> 
> So we had an odd setup because of three different wait conditions that
> had different rules for what they could/should do before sleeping, and
> then not sleeping reliably at all.
> 
> We could also fix it by just splitting out the three cases into their
> own wait routines that match the normal pattern. The bug really
> happened because that wait-loop is doing things such an odd way due to
> all the different cases..
> 
> I actually think it would be a lot more readable if it was three
> different cases instead of trying to be one "common" routine.
> 
> The *common* parts is the special PG_locked logic at the top, and the
> thrashing/delayacct code at the bottom.
> 
> And *that* could be a true common helper, but the wait loop (which
> isn't even a loop for the DROP case) are fundamentally different and
> probably should be separate functions.
> 
> So I think my "one-liner" fixes the problem, but I'd certainly be open
> to somebody cleaning this up properly.

I think it does. I would rather have my patch, which doesn't add a barrier
to the "fast" path, though (by the time we get here I think the fast path is
that we are sleeping on the page, doing the IO).

Alternatively (although this still adds more branching than necessary, at
least it doesn't have a barrier for the sleeping case), you could set
the running state as an else case if you don't schedule.
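
I.e. something like (sketch):

	if (likely(bit_is_set))
		io_schedule();
	else
		__set_current_state(TASK_RUNNING);	/* we never slept, drop out of TASK_KILLABLE */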

Thanks,
Nick

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30  2:08             ` Nicholas Piggin
@ 2020-06-30  6:17               ` Oleg Nesterov
  2020-06-30  9:08                 ` Nicholas Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-30  6:17 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Peter Zijlstra,
	Linus Torvalds

On 06/30, Nicholas Piggin wrote:
> Excerpts from Oleg Nesterov's message of June 30, 2020 12:02 am:
> > On 06/29, Nicholas Piggin wrote:
> >>
> >> prepare_to_wait_event() has a pretty good pattern (and comment), I would
> >> favour using that (test the signal when inserting on the waitqueue).
> >>
> >> @@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> >>         for (;;) {
> >>                 spin_lock_irq(&q->lock);
> >>
> >> +               if (signal_pending_state(state, current)) {
> >> +                       /* Must not lose an exclusive wake up, see
> >> +                        * prepare_to_wait_event comment */
> >> +                       list_del_init(&wait->entry);
> >> +                       spin_unlock_irq(&q->lock);
> >> +                       ret = -EINTR;
> >
> > Basically this is what my patch in the 1st email does. But note that we can't
> > just set "ret = -EINTR" here, we will need to clear "ret" if test_and_set_bit()
> > below succeeds. That is why I used another "int intr" variable.
>
> You snipped off one more important line of context. No such games are
> required AFAIKS.

		for (;;) {
			spin_lock_irq(&q->lock);
	 
	+               if (signal_pending_state(state, current)) {
	+                       /* Must not lose an exclusive wake up, see
	+                        * prepare_to_wait_event comment */
	+                       list_del_init(&wait->entry);
	+                       spin_unlock_irq(&q->lock);
	+                       ret = -EINTR;
	+                       break;
	+               }


so wait_on_page_bit_common() just returns -EINTR if signal_pending_state() == T.
And this is wrong if "current" was already woken up by unlock_page().

That is why ___wait_event() checks the condition even if prepare_to_wait_event()
returns -EINTR. The comment in prepare_to_wait_event() tries to explain this.
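
For reference, that check and comment read roughly like this
(kernel/sched/wait.c, trimmed):

	spin_lock_irqsave(&wq_head->lock, flags);
	if (signal_pending_state(state, current)) {
		/*
		 * Exclusive waiter must not fail if it was selected by wakeup,
		 * it should "consume" the condition we were waiting for.
		 *
		 * The caller will recheck the condition and return success if
		 * we were already woken up, we can not miss the event because
		 * wakeup locks/unlocks the same wq_head->lock.
		 *
		 * But we need to ensure that set-condition + wakeup after that
		 * can't see us, it should wake up another exclusive waiter if
		 * we fail.
		 */
		list_del_init(&wq_entry->entry);
		ret = -ERESTARTSYS;
	} else {
		...
		set_current_state(state);
	}
	spin_unlock_irqrestore(&wq_head->lock, flags);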

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30  6:17               ` Oleg Nesterov
@ 2020-06-30  9:08                 ` Nicholas Piggin
  2020-06-30 10:53                   ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Nicholas Piggin @ 2020-06-30  9:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Peter Zijlstra,
	Linus Torvalds

Excerpts from Oleg Nesterov's message of June 30, 2020 4:17 pm:
> On 06/30, Nicholas Piggin wrote:
>> Excerpts from Oleg Nesterov's message of June 30, 2020 12:02 am:
>> > On 06/29, Nicholas Piggin wrote:
>> >>
>> >> prepare_to_wait_event() has a pretty good pattern (and comment), I would
>> >> favour using that (test the signal when inserting on the waitqueue).
>> >>
>> >> @@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>> >>         for (;;) {
>> >>                 spin_lock_irq(&q->lock);
>> >>
>> >> +               if (signal_pending_state(state, current)) {
>> >> +                       /* Must not lose an exclusive wake up, see
>> >> +                        * prepare_to_wait_event comment */
>> >> +                       list_del_init(&wait->entry);
>> >> +                       spin_unlock_irq(&q->lock);
>> >> +                       ret = -EINTR;
>> >
>> > Basically this is what my patch in the 1st email does. But note that we can't
>> > just set "ret = -EINTR" here, we will need to clear "ret" if test_and_set_bit()
>> > below succeeds. That is why I used another "int intr" variable.
>>
>> You snipped off one more important line of context. No such games are
>> required AFAIKS.
> 
> 		for (;;) {
> 			spin_lock_irq(&q->lock);
> 	 
> 	+               if (signal_pending_state(state, current)) {
> 	+                       /* Must not lose an exclusive wake up, see
> 	+                        * prepare_to_wait_event comment */
> 	+                       list_del_init(&wait->entry);
> 	+                       spin_unlock_irq(&q->lock);
> 	+                       ret = -EINTR;
> 	+                       break;
> 	+               }
> 
> 
> so wait_on_page_bit_common() just returns -EINTR if signal_pending_state() == T.
> And this is wrong if "current" was already woken up by unlock_page().
> 
> That is why ___wait_event() checks the condition even if prepare_to_wait_event()
> returns -EINTR. The comment in prepare_to_wait_event() tries to explain this.

Hmm, yeah, because we can loop around here with the task still in a
sleeping state. Which comes back to Linus' fix. Thanks.

It looks like I broke this with 62906027091f1, then Linus mostly fixed
it in a8b169afbf06a. My patch is what actually introduced this ugly
bit test, but do we even need it at all? If we do, then it's
under-commented; I can't see how it wouldn't be racy, though. Can we just
get rid of it entirely?
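
(For reference, the test in question, as quoted earlier in the thread:

	bit_is_set = test_bit(bit_nr, &page->flags);
	...
	if (likely(bit_is_set))
		io_schedule();
)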

Thanks,
Nick

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30  9:08                 ` Nicholas Piggin
@ 2020-06-30 10:53                   ` Oleg Nesterov
  2020-06-30 11:36                     ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-30 10:53 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Peter Zijlstra,
	Linus Torvalds

On 06/30, Nicholas Piggin wrote:
>
> My patch is what actually introduced this ugly
> bit test, but do we even need it at all? If we do then it's
> under-commented, I can't see it wouldn't be racy though. Can we just
> get rid of it entirely?

But then we will need to move io_schedule() down, after test_and_set_bit().
And we will have the same problem with task->state != RUNNING. Plus more
complications with "behavior == DROP".

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30 10:53                   ` Oleg Nesterov
@ 2020-06-30 11:36                     ` Oleg Nesterov
  2020-06-30 11:50                       ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-30 11:36 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Peter Zijlstra,
	Linus Torvalds

On 06/30, Oleg Nesterov wrote:
>
> On 06/30, Nicholas Piggin wrote:
> >
> > My patch is what actually introduced this ugly
> > bit test, but do we even need it at all? If we do then it's
> > under-commented, I can't see it wouldn't be racy though. Can we just
> > get rid of it entirely?
>
> But then we will need to move io_schedule() down, after test_and_set_bit().
> And we will have the same problem with task->state != RUNNING. Plus more
> complications with "behavior == DROP".

maybe something like this

	for (;;) {
		int intr = 0;

		spin_lock_irq(&q->lock);
		if (signal_pending_state(state, current)) {
			/* see the comment in prepare_to_wait_event() */
			list_del_init(&wait->entry);
			intr = 1;
		} else {
			if (likely(list_empty(&wait->entry))) {
				__add_wait_queue_entry_tail(q, wait);
				SetPageWaiters(page);
			}
			set_current_state(state);
		}
		spin_unlock_irq(&q->lock);

		if (behavior == EXCLUSIVE) {
			if (!test_and_set_bit_lock(bit_nr, &page->flags))
				break;
		} else {
			int is_set = test_bit(bit_nr, &page->flags);
			if (behavior == DROP)
				put_page(page);
			if (!is_set)
				break;
		}

		if (intr) {
			ret = -EINTR;
			break;
		}

		io_schedule();

		if (behavior == DROP) {
			/*
			 * We can no longer safely access page->flags:
			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
			 * there is a risk of waiting forever on a page reused
			 * for something that keeps it locked indefinitely.
			 * But best check for -EINTR before breaking.
			 */
			if (signal_pending_state(state, current))
				ret = -EINTR;
			break;
		}
	}

? I dunno...

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30 11:36                     ` Oleg Nesterov
@ 2020-06-30 11:50                       ` Oleg Nesterov
  2020-06-30 18:02                         ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-30 11:50 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andi Kleen, Davidlohr Bueso, Jan Kara, Lukas Czerner,
	Linux Kernel Mailing List, Mel Gorman, Peter Zijlstra,
	Linus Torvalds

On 06/30, Oleg Nesterov wrote:
>
> maybe something like this

or this ...

	for (;;) {
		int intr = 0;

		spin_lock_irq(&q->lock);
		if (signal_pending_state(state, current)) {
			/* see the comment in prepare_to_wait_event() */
			list_del_init(&wait->entry);
			intr = 1;
		} else {
			if (likely(list_empty(&wait->entry))) {
				__add_wait_queue_entry_tail(q, wait);
				SetPageWaiters(page);
			}
			set_current_state(state);
		}
		spin_unlock_irq(&q->lock);

		bit_is_set = test_bit(bit_nr, &page->flags);
		if (behavior == DROP)
			put_page(page);

		if (!bit_is_set) {
			if (behavior != EXCLUSIVE)
				break;
			if (!test_and_set_bit_lock(bit_nr, &page->flags))
				break;
		}

		if (intr) {
			ret = -EINTR;
			break;
		}

		io_schedule();

		if (behavior == DROP) {
			/*
			 * We can no longer safely access page->flags:
			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
			 * there is a risk of waiting forever on a page reused
			 * for something that keeps it locked indefinitely.
			 * But best check for -EINTR before breaking.
			 */
			if (signal_pending_state(state, current))
				ret = -EINTR;
			break;
		}
	}

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30 11:50                       ` Oleg Nesterov
@ 2020-06-30 18:02                         ` Linus Torvalds
  2020-06-30 18:29                           ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2020-06-30 18:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Nicholas Piggin, Andi Kleen, Davidlohr Bueso, Jan Kara,
	Lukas Czerner, Linux Kernel Mailing List, Mel Gorman,
	Peter Zijlstra

On Tue, Jun 30, 2020 at 4:51 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 06/30, Oleg Nesterov wrote:
> >
> > may be someting like this
>
> or this ...

Guys, I'd suggest we either

 (a) do the minimal ugly one-liner workaround

or

 (b) do something *much* more radical.

What would that "radical" thing be? Just move the "behavior" bit into
the "struct wait_page_key" itself, and handle it at wakeup time!

Right now we have

 - wake_page_function() basically tests that it's the right bit and
page, and that the bit has actually cleared

 - then it calls "autoremove_wake_function()" to wake things up and
remove you from the wait list if that woke you up.

I would suggest that if we want to clean this up, we do

 - add "behavior" to  "struct wait_page_key"

 - now, for the "exclusive" case, instead of doing

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;

   the wake_page_function() would actually do

        if (test_and_set_bit(key->bit_nr, &key->page->flags))
                return -1;
        /*
         * Careful, careful: we need to make sure that the *last*
         * thing that touches 'wq_entry' is setting WQ_FLAG_WOKEN
         */
        list_del_init(&wq_entry->entry);
        default_wake_function(wq_entry, mode, sync, key);
        smp_store_release(&wq_entry->flags, WQ_FLAG_WOKEN);

        /* No point in waking up anybody else */
        return -1;

while for the SHARED and DROP cases it would do

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;
        list_del_init(&wq_entry->entry);
        default_wake_function(wq_entry, mode, sync, key);
        smp_store_release(&wq_entry->flags, WQ_FLAG_WOKEN);

        return 0;

and now the actual _waiting_ side can basically just do

        /* All the wait-queue setup is done outside the loop */
        wait_page.behavior = behavior;

        spin_lock_irq(&q->lock);
        .. test the bit again here, do the right thing for mode, exit if done ..
        __add_wait_queue_entry_tail(q, wait);
        SetPageWaiters(page);
        set_current_state(state);
        spin_unlock_irq(&q->lock);

        for (;;) {
                set_current_state(state);

                /*
                 * Have we already been woken and are all done?
                 * We don't even need to remove ourselves from the
                 * wait-queues, that's been done for us.
                 */
                if (wait->flags & WQ_FLAG_WOKEN)
                        return 0;

                if (signal_pending_state(state, current))
                         break;

                io_schedule();
        }
        /* We got interrupted by a signal? */
        finish_wait(q, wait);

        /* But maybe we raced with being woken and are all done? */
        if (wait->flags & WQ_FLAG_WOKEN)
                return 0;

        return -EINTR;

NOTE! The above doesn't do the thrashing/delayacct handling, and that
was intentional for clarity, because I actually think that should be
done in an outer function that just calls this actual "do the
lock/wait" part.

Look, doesn't that seem *much* simpler and clearer? Having that flag
in the wait-queue entry (which stays around even after it's been
removed from the list) shows that the state we're testing for - or the
state we wanted - is already ours.

Note that we have a "woken_wake_function()" that does that
WQ_FLAG_WOKEN thing, but I don't think it's careful about that final
WQ_FLAG_WOKEN bit setting, and the above function cares about that
(notice how it checks WQ_FLAG_WOKEN without ever taking the spinlock
that protects the other members of it, so the stack space can get
deallocated immediately once it sees that flag).

The above code has been written in my MUA, it may be entirely broken,
but it _feels_ right to me.

And yes, it's a lot bigger change than my one-liner. But honestly,
this is the kind of thing that the whole "struct wait_page_key" thing
is all about and exists for.

What do people think? I think it clarifies the logic and makes things
potentially much clearer. In fact, now the actual _waiting_ code never
looks at the struct page at all, so it can drop the reference to the
page early, outside the loop, and that "drop" thing doesn't even have
to be visible on the waker side either (the wakeup logic is the same
as SHARED).

Hmm?

              Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30 18:02                         ` Linus Torvalds
@ 2020-06-30 18:29                           ` Oleg Nesterov
  2020-06-30 18:57                             ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2020-06-30 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicholas Piggin, Andi Kleen, Davidlohr Bueso, Jan Kara,
	Lukas Czerner, Linux Kernel Mailing List, Mel Gorman,
	Peter Zijlstra

On 06/30, Linus Torvalds wrote:
>
> On Tue, Jun 30, 2020 at 4:51 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 06/30, Oleg Nesterov wrote:
> > >
> > > may be something like this
> >
> > or this ...
>
> Guys, I'd suggest we either
>
>  (a) do the minimal ugly one-liner workaround
>
> or
>
>  (b) do something *much* more radical.

I agree, and let me repeat I think your one-liner is fine.

I just tried to discuss the possible alternatives to that one-liner, for a
start.

>    the wake_page_function() would actually do

I swear, I too thought about changing wake_page_function() but got lost ;)

The very fact it returns -1 if test_bit() is true suggests it can do more.

Oleg.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?
  2020-06-30 18:29                           ` Oleg Nesterov
@ 2020-06-30 18:57                             ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2020-06-30 18:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Nicholas Piggin, Andi Kleen, Davidlohr Bueso, Jan Kara,
	Lukas Czerner, Linux Kernel Mailing List, Mel Gorman,
	Peter Zijlstra

On Tue, Jun 30, 2020 at 11:36 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> I swear, I too thought about changing wake_page_function() but got lost ;)
>
> The very fact it returns -1 if test_bit() is true suggests it can do more.

wake functions can indeed do more.

You can return -1 to say "stop traversing the list, nobody else on
this waitlist is relevant" (and that's separate from the "I did the
exclusive one").

Returning "0" is a success, but implies it shouldn't count as an
exclusive wakeup (generally because "try_to_wake_up()" didn't return
1, so the entry on the list wasn't actually sleeping and might already
have passed the critical region).

And returning "1" means that you _actually_ woke something up, so that
the exclusive counting triggers.

So that 0-vs-1 is very much exactly because of the race with "I'm an
exclusive waiter, and if an event comes in while I'm _actively_
waiting for it, then I promise I will take it". And the difference is
exactly about whether try_to_wake_up() actually changed the state of
the process or not.
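
That is exactly how __wake_up_common() consumes the return value - the
core of its list walk is roughly this (bookmark handling left out):

        /* curr/next are wait_queue_entry pointers, key is the wake key */
        list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
                unsigned flags = curr->flags;
                int ret;

                ret = curr->func(curr, mode, wake_flags, key);
                if (ret < 0)            /* -1: stop walking the list */
                        break;
                /* 1 from an exclusive entry: count it towards nr_exclusive */
                if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }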

But in addition to these "should I continue and/or count exclusive
waiters", wake functions can _do_ things too. The common one is
"autoremove", which is basically "if you actually woke something up,
then remove it from the list" (so that one generally goes along with a
successful try_to_wake_up() and returning 1).
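
For reference, the existing autoremove_wake_function() in
kernel/sched/wait.c is basically just that one extra line on top of the
default:

        int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                     unsigned mode, int sync, void *key)
        {
                int ret = default_wake_function(wq_entry, mode, sync, key);

                /* Only unlink the entry if we really woke the task up */
                if (ret)
                        list_del_init(&wq_entry->entry);

                return ret;
        }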

Doing that can help scalability enormously, because if you stay on the
list and only remove yourself in the final "finish_wait()", then if
there are lots of waiters and lots of wakeup events, the wakeups will
traverse your entry over and over and over again. So autoremove is
generally a good thing.

An autoremove() thing can also be used at finish_wait() time to decide
whether you were actually woken up by that list or not (like the
WQ_FLAG_WOKEN, just implicitly). Remember: you can be on _many_ wait
lists, and you can be woken up by non-waitlist events like signals etc
too, so you can use WQ_FLAG_WOKEN (if you use woken_wake_function) or
the "am I still on the list" (if you use autoremove) to decide that
_that_ particular wakeup source triggered.
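
(woken_wake_function() itself is tiny - it just marks the entry before
doing the default wakeup, roughly:

        int woken_wake_function(struct wait_queue_entry *wq_entry,
                                unsigned mode, int sync, void *key)
        {
                /* Pairs with the smp_store_mb() in wait_woken() */
                smp_mb();
                wq_entry->flags |= WQ_FLAG_WOKEN;

                return default_wake_function(wq_entry, mode, sync, key);
        }

and wait_woken() on the sleeping side then checks WQ_FLAG_WOKEN.)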

But some code wants to _stay_ on the list, because maybe they want to
be notified about multiple things and don't want to re-insert
themselves onto the list in between. That's the default behavior if
you don't do anything at all and just use the normal wake queue init
things. It's not generally the best thing to do, but it _can_ be
useful, and I think it's the default one purely for historical reasons
- ie that's how they all used to work, and the "autoremove" behavior
came in because of the scalability concerns.

But the waiting side is the one that decides which wakeup function it
wants for the wakeup event. So you can have the "please wake me up and
remove me from the wait list when you do", _and_ then add your own
logic as well, if you want to.
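
IOW the waiter simply picks the callback when it sets up its on-stack
entry, something like this ('my_wq' standing in for whatever wait queue
head you are waiting on):

        DEFINE_WAIT(plain_wait);        /* uses autoremove_wake_function */
        DEFINE_WAIT_FUNC(marked_wait, woken_wake_function);

        prepare_to_wait(&my_wq, &plain_wait, TASK_INTERRUPTIBLE);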

Yes, our wait queues are complicated. They are very powerful, though,
and they have been designed to easily avoid races many different ways
(and to also very naturally work with waiting for multiple very
different events, where select and poll are the obvious cases, but
even a single "read()" function _could_ decide to use multiple
wait-queues if there's another wait-queue for exceptional
circumstances, for example).

                  Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-06-30 18:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-24 16:11 wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup? Oleg Nesterov
2020-06-24 16:20 ` Oleg Nesterov
2020-06-24 16:36   ` Linus Torvalds
2020-06-26 15:43     ` Peter Zijlstra
2020-06-28  5:39       ` Linus Torvalds
2020-06-28 13:18         ` Peter Zijlstra
2020-06-29  3:28         ` Nicholas Piggin
2020-06-29 13:16           ` Nicholas Piggin
2020-06-29 16:36             ` Linus Torvalds
2020-06-30  2:12               ` Nicholas Piggin
2020-06-29 14:02           ` Oleg Nesterov
2020-06-30  2:08             ` Nicholas Piggin
2020-06-30  6:17               ` Oleg Nesterov
2020-06-30  9:08                 ` Nicholas Piggin
2020-06-30 10:53                   ` Oleg Nesterov
2020-06-30 11:36                     ` Oleg Nesterov
2020-06-30 11:50                       ` Oleg Nesterov
2020-06-30 18:02                         ` Linus Torvalds
2020-06-30 18:29                           ` Oleg Nesterov
2020-06-30 18:57                             ` Linus Torvalds
2020-06-29 15:13         ` Oleg Nesterov
2020-06-24 16:22 ` Linus Torvalds
2020-06-24 16:43   ` Oleg Nesterov
