All of lore.kernel.org
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Tim Chen <tim.c.chen@linux.intel.com>, Nick Piggin <npiggin@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
	Kan Liang <kan.liang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Christopher Lameter <cl@linux.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Sun, 27 Aug 2017 16:12:19 -0700	[thread overview]
Message-ID: <CA+55aFwuyqm6xMmS0PdjDZbgrXTiXkH+cGua=npXLaEnzOUGjw@mail.gmail.com> (raw)
In-Reply-To: <CA+55aFx0NjiHM5Aw0N7xDwRcnHOiaceV2iYuGOU1uM3FUyf+Lg@mail.gmail.com>

On Sun, Aug 27, 2017 at 2:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The race goes like this:
>
>   thread1       thread2         thread3
>   ----          ----            ----
>
>   .. CPU1 ...
>   __lock_page_killable
>     wait_on_page_bit_common()
>       get wq lock
>       __add_wait_queue_entry_tail_exclusive()
>       set_current_state(TASK_KILLABLE);
>       release wq lock
>         io_schedule
>
>                 ... CPU2 ...
>                 __lock_page[_killable]()
>                   wait_on_page_bit_common()
>                     get wq lock
>                     __add_wait_queue_entry_tail_exclusive()
>                     set_current_state(TASK_KILLABLE);
>                     release wq lock
>                     io_schedule
>
>                                 ... CPU3 ...
>                                 unlock_page()
>                                 wake_up_page_bit(page, PG_Locked)
>                                   wakes up CPU1 _only_
>
>   ... lethal signal for thread1 happens ...
>      if (unlikely(signal_pending_state(state, current))) {
>           ret = -EINTR;
>           break;
>      }

With the race meaning that thread2 never gets woken up due to the
exclusive wakeup being caught by thread1 (which doesn't actually take
the lock).

I think that this bug was introduced by commit 62906027091f ("mm: add
PageWaiters indicating tasks are waiting for a page bit"), which
changed the page lock from using the wait_on_bit_lock() code to its
own _slightly_ different version.

Because it looks like _almost_ the same thing existed in the old
wait_on_bit_lock() code - and that is still used by a couple of
filesystems.

*Most* of the users seem to use TASK_UNINTERRUPTIBLE, which is fine.
But cifs and the sunrpc XPRT_LOCKED code both use the TASK_KILLABLE
form that would seem to have the exact same issue: wait_on_bit_lock()
uses exclusive wait-queues, but then may return with an error without
actually taking the lock.

Now, it turns out that I think the wait_on_bit_lock() code is actually
safe, for a subtle reason.

Why? Unlike the page lock code, the wait_on_bit_lock() code always
tries to get the lock bit before returning an error. So
wait_on_bit_lock() will prefer a successful lock taking over EINTR,
which means that if the bit really was unlocked, it would have been
taken.

And if something else locked the bit again under us and we didn't get
it, that "something else" presumably will then wake things up when it
unlocks.

So the wait_on_bit_lock() code could _also_ lose the wakeup event, but
it would only lose it in situations where somebody else would then
re-send it.

Do people agree with that analysis?

So I think the old wait_on_bit_lock() code ends up being safe -
despite having this same pattern of "exclusive wait but might error
out without taking the lock".

Whether that "wait_on_bit_lock() is safe" was just a fluke or was
because people thought about it is unclear. It's old code. People
probably *did* think about it. I really can't remember.

But it does point to a fix for the page lock case: just move the
signal_pending_state() test to after the bit checking.

So the page locking code is racy because you could have this:

 - another cpu does the unlock_page() and wakes up the process (and
uses the exclusive event)

 - we then get a lethal signal before we get toi the
"signal_pending_state()" state.

 - we end up prioritizing the lethal signal, because obviously we
don't care about locking the page any more.

 - so now the lock bit may be still clear and there's nobody who is
going to wake up the remaining waiter

Moving the signal_pending_state() down actually fixes the race,
because we know that in order for the exclusive thing to have
mattered, it *has* to actually wake us up. So the unlock_page() must
happen before the lethal signal (where before is well-defined because
of that "try_to_wake_up()" taking a lock and looking at the task
state). The exclusive accounting is only done if the process is
actually woken up, not if it was already running (see
"try_to_wake_up()").

And if the unlock_page() happened before the lethal signal, then we
know that test_and_set_bit_lock() will either work (in which case
we're ok), or another locker successfully came in later - in which
case we're _also_ ok, because that other locker will then do the
unlock again, and wake up subsequent waiters that might have been
blocked by our first exclusive waiter.

So I propose that the fix might be as simple as this:

    diff --git a/mm/filemap.c b/mm/filemap.c
    index baba290c276b..0b41c8cbeabc 100644
    --- a/mm/filemap.c
    +++ b/mm/filemap.c
    @@ -986,10 +986,6 @@ static inline int
wait_on_page_bit_common(wait_queue_head_t *q,

                if (likely(test_bit(bit_nr, &page->flags))) {
                        io_schedule();
    -                   if (unlikely(signal_pending_state(state, current))) {
    -                           ret = -EINTR;
    -                           break;
    -                   }
                }

                if (lock) {
    @@ -999,6 +995,11 @@ static inline int
wait_on_page_bit_common(wait_queue_head_t *q,
                        if (!test_bit(bit_nr, &page->flags))
                                break;
                }
    +
    +           if (unlikely(signal_pending_state(state, current))) {
    +                   ret = -EINTR;
    +                   break;
    +           }
        }

        finish_wait(q, wait);

but maybe I'm missing something.

Nick, comments?

                 Linus

WARNING: multiple messages have this Message-ID (diff)
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Tim Chen <tim.c.chen@linux.intel.com>, Nick Piggin <npiggin@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, Andi Kleen <ak@linux.intel.com>,
	Kan Liang <kan.liang@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Christopher Lameter <cl@linux.com>,
	"Eric W . Biederman" <ebiederm@xmission.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
Date: Sun, 27 Aug 2017 16:12:19 -0700	[thread overview]
Message-ID: <CA+55aFwuyqm6xMmS0PdjDZbgrXTiXkH+cGua=npXLaEnzOUGjw@mail.gmail.com> (raw)
In-Reply-To: <CA+55aFx0NjiHM5Aw0N7xDwRcnHOiaceV2iYuGOU1uM3FUyf+Lg@mail.gmail.com>

On Sun, Aug 27, 2017 at 2:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The race goes like this:
>
>   thread1       thread2         thread3
>   ----          ----            ----
>
>   .. CPU1 ...
>   __lock_page_killable
>     wait_on_page_bit_common()
>       get wq lock
>       __add_wait_queue_entry_tail_exclusive()
>       set_current_state(TASK_KILLABLE);
>       release wq lock
>         io_schedule
>
>                 ... CPU2 ...
>                 __lock_page[_killable]()
>                   wait_on_page_bit_common()
>                     get wq lock
>                     __add_wait_queue_entry_tail_exclusive()
>                     set_current_state(TASK_KILLABLE);
>                     release wq lock
>                     io_schedule
>
>                                 ... CPU3 ...
>                                 unlock_page()
>                                 wake_up_page_bit(page, PG_Locked)
>                                   wakes up CPU1 _only_
>
>   ... lethal signal for thread1 happens ...
>      if (unlikely(signal_pending_state(state, current))) {
>           ret = -EINTR;
>           break;
>      }

With the race meaning that thread2 never gets woken up due to the
exclusive wakeup being caught by thread1 (which doesn't actually take
the lock).

I think that this bug was introduced by commit 62906027091f ("mm: add
PageWaiters indicating tasks are waiting for a page bit"), which
changed the page lock from using the wait_on_bit_lock() code to its
own _slightly_ different version.

Because it looks like _almost_ the same thing existed in the old
wait_on_bit_lock() code - and that is still used by a couple of
filesystems.

*Most* of the users seem to use TASK_UNINTERRUPTIBLE, which is fine.
But cifs and the sunrpc XPRT_LOCKED code both use the TASK_KILLABLE
form that would seem to have the exact same issue: wait_on_bit_lock()
uses exclusive wait-queues, but then may return with an error without
actually taking the lock.

Now, it turns out that I think the wait_on_bit_lock() code is actually
safe, for a subtle reason.

Why? Unlike the page lock code, the wait_on_bit_lock() code always
tries to get the lock bit before returning an error. So
wait_on_bit_lock() will prefer a successful lock taking over EINTR,
which means that if the bit really was unlocked, it would have been
taken.

And if something else locked the bit again under us and we didn't get
it, that "something else" presumably will then wake things up when it
unlocks.

So the wait_on_bit_lock() code could _also_ lose the wakeup event, but
it would only lose it in situations where somebody else would then
re-send it.

Do people agree with that analysis?

So I think the old wait_on_bit_lock() code ends up being safe -
despite having this same pattern of "exclusive wait but might error
out without taking the lock".

Whether that "wait_on_bit_lock() is safe" was just a fluke or was
because people thought about it is unclear. It's old code. People
probably *did* think about it. I really can't remember.

But it does point to a fix for the page lock case: just move the
signal_pending_state() test to after the bit checking.

So the page locking code is racy because you could have this:

 - another cpu does the unlock_page() and wakes up the process (and
uses the exclusive event)

 - we then get a lethal signal before we get toi the
"signal_pending_state()" state.

 - we end up prioritizing the lethal signal, because obviously we
don't care about locking the page any more.

 - so now the lock bit may be still clear and there's nobody who is
going to wake up the remaining waiter

Moving the signal_pending_state() down actually fixes the race,
because we know that in order for the exclusive thing to have
mattered, it *has* to actually wake us up. So the unlock_page() must
happen before the lethal signal (where before is well-defined because
of that "try_to_wake_up()" taking a lock and looking at the task
state). The exclusive accounting is only done if the process is
actually woken up, not if it was already running (see
"try_to_wake_up()").

And if the unlock_page() happened before the lethal signal, then we
know that test_and_set_bit_lock() will either work (in which case
we're ok), or another locker successfully came in later - in which
case we're _also_ ok, because that other locker will then do the
unlock again, and wake up subsequent waiters that might have been
blocked by our first exclusive waiter.

So I propose that the fix might be as simple as this:

    diff --git a/mm/filemap.c b/mm/filemap.c
    index baba290c276b..0b41c8cbeabc 100644
    --- a/mm/filemap.c
    +++ b/mm/filemap.c
    @@ -986,10 +986,6 @@ static inline int
wait_on_page_bit_common(wait_queue_head_t *q,

                if (likely(test_bit(bit_nr, &page->flags))) {
                        io_schedule();
    -                   if (unlikely(signal_pending_state(state, current))) {
    -                           ret = -EINTR;
    -                           break;
    -                   }
                }

                if (lock) {
    @@ -999,6 +995,11 @@ static inline int
wait_on_page_bit_common(wait_queue_head_t *q,
                        if (!test_bit(bit_nr, &page->flags))
                                break;
                }
    +
    +           if (unlikely(signal_pending_state(state, current))) {
    +                   ret = -EINTR;
    +                   break;
    +           }
        }

        finish_wait(q, wait);

but maybe I'm missing something.

Nick, comments?

                 Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2017-08-27 23:12 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-25 16:13 [PATCH 1/2 v2] sched/wait: Break up long wake list walk Tim Chen
2017-08-25 16:13 ` Tim Chen
2017-08-25 16:13 ` [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit Tim Chen
2017-08-25 16:13   ` Tim Chen
2017-08-25 19:58   ` Linus Torvalds
2017-08-25 19:58     ` Linus Torvalds
2017-08-25 22:19     ` Tim Chen
2017-08-25 22:19       ` Tim Chen
2017-08-25 22:51       ` Linus Torvalds
2017-08-25 23:03         ` Linus Torvalds
2017-08-26  0:31           ` Linus Torvalds
2017-08-26  0:31             ` Linus Torvalds
2017-08-26  2:54             ` Linus Torvalds
2017-08-26  2:54               ` Linus Torvalds
2017-08-26 18:15               ` Linus Torvalds
2017-08-27 21:40                 ` Linus Torvalds
2017-08-27 21:40                   ` Linus Torvalds
2017-08-27 21:42                   ` Linus Torvalds
2017-08-27 21:42                     ` Linus Torvalds
2017-08-27 23:12                   ` Linus Torvalds [this message]
2017-08-27 23:12                     ` Linus Torvalds
2017-08-28  1:16                     ` Nicholas Piggin
2017-08-28  1:16                       ` Nicholas Piggin
2017-08-28  1:29                       ` Nicholas Piggin
2017-08-28  1:29                         ` Nicholas Piggin
2017-08-28  5:17                         ` Linus Torvalds
2017-08-28  5:17                           ` Linus Torvalds
2017-08-28  7:18                           ` Nicholas Piggin
2017-08-28  7:18                             ` Nicholas Piggin
2017-08-28 14:51                 ` Liang, Kan
2017-08-28 16:48                   ` Linus Torvalds
2017-08-28 20:01                     ` Tim Chen
2017-08-28 20:01                       ` Tim Chen
2017-08-29 12:57                     ` Liang, Kan
2017-08-29 16:01                       ` Linus Torvalds
2017-08-29 16:01                         ` Linus Torvalds
2017-08-29 16:13                         ` Tim Chen
2017-08-29 16:13                           ` Tim Chen
2017-08-29 16:24                           ` Linus Torvalds
2017-08-29 16:24                             ` Linus Torvalds
2017-08-29 16:57                             ` Tim Chen
2017-08-29 16:57                               ` Tim Chen
2017-09-14  2:12                             ` Tim Chen
2017-09-14  2:12                               ` Tim Chen
2017-09-14  2:27                               ` Linus Torvalds
2017-09-14  2:27                                 ` Linus Torvalds
2017-09-14 16:50                                 ` Tim Chen
2017-09-14 16:50                                   ` Tim Chen
2017-09-14 17:00                                   ` Linus Torvalds
2017-09-14 17:00                                     ` Linus Torvalds
2017-09-14 16:39                               ` Christopher Lameter
2017-08-29 16:17                     ` Tim Chen
2017-08-29 16:17                       ` Tim Chen
2017-08-29 16:22                       ` Linus Torvalds
2017-08-29 16:22                         ` Linus Torvalds
2017-08-25 17:46 ` [PATCH 1/2 v2] sched/wait: Break up long wake list walk Christopher Lameter
2017-08-25 17:46   ` Christopher Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CA+55aFwuyqm6xMmS0PdjDZbgrXTiXkH+cGua=npXLaEnzOUGjw@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=dave@stgolabs.net \
    --cc=ebiederm@xmission.com \
    --cc=hannes@cmpxchg.org \
    --cc=jack@suse.cz \
    --cc=kan.liang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@elte.hu \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=tim.c.chen@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.