From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Felipe Balbi <balbi@kernel.org>, Oleg Nesterov <oleg@redhat.com>
Subject: [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable
Date: Mon, 9 Dec 2019 10:18:13 +0100
Message-ID: <20191209091813.GA41320@gmail.com>
In-Reply-To: <CAHk-=whiKy63tpFVUUS1sH07ce692rKcoo0ztnHw5UaPaMg8Ng@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The reason it is buggy is that wait_event_interruptible_exclusive()
> does this (inside the __wait_event() macro that it expands to):
> 
>                 long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);
> 
>                 if (condition)
>                         break;
>                 if (___wait_is_interruptible(state) && __int) {
>                         __ret = __int;
>                         goto __out;
> 
> and the thing is, if it does that "__ret = __int" case and returns
> -ERESTARTSYS, it's possible that the wakeup event has already been
> consumed, because we've added ourselves as an exclusive waiter to the
> queue. So it _says_ it was interrupted, not woken up, and the wait got
> cancelled, but because we were an exclusive waiter, we might be the
> _only_ thing that got woken up, and the wakeup basically got forgotten
> - all the other exclusive waiters will remain waiting.

So the place that detects interruption is prepare_to_wait_event():

long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
        unsigned long flags;
        long ret = 0;

        spin_lock_irqsave(&wq_head->lock, flags);
        if (signal_pending_state(state, current)) {
                /*
                 * Exclusive waiter must not fail if it was selected by wakeup,
                 * it should "consume" the condition we were waiting for.
                 *
                 * The caller will recheck the condition and return success if
                 * we were already woken up, we can not miss the event because
                 * wakeup locks/unlocks the same wq_head->lock.
                 *
                 * But we need to ensure that set-condition + wakeup after that
                 * can't see us, it should wake up another exclusive waiter if
                 * we fail.
                 */
                list_del_init(&wq_entry->entry);
                ret = -ERESTARTSYS;
        } else {
                if (list_empty(&wq_entry->entry)) {
                        if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
                                __add_wait_queue_entry_tail(wq_head, wq_entry);
                        else
                                __add_wait_queue(wq_head, wq_entry);
                }
                set_current_state(state);
        }
        spin_unlock_irqrestore(&wq_head->lock, flags);

        return ret;
}

I think we can indeed lose an exclusive event here, despite the comment 
arguing that we shouldn't: if the wakeup already removed us from the list, 
then list_del_init() is a no-op and the consumed exclusive wakeup is never 
passed on to another waiter AFAICS.

This logic was introduced in these commits 3 years ago:

  b1ea06a90f52: ("sched/wait: Avoid abort_exclusive_wait() in ___wait_event()")
  eaf9ef52241b: ("sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()")

Before those commits we simply did this:

 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
        unsigned long flags;
-
-       if (signal_pending_state(state, current))
-               return -ERESTARTSYS;

That was safe in the sense that it didn't touch the waitqueue in case of 
interruption. The interrupted exclusive waiter then called 
abort_exclusive_wait(), which got removed in eaf9ef52241b, to pass on any 
leftover wakeup:

-void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, void *key)
-{
-       unsigned long flags;
-
-       __set_current_state(TASK_RUNNING);
-       spin_lock_irqsave(&q->lock, flags);
-       if (!list_empty(&wait->task_list))
-               list_del_init(&wait->task_list);
-       else if (waitqueue_active(q))
-               __wake_up_locked_key(q, TASK_NORMAL, key);
-       spin_unlock_irqrestore(&q->lock, flags);
-}
-EXPORT_SYMBOL(abort_exclusive_wait);

Note how the wakeup is passed along to another exclusive waiter (if any) 
via the __wake_up_locked_key() call.

Oleg?

> I dunno. Maybe this is fundamental to the interface? We do not have a 
> lot of users that mix "interruptible" and "exclusive". In fact, I see 
> only two cases that care about the return value, but at least the fuse 
> use does seem to have exactly this problem with potentially lost 
> wakeups because the ERESTARTSYS case ends up eating a wakeup without 
> doing anything about it.

I don't think it's fundamental to the interface.

In fact I'd argue that it's fundamental to the interface to *not* lose 
exclusive events.
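
For illustration, the usage pattern in question is roughly the following. 
This is a hypothetical, fuse-like caller, with made-up 'struct req_queue' / 
'struct req' types and a made-up dequeue_req() helper, not the actual fuse 
code:

/* Made-up types, just for the sketch. */
struct req {
        struct list_head        entry;
};

struct req_queue {
        spinlock_t              lock;
        struct list_head        pending;
        wait_queue_head_t       waitq;
};

static int dequeue_req(struct req_queue *q, struct req **reqp)
{
        int err;

        /* Exclusive wait: wake_up() wakes only one of us. */
        err = wait_event_interruptible_exclusive(q->waitq,
                                                 !list_empty(&q->pending));
        if (err) {
                /*
                 * The "believe the return value" pattern: per the analysis
                 * above, the exclusive wakeup that selected us may already
                 * have been consumed, and bailing out here does nothing to
                 * pass it on to another waiter.
                 */
                return err;
        }

        spin_lock(&q->lock);
        if (list_empty(&q->pending)) {
                /* Raced with another consumer; a real caller would loop. */
                spin_unlock(&q->lock);
                return -EAGAIN;
        }
        *reqp = list_first_entry(&q->pending, struct req, entry);
        list_del(&(*reqp)->entry);
        spin_unlock(&q->lock);

        return 0;
}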

> Looks like the other user - the USB gadget HID thing - also has this
> buggy pattern of believing the return value, and losing a wakeup
> event.
> 
> Adding Miklos and Felipe to the cc just because of the fuse and USB
> gadget potential issues, but this is mainly a scheduler maintainer
> question.
> 
> It's possible that I've misread the wait-event code. PeterZ?

I think your analysis is correct, and I think the statistical evidence is 
also overwhelming that the interface is buggy: if we include your new 
usecase then 3 out of 3 attempts got it wrong. :-)

So I'd argue we should fix the interface and allow the 'simple' use of 
reliable interruptible-exclusive waitqueues - i.e. reintroduce the 
abort_exclusive_wait() logic?
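
Concretely, the signal_pending_state() branch of prepare_to_wait_event() 
could become something like the following. This is a rough, untested sketch 
just to show the direction (the 'key' that abort_exclusive_wait() used to 
pass along is simply NULL here):

        if (signal_pending_state(state, current)) {
                if (!list_empty(&wq_entry->entry)) {
                        /* Not woken yet: just take ourselves off the list. */
                        list_del_init(&wq_entry->entry);
                } else if ((wq_entry->flags & WQ_FLAG_EXCLUSIVE) &&
                           waitqueue_active(wq_head)) {
                        /*
                         * The wakeup already removed us: pass the exclusive
                         * wakeup on to the next waiter, the way the old
                         * abort_exclusive_wait() did.
                         */
                        __wake_up_locked_key(wq_head, TASK_NORMAL, NULL);
                }
                ret = -ERESTARTSYS;
        }

That keeps the -ERESTARTSYS return, but a consumed exclusive wakeup would 
no longer be silently dropped.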

Thanks,

	Ingo


Thread overview: 13+ messages
2019-12-08 21:12 Fundamental race condition in wait_event_interruptible_exclusive() ? Linus Torvalds
2019-12-09  9:18 ` Ingo Molnar [this message]
2019-12-09 10:27   ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Ingo Molnar
2019-12-09 13:00     ` Oleg Nesterov
2019-12-10  7:21       ` Ingo Molnar
2019-12-10 19:19         ` [PATCH] sched/wait: fix ___wait_var_event(exclusive) Oleg Nesterov
2019-12-17 12:39           ` [tip: sched/core] " tip-bot2 for Oleg Nesterov
2019-12-09 18:06     ` [RFC PATCH] sched/wait: Make interruptible exclusive waitqueue wakeups reliable Linus Torvalds
2019-12-09 12:08   ` Oleg Nesterov
2019-12-10  7:29     ` Ingo Molnar
2019-12-10 17:30       ` Oleg Nesterov
2019-12-09 17:38 ` Fundamental race condition in wait_event_interruptible_exclusive() ? Oleg Nesterov
2019-12-09 18:03   ` Linus Torvalds
