From: Linus Torvalds <torvalds@linux-foundation.org>
To: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Felipe Balbi <balbi@kernel.org>
Subject: Fundamental race condition in wait_event_interruptible_exclusive() ?
Date: Sun, 8 Dec 2019 13:12:59 -0800	[thread overview]
Message-ID: <CAHk-=whiKy63tpFVUUS1sH07ce692rKcoo0ztnHw5UaPaMg8Ng@mail.gmail.com> (raw)

So I'm looking at the thundering herd issue with pipes, and I have a
truly monumentally silly benchmark for it:

        #include <unistd.h>

        int main(int argc, char **argv)
        {
                int fd[2], counters[2];

                pipe(fd);
                counters[0] = 0;
                counters[1] = -1;
                write(fd[1], counters, sizeof(counters));

                /* 64 processes */
                fork(); fork(); fork(); fork(); fork(); fork();

                do {
                        int i;
                        read(fd[0], &i, sizeof(i));
                        if (i < 0)
                                continue;
                        counters[0] = i+1;
                        write(fd[1], counters, (1 + (i & 1)) * sizeof(int));
                } while (counters[0] < 1000000);
                return 0;
        }

which basically passes an integer token around (with "-1" meaning
"ignore"; it gets thrown in occasionally as a test for partial
reads).

It passes that counter token around a million times, and every other
time it also writes the "-1" filler, so in a perfect world you'd get
1.5 million scheduling events (a million token passes plus half a
million filler writes), because that's how many write() -> read()
pairings there are.

Even in a perfect world there will be a few extra scheduling events
because the whole "finish the pipeline" waits for _everybody_ to see
that one million number, so the counter token gets passed around a bit
more than exactly a million times, but that's in the noise.

In reality, I get many more than that, because the write will wake up
all the readers, even if only one (or two) get any values. You can see
it easily with "perf stat" or whatever.
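For example (hedging a bit: the sched:sched_wakeup tracepoint needs
tracing permissions, and "pipe-bench" is just a stand-in name for the
binary built from the code above):

        $ perf stat -e context-switches,sched:sched_wakeup ./pipe-bench

The wakeup count should come out well above the ~1.5 million
write() -> read() pairings computed above.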

Whatever, that's not the important part. The benchmark is obviously
entirely artificial and not that interesting; it's just there to show
a point.

I do have a tentative patch to fix it - use separate reader and writer
wake queues, and exclusive waits.
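As a minimal sketch of that idea (not the actual patch;
data_available() is a stand-in for the real "pipe has data" check):

        /* one wait queue per direction, instead of one shared queue */
        static DECLARE_WAIT_QUEUE_HEAD(rd_wait);
        static DECLARE_WAIT_QUEUE_HEAD(wr_wait);   /* writers wait here,
                                                    * symmetrically */

        /* reader side: queue as an exclusive waiter, so each wake_up()
         * wakes one sleeping reader instead of the whole herd */
        static int reader_wait(void)
        {
                return wait_event_interruptible_exclusive(rd_wait,
                                data_available());
        }

        /* writer side, after making data available: this wakes at
         * most one exclusive waiter on rd_wait */
        static void writer_wake(void)
        {
                wake_up_interruptible(&rd_wait);
        }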

But when I was looking at using exclusive waits, I noticed an issue.
It looks like wait_event_interruptible_exclusive() is fundamentally
untrustworthy.

What I'd _like_ to do is basically something like this in pipe_read():

        .. read the data, pipe is now empty ..
        __pipe_unlock();
        .. wake writers if the pipe used to be full ..

        // Wait for more data
        err = wait_event_interruptible_exclusive(pipe->rd_wait,
                        pipe_readable());
        if (err)
                return err;

        // Remember that we need to wake up the next reader
        // if we don't consume everything
        wake_remaining_readers = true;

        __pipe_lock();
        .. loop back and read the data ..

but it looks like the above pattern is actually buggy (note that it
omits the parts at the end where we actually wake the next reader if
we didn't empty the pipe; I'm just describing the waiting part).

The reason it is buggy is that wait_event_interruptible_exclusive()
does this (inside the ___wait_event() macro that it expands to):

                long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
                                                                        \
                if (condition)                                          \
                        break;                                          \
                                                                        \
                if (___wait_is_interruptible(state) && __int) {         \
                        __ret = __int;                                  \
                        goto __out;                                     \
                }                                                       \
and the thing is, if it does that "__ret = __int" case and returns
-ERESTARTSYS, it's possible that the wakeup event has already been
consumed, because we added ourselves as an exclusive waiter to the
queue. So it _says_ it was interrupted, not woken up, and the wait got
cancelled, but because we were an exclusive waiter, we might be the
_only_ thing that got woken up, and the wakeup basically got forgotten
- all the other exclusive waiters will remain waiting.
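Spelling out the scenario I'm worried about (my reading of the code,
so take this as a sketch; A and B are exclusive waiters on
pipe->rd_wait):

        /*
         * writer:    fill the pipe, wake_up()
         *              -> exclusive wait: only reader A is woken,
         *                 reader B stays asleep
         * reader A:  a signal is pending, so the wait path reports
         *              -ERESTARTSYS and the caller just returns
         *
         * If the caller trusts that return value and bails out without
         * looking at the pipe, the one wakeup the writer issued has
         * been eaten: the data sits there and B is never woken.
         */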

Yes, yes, I can make the pipe_read() logic ignore the return value
entirely, always take the pipe lock, and always loop back; the reading
code will then check for signal_pending() there, do the
wake_remaining_readers thing if there is still data to be read, and
just do this instead:

        wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable());

        // Remember that we need to wake up the next reader
        // if we don't consume everything
        wake_remaining_readers = true;

        __pipe_lock();
        .. loop back and read the data ..

but that's kind of sad. And the basic point is that the return value
from wait_event_interruptible_exclusive() doesn't seem to be reliable.
You can't really use it - even if it says you got interrupted, you
still have to go back and check the condition and do the work, and
only do the interruptibility handling after that.
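So the only safe shape seems to be something like this (again a
sketch, with pipe_readable() and the locking helpers as above):

        for (;;) {
                int err;

                err = wait_event_interruptible_exclusive(pipe->rd_wait,
                                pipe_readable());
                __pipe_lock();
                if (pipe_readable())
                        break;          /* there is data: ignore err */
                __pipe_unlock();
                if (err)
                        return err;     /* nothing pending, really bail */
        }
        wake_remaining_readers = true;
        .. read the data under the lock ..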

I dunno. Maybe this is fundamental to the interface? We do not have a
lot of users that mix "interruptible" and "exclusive". In fact, I see
only two cases that care about the return value, but at least the fuse
one does seem to have exactly this problem: potentially lost wakeups,
because the -ERESTARTSYS case ends up eating a wakeup without doing
anything about it.

Looks like the other user - the USB gadget HID thing - also has this
buggy pattern of believing the return value and losing a wakeup
event.
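Schematically, the pattern in question is just (a sketch, not the
actual fuse or gadget code):

        err = wait_event_interruptible_exclusive(wq, condition);
        if (err)
                return err;     /* may have eaten an exclusive wakeup */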

Adding Miklos and Felipe to the cc just because of the fuse and USB
gadget potential issues, but this is mainly a scheduler maintainer
question.

It's possible that I've misread the wait-event code. PeterZ?

                 Linus

