All of lore.kernel.org
 help / color / mirror / Atom feed
* Indefinitely sleeping task in poll_schedule_timeout()
@ 2022-01-04 12:44 Jan Kara
  2022-01-06 18:41 ` Linus Torvalds
  0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2022-01-04 12:44 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 1866 bytes --]

Hello!

I have been scratching my head over a crashdump where a task is sleeping
in do_select() -> poll_schedule_timeout() indefinitely although there are
things to read from a fd the select was run on. The oddity triggering this
is that the fd is no longer open in the task's files_struct. Another
unusual thing is that the file select(2) is running on was actually a
fsnotify_group but that particular thing does not seem to be substantial.
So what I think happened is the following race:

TASK1 (thread1)             TASK2                       TASK1 (thread2)
do_select()                      
  setup poll_wqueues table
                            generate fanotify event
                              wake_up(group->notification_waitq)
                                pollwake()
                                  table->triggered = 1
                                                        closes fd thread1 is
                                                          waiting for
  poll_schedule_timeout()
    - sees table->triggered
    table->triggered = 0
    return -EINTR
  loop back in do_select() but fdget() in the setup of poll_wqueues fails
now so we never find fanotify group's fd is ready for reading and sleep in
poll_schedule_timeout() indefinitely.

Arguably the application is doing something stupid (waiting for fd to
become readable while closing it) and it gets what it deserves but the fact
that do_select() holds the last file reference makes the outcome somewhat
unexpected - normally, ->release() would cleanup everything and writers
would know the file is dead (e.g. fanotify group would get torn down) but
that does not happen until do_select() calls poll_freewait() which never
happens...

So maybe something like attached patch (boot tested only so far)? What you
do think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-select-Fix-indefinitely-sleeping-task-in-poll_schedu.patch --]
[-- Type: text/x-patch, Size: 2287 bytes --]

From ae2a849f54b18568037fd27d0e83b3068cd3b292 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 4 Jan 2022 13:17:21 +0100
Subject: [PATCH] select: Fix indefinitely sleeping task in
 poll_schedule_timeout()

A task can end up indefinitely sleeping in do_select() ->
poll_schedule_timeout() when the following race happens:

TASK1 (thread1)             TASK2                       TASK1 (thread2)
do_select()
  setup poll_wqueues table
  with 'fd'
                            write data to 'fd'
                              pollwake()
                                table->triggered = 1
                                                        closes 'fd' thread1 is
                                                          waiting for
  poll_schedule_timeout()
    - sees table->triggered
    table->triggered = 0
    return -EINTR
  loop back in do_select() but fdget() in the setup of poll_wqueues
fails now so we never find 'fd' is ready for reading and sleep in
poll_schedule_timeout() indefinitely.

Make sure we return -EBADF from do_select() when we spot file we cannot
get anymore. This is the same behavior as when not open fd is passed to
select(2) from the start.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/select.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/select.c b/fs/select.c
index 945896d0ac9e..f839adf283ae 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -505,6 +505,7 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
 	for (;;) {
 		unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
 		bool can_busy_loop = false;
+		bool bad_fd = false;
 
 		inp = fds->in; outp = fds->out; exp = fds->ex;
 		rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
@@ -561,6 +562,8 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
 					} else if (busy_flag & mask)
 						can_busy_loop = true;
 
+				} else {
+					bad_fd = true;
 				}
 			}
 			if (res_in)
@@ -578,6 +581,10 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
 			retval = table.error;
 			break;
 		}
+		if (bad_fd) {
+			retval = -EBADF;
+			break;
+		}
 
 		/* only if found POLL_BUSY_LOOP sockets && not out of time */
 		if (can_busy_loop && !need_resched()) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Indefinitely sleeping task in poll_schedule_timeout()
  2022-01-04 12:44 Indefinitely sleeping task in poll_schedule_timeout() Jan Kara
@ 2022-01-06 18:41 ` Linus Torvalds
  2022-01-06 19:51   ` Linus Torvalds
  2022-01-07 18:20   ` Jan Kara
  0 siblings, 2 replies; 4+ messages in thread
From: Linus Torvalds @ 2022-01-06 18:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Al Viro, linux-fsdevel

On Thu, Jan 6, 2022 at 7:45 AM Jan Kara <jack@suse.cz> wrote:
>
> So maybe something like attached patch (boot tested only so far)? What you
> do think?

I don't think the patch is wrong, but it looks like we've actually
never reacted to bad file descriptors, so I get the strong feeling
that the patch might break existing code that has worked by accident
before.

And the breakage might end up very subtle - select() randomly now
returning EBADFD if you have a race with somebody closing the fd that
previously was benign and silent. Which is going to be hell to debug
because it's some sporadic user-space race with close and select at
the same time, and it might easily actually break user space because
somebody just takes "error from select" to be fatal and asserts it.

What do_pollfd() does if there's no file associated with the fd is to
use EPOLLNVAL (an actual negative fd is ignored):

        mask = EPOLLNVAL;
        f = fdget(fd);
        if (!f.file)
                goto out;

and I wonder if we should try something equivalent for select() as a
slightly less drastic change.

Of course, select() doesn't actually have EPOLLNVAL, but it does have
EPOLLERR and EPOLLPRI, which would set all the bits for that fd.

The patch would be something like this (NOTE NOTE NOTE: this is not
only whitespace-damaged - it was done with "git diff -w" to not show
the whitespace change. Look at how it changes the nesting of that "if
(f.file)" thing, and it needs to change the indentation of all those
subsequent if-statements that check the mask. I did it this way just
to show the logic change, a real patch would be much bigger due to the
whitespace change).

    diff --git a/fs/select.c b/fs/select.c
    index 945896d0ac9e..d77f8a614ad4 100644
    --- a/fs/select.c
    +++ b/fs/select.c
    @@ -528,12 +528,14 @@ static int do_select(int n, fd_set_bits *fds,
                    if (!(bit & all_bits))
                            continue;
                    f = fdget(i);
    +               mask = EPOLLERR | EPOLLPRI;
                    if (f.file) {
                            wait_key_set(wait, in, out, bit,
                                         busy_flag);
                            mask = vfs_poll(f.file, wait);

                            fdput(f);
    +               }
                    if ((mask & POLLIN_SET) && (in & bit)) {
                            res_in |= bit;
                            retval++;
    @@ -560,8 +562,6 @@ static int do_select(int n, fd_set_bits *fds,
                     */
                    } else if (busy_flag & mask)
                            can_busy_loop = true;
    -
    -               }
            }
            if (res_in)
                    *rinp = res_in;


That said: while I don't think your patch is wrong, if we do want to
do that EBADF thing, I think your patch is still very very ugly. You
are adding this new 'bad_fd' variable for no good reason.

We already have an error variable: 'table.error'.

So a *minimal* patch that does what you suggest is to just do something like

    diff --git a/fs/select.c b/fs/select.c
    index 945896d0ac9e..7a06d28ec83d 100644
    --- a/fs/select.c
    +++ b/fs/select.c
    @@ -528,7 +528,9 @@ static int do_select(int n, fd_set_bits *fds,
struct timespec64 *end_time)
                    if (!(bit & all_bits))
                            continue;
                    f = fdget(i);
    -               if (f.file) {
    +               if (unlikely(!f.file)) {
    +                       table.error = -EBADF;
    +               } else {
                            wait_key_set(wait, in, out, bit,
                                         busy_flag);
                            mask = vfs_poll(f.file, wait);

(which is again whitespace-damaged, but this time that's just to make
it legible inline - there's no indentation change).

Hmm? I'd be ok with either of these (the second of these patches
_should_ be equivalent to yours), but I think that first one might be
the one that might cause less potential user-space breakage.

... and neither of the above whitespace-damaged patches are *tested*
in any way. I might have completely screwed something up.

Comments? Anybody?

               Linus

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Indefinitely sleeping task in poll_schedule_timeout()
  2022-01-06 18:41 ` Linus Torvalds
@ 2022-01-06 19:51   ` Linus Torvalds
  2022-01-07 18:20   ` Jan Kara
  1 sibling, 0 replies; 4+ messages in thread
From: Linus Torvalds @ 2022-01-06 19:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: Al Viro, linux-fsdevel

On Thu, Jan 6, 2022 at 10:41 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> What do_pollfd() does if there's no file associated with the fd is to
> use EPOLLNVAL (an actual negative fd is ignored):
>
>         mask = EPOLLNVAL;
>         f = fdget(fd);
>         if (!f.file)
>                 goto out;
>
> and I wonder if we should try something equivalent for select() as a
> slightly less drastic change.
>
> Of course, select() doesn't actually have EPOLLNVAL, but it does have
> EPOLLERR and EPOLLPRI, which would set all the bits for that fd.

Actually, it would probably be better to actually use EPOLLNVAL
exactly to be consistent with epoll() on a code level, and just also
add that bit to the POLL{IN,OUT,ERR}_SET collections, so that an fd
that has been closed will set all the in/out/ex bits.

I suspect that not only is the most consistent we can be with
'poll()', putting a bit in the output bitmap is less likely to break
existing apps, and is at least a useful situation: anybody who then
walks the output bitmaps to see what's going on will get EBADF when
actually trying to read/write from the file descriptor.

So it's actually _useful_ semantics, although obviously nobody can
rely on it (since we've never set those bits before, and presumably no
other unix does either).

The advantage of the EBADF return value is obviously that it's
ostensibly the portable thing to do, but considering that we've never
actually done that, I really do get the feeling that it's very likely
to break applications that do "select()" in one thread, and do
concurrent file management in another.

Regardless, I feel either change might cause subtle enough user
semantic breakage that I'd be inclined to leave this for 5.17, and
back-port it to stable after we've seen more testing, rather than
apply it this late in the development series.

                Linus

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Indefinitely sleeping task in poll_schedule_timeout()
  2022-01-06 18:41 ` Linus Torvalds
  2022-01-06 19:51   ` Linus Torvalds
@ 2022-01-07 18:20   ` Jan Kara
  1 sibling, 0 replies; 4+ messages in thread
From: Jan Kara @ 2022-01-07 18:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jan Kara, Al Viro, linux-fsdevel

On Thu 06-01-22 10:41:27, Linus Torvalds wrote:
> On Thu, Jan 6, 2022 at 7:45 AM Jan Kara <jack@suse.cz> wrote:
> >
> > So maybe something like attached patch (boot tested only so far)? What you
> > do think?
> 
> I don't think the patch is wrong, but it looks like we've actually
> never reacted to bad file descriptors, so I get the strong feeling
> that the patch might break existing code that has worked by accident
> before.

Well, max_select_fd() returns -EBADF in case some file descriptor is wrong.
But yes, once this initial test passes we didn't care about bad fd. So if
the timing used to work out so that fd was never closed before the
max_select_fd() check I agree my patch could break such app.

> And the breakage might end up very subtle - select() randomly now
> returning EBADFD if you have a race with somebody closing the fd that
> previously was benign and silent. Which is going to be hell to debug
> because it's some sporadic user-space race with close and select at
> the same time, and it might easily actually break user space because
> somebody just takes "error from select" to be fatal and asserts it.
> 
> What do_pollfd() does if there's no file associated with the fd is to
> use EPOLLNVAL (an actual negative fd is ignored):
> 
>         mask = EPOLLNVAL;
>         f = fdget(fd);
>         if (!f.file)
>                 goto out;
> 
> and I wonder if we should try something equivalent for select() as a
> slightly less drastic change.
> 
> Of course, select() doesn't actually have EPOLLNVAL, but it does have
> EPOLLERR and EPOLLPRI, which would set all the bits for that fd.
> 
> The patch would be something like this (NOTE NOTE NOTE: this is not
> only whitespace-damaged - it was done with "git diff -w" to not show
> the whitespace change. Look at how it changes the nesting of that "if
> (f.file)" thing, and it needs to change the indentation of all those
> subsequent if-statements that check the mask. I did it this way just
> to show the logic change, a real patch would be much bigger due to the
> whitespace change).
> 
>     diff --git a/fs/select.c b/fs/select.c
>     index 945896d0ac9e..d77f8a614ad4 100644
>     --- a/fs/select.c
>     +++ b/fs/select.c
>     @@ -528,12 +528,14 @@ static int do_select(int n, fd_set_bits *fds,
>                     if (!(bit & all_bits))
>                             continue;
>                     f = fdget(i);
>     +               mask = EPOLLERR | EPOLLPRI;
>                     if (f.file) {
>                             wait_key_set(wait, in, out, bit,
>                                          busy_flag);
>                             mask = vfs_poll(f.file, wait);
> 
>                             fdput(f);
>     +               }
>                     if ((mask & POLLIN_SET) && (in & bit)) {
>                             res_in |= bit;
>                             retval++;
>     @@ -560,8 +562,6 @@ static int do_select(int n, fd_set_bits *fds,
>                      */
>                     } else if (busy_flag & mask)
>                             can_busy_loop = true;
>     -
>     -               }
>             }
>             if (res_in)
>                     *rinp = res_in;
> 

OK, nice, I think this would work and probably has lower chance of breaking
someone.

> That said: while I don't think your patch is wrong, if we do want to
> do that EBADF thing, I think your patch is still very very ugly. You
> are adding this new 'bad_fd' variable for no good reason.
> 
> We already have an error variable: 'table.error'.
> 
> So a *minimal* patch that does what you suggest is to just do something like
> 
>     diff --git a/fs/select.c b/fs/select.c
>     index 945896d0ac9e..7a06d28ec83d 100644
>     --- a/fs/select.c
>     +++ b/fs/select.c
>     @@ -528,7 +528,9 @@ static int do_select(int n, fd_set_bits *fds,
> struct timespec64 *end_time)
>                     if (!(bit & all_bits))
>                             continue;
>                     f = fdget(i);
>     -               if (f.file) {
>     +               if (unlikely(!f.file)) {
>     +                       table.error = -EBADF;
>     +               } else {
>                             wait_key_set(wait, in, out, bit,
>                                          busy_flag);
>                             mask = vfs_poll(f.file, wait);
> 
> (which is again whitespace-damaged, but this time that's just to make
> it legible inline - there's no indentation change).

Yup, nicer.

> Hmm? I'd be ok with either of these (the second of these patches
> _should_ be equivalent to yours), but I think that first one might be
> the one that might cause less potential user-space breakage.

Agreed. I'll code it and give it some testing. 

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-01-07 18:20 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-04 12:44 Indefinitely sleeping task in poll_schedule_timeout() Jan Kara
2022-01-06 18:41 ` Linus Torvalds
2022-01-06 19:51   ` Linus Torvalds
2022-01-07 18:20   ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.