* [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
       [not found] <1628086770.5rn8p04n6j.none.ref@localhost>
@ 2021-08-04 15:37 ` Alex Xu (Hello71)
  2021-08-04 16:31   ` Linus Torvalds
  0 siblings, 1 reply; 4+ messages in thread
From: Alex Xu (Hello71) @ 2021-08-04 15:37 UTC (permalink / raw)
  To: linux-kernel, dhowells, acrichton
  Cc: torvalds, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-block, linux-security-module, linux-fsdevel, linux-api

Hi,

An issue, "Jobserver hangs due to full pipe", was recently reported 
against Cargo, the Rust package manager. It was diagnosed as pipe 
writes hanging under certain circumstances.

Specifically, if two or more threads simultaneously write to a pipe, it 
is possible for all the writers to hang despite there being significant 
space available in the pipe.

I have translated the Rust example to C with some small adjustments:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

void *thread_start(void *arg) {
    char buf[1];
    for (int i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        write(pipefd[1], buf, sizeof(buf));
    }
    puts("done");
    return NULL;
}

int main() {
    pipe(pipefd);
    printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
    printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
    write(pipefd[1], "aa", 2);
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread_start, NULL);
    pthread_create(&thread2, NULL, thread_start, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
}

The expected behavior of this program is to print:

init buffer: 65536
new buffer:  4096
done
done

and then exit.

On Linux 5.14-rc4, compiling this program and running it will print the 
following about half the time:

init buffer: 65536
new buffer:  4096
done

and then hang. This is unexpected behavior, since the pipe is at most 
two bytes full at any given time.

/proc/x/stack shows that the remaining thread is hanging at pipe.c:560. 
It looks like there needs to be not only space in the pipe, but also a 
free slot. At pipe.c:1306, a one-page pipe has only one slot. This led 
me to test nthreads=2, which also hangs. Checking the blame for the 
pipe_write comment shows it was added in a194dfe, which says, among 
other things:

> We just abandon the preallocated slot if we get a copy error.  Future
> writes may continue it and a future read will eventually recycle it.

This matches the observed behavior: at that point nothing ever reads 
from the pipe again, so the abandoned slot is never recycled.
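
For what it's worth, the thread race is not even needed to see this. A 
single-threaded variation (a sketch, not carefully tested) keeps one 
byte in the pipe at all times; with O_NONBLOCK, the hang shows up as 
EAGAIN once the appends reach the end of the page:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main() {
    int pipefd[2];
    char buf[1];

    pipe2(pipefd, O_NONBLOCK);
    fcntl(pipefd[1], F_SETPIPE_SZ, 0); // one page, i.e. one slot
    write(pipefd[1], "aa", 2);         // the slot is never fully drained

    for (int i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        if (write(pipefd[1], buf, sizeof(buf)) == -1) {
            // expected to fail around i = 4094, with only one byte queued
            printf("write %d failed: %s\n", i, strerror(errno));
            return 1;
        }
    }
    puts("done");
    return 0;
}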

In my opinion (as expressed on the issue), the pipe is being misused 
here. As explained in the pipe(7) manual page:

> Applications should not rely on a particular capacity: an application 
> should be designed so that a reading process consumes data as soon as 
> it is available, so that a writing process does not remain blocked.

Despite the misuse, I am reporting this for the following reasons:

1. I am reasonably confident that this is a regression in the kernel, 
   which holds itself to a standard of making reasonable efforts to 
   maintain backwards compatibility, even with broken programs.

2. Even if this is not a regression, it seems like this situation could 
   be handled somewhat more gracefully. In this case, we are not writing 
   4095 bytes and then expecting a one-byte write to succeed; the pipe 
   is actually almost entirely empty.

3. Pipe sizes shrink dynamically in Linux, so even though this case is 
   unlikely to occur when two or more slots are available, a program 
   which never explicitly asks for a one-page pipe buffer may still 
   wind up with one if the user already has 1024 or more pipes open. 
   This significantly exacerbates the next point:

4. GNU make's jobserver uses pipes in a similar manner. By my reading 
   of the paper, it is theoretically possible for N simultaneous writes 
   to occur without any readers, where N is the maximum number of 
   concurrent jobs permitted (see the sketch below).

   Consider the following example with make -j2. Two compile jobs are 
   to be performed: one at the top level, and one in a sub-directory. 
   The top-level make invokes one make and one cc, costing two tokens. 
   The sub-make invokes one cc with its free token. The pipe is now 
   empty. Now, suppose the two compilers return at exactly the same 
   time. Both copies of make will attempt to write a token to the pipe 
   simultaneously. This does not yet trigger a deadlock: at least one 
   write will always succeed on an empty pipe. Suppose the sub-make's 
   write goes through. It then exits. The top-level make, however, is 
   still blocked on its original write, since it was not successfully 
   merged with the other write. The build is now deadlocked.

   I think the only reason this does not happen in practice is a 
   coincidental design decision: when the sub-make exits, the top-level 
   make receives a SIGCHLD. GNU make registers an SA_RESTART handler 
   for SIGCHLD, so the write will be interrupted and restarted. This is 
   only a coincidence, however: the program does not actually expect 
   writing to the control pipe to ever block; it could just as well 
   de-register the signal handler while performing the write and still 
   be fully correct.
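
To make the token protocol concrete, each make process essentially does 
the following with the jobserver pipe (a simplified model with 
hypothetical helper names, not GNU make's actual code):

#include <unistd.h>

// Block until some other job finishes and returns its token.
static int acquire_token(int jobserver_rfd, char *token) {
    return read(jobserver_rfd, token, 1) == 1 ? 0 : -1;
}

// Put the token back. This write is assumed never to block: at most N
// tokens are ever in flight, far fewer than any sane pipe capacity. A
// pipe silently shrunk to a single slot breaks that assumption.
static int release_token(int jobserver_wfd, char token) {
    return write(jobserver_wfd, &token, 1) == 1 ? 0 : -1;
}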

Regards,
Alex.


* Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
  2021-08-04 15:37 ` [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang Alex Xu (Hello71)
@ 2021-08-04 16:31   ` Linus Torvalds
  2021-08-04 19:48     ` Alex Xu (Hello71)
  0 siblings, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2021-08-04 16:31 UTC (permalink / raw)
  To: Alex Xu (Hello71)
  Cc: Linux Kernel Mailing List, David Howells, acrichton,
	Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	Nicolas Dichtel, Ian Kent, Christian Brauner, keyrings,
	linux-usb, linux-block, LSM List, linux-fsdevel, Linux API

Your program is buggy.

On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu@yahoo.ca> wrote:
>
>     pipe(pipefd);
>     printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>     printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));

Yeah, what did you expect this to do? You said you want a minimal
buffer, you get a really small buffer.

Then you try to write multiple messages to the pipe that you just said
should have a minimum size.

Don't do that then.

> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
> It looks like there needs to be not only space in the pipe, but also
> a free slot.

Correct. The fullness of a pipe is not about whether it has the
possibility of merging more bytes into an existing not-full slot, but
about whether it has empty slots left.

Part of that is simply the POSIX pipe guarantees - a write of size
PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
among buffers.

So a pipe must not be "writable" unless it has space for at least that
much (think select/poll, which don't know the size of the write).

The fact that we might be able to reuse a partially filled buffer for
smaller writes is simply not relevant to that issue.
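
For reference, the "full" check is (roughly) what pipe_occupancy() and
pipe_full() do in include/linux/pipe_fs_i.h - it's a slot count, and
knows nothing about bytes:

  static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
  {
          return head - tail;
  }

  static inline bool pipe_full(unsigned int head, unsigned int tail,
                               unsigned int limit)
  {
          return pipe_occupancy(head, tail) >= limit;
  }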

And yes, we could have different measures of "could write" for
different writes, but we just don't have or want that complexity.

Please don't mess with F_SETPIPE_SZ unless you have a really good
reason to do so, and actually understand what you are doing.

Doing a F_SETPIPE_SZ, 0 basically means "I want the minimum pipe size
possible". And that one accepts exactly one write at a time.

Of course, the exact semantics are much more complicated than that
"exactly one write". The pipe write code will optimistically merge
writes into a previous buffer, which means that depending on the
pattern of your writes, the exact number of bytes you can write will
be very different.

But that "merge writes into a previous buffer" only appends to the
buffer - not _reuse_ it - so when each buffer is one page in size,
what happens is that you can merge up to 4096 bytes worth of writes,
but then after that the pipe write will want a new buffer - even if
the old buffer is now empty because of old reads.
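
As a simplified model (not the actual pipe_write() code, just the shape
of it): a slot is a window into one page, appends only ever advance the
end of that window, and the space freed at the front by reads is never
reused:

  struct slot { unsigned int offset, len; }; /* data = page[offset .. offset+len) */

  /* a small write can merge only if it fits after the current end */
  static int can_append(const struct slot *s, unsigned int chars)
  {
          return s->offset + s->len + chars <= 4096;
  }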

That's why your test program won't block immediately: both writers
will actually start out happily doing writes into that one buffer that
is allocated, but at some point that buffer ends, and it wants to
allocate a new buffer.

But you told it not to allocate more buffers, and the old buffer is
never completely empty because your readers never read _everything_,
so it will hang, waiting for you to empty the one minimal buffer it
allocated. And that will never happen.

There's a very real reason why we do *not* by default say "pipes can
only ever use only one buffer".

I don't think this is a regression, but if you have an actual
application - not a test program - that does crazy things like this
and used to work (I'm not sure it has ever worked, though), we can
look into making it work again.

That said, I suspect the way to make it work is to just say "the
minimum pipe size is two slots" rather than change the "we want at
least one empty slot". Exactly because of that whole "look, we must
not consider a pipe that doesn't have a slot writable".

Because clearly people don't understand how subtle F_SETPIPE_SZ is.
It's not really a "byte count", even though that is how it's
expressed.

                   Linus


* Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
  2021-08-04 16:31   ` Linus Torvalds
@ 2021-08-04 19:48     ` Alex Xu (Hello71)
  2021-08-04 20:04       ` Linus Torvalds
  0 siblings, 1 reply; 4+ messages in thread
From: Alex Xu (Hello71) @ 2021-08-04 19:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: acrichton, Christian Brauner, David Howells, Greg Kroah-Hartman,
	keyrings, Linux API, linux-block, linux-fsdevel,
	Linux Kernel Mailing List, Rasmus Villemoes, LSM List, linux-usb,
	Nicolas Dichtel, Peter Zijlstra, Ian Kent

Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm:
> Your program is buggy.
> 
> On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu@yahoo.ca> wrote:
>>
>>     pipe(pipefd);
>>     printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>>     printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
> 
> Yeah, what did you expect this to do? You said you want a minimal
> buffer, you get a really small buffer.
> 
> Then you try to write multiple messages to the pipe that you just said
> should have a minimum size.
> 
> Don't do that then.
> 
>> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
>> It looks like there needs to be not only space in the pipe, but also
>> a free slot.
> 
> Correct. The fullness of a pipe is not about whether it has the
> possibility of merging more bytes into an existing not-full slot, but
> about whether it has empty slots left.
> 
> Part of that is simply the POSIX pipe guarantees - a write of size
> PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
> among buffers.
> 
> So a pipe must not be "writable" unless it has space for at least that
> much (think select/poll, which don't know the size of the write).
> 
> The fact that we might be able to reuse a partially filled buffer for
> smaller writes is simply not relevant to that issue.
> 
> And yes, we could have different measures of "could write" for
> different writes, but we just don't have or want that complexity.
> 
> Please don't mess with F_SETPIPE_SZ unless you have a really good
> reason to do so, and actually understand what you are doing.
> 
> Doing a F_SETPIPE_SZ, 0 basically means "I want the minimum pipe size
> possible". And that one accepts exactly one write at a time.
> 
> Of course, the exact semantics are much more complicated than that
> "exactly one write". The pipe write code will optimistically merge
> writes into a previous buffer, which means that depending on the
> pattern of your writes, the exact number of bytes you can write will
> be very different.
> 
> But that "merge writes into a previous buffer" only appends to the
> buffer - not _reuse_ it - so when each buffer is one page in size,
> what happens is that you can merge up to 4096 bytes worth of writes,
> but then after that the pipe write will want a new buffer - even if
> the old buffer is now empty because of old reads.
> 
> That's why your test program won't block immediately: both writers
> will actually start out happily doing writes into that one buffer that
> is allocated, but at some point that buffer ends, and it wants to
> allocate a new buffer.
> 
> But you told it not to allocate more buffers, and the old buffer is
> never completely empty because your readers never read _everything_,
> so it will hang, waiting for you to empty the one minimal buffer it
> allocated. And that will never happen.
> 
> There's a very real reason why we do *not* by default say "pipes can
> only ever use only one buffer".
> 
> I don't think this is a regression, but if you have an actual
> application - not a test program - that does crazy things like this
> and used to work (I'm not sure it has ever worked, though), we can
> look into making it work again.
> 
> That said, I suspect the way to make it work is to just say "the
> minimum pipe size is two slots" rather than change the "we want at
> least one empty slot". Exactly because of that whole "look, we must
> not consider a pipe that doesn't have a slot writable".
> 
> Because clearly people don't understand how subtle F_SETPIPE_SZ is.
> It's not really a "byte count", even though that is how it's
> expressed.
> 
>                    Linus
> 

I agree that if this only affected programs which intentionally adjust 
the pipe buffer size, it would not be a huge issue. The problem, 
admittedly buried very close to the bottom of my email, is that the 
kernel will silently hand out one-page pipe buffers once the user has 
more than 16384 pipe buffer pages (the default fs.pipe-user-pages-soft) 
allocated. Try this slightly more complicated program:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

void *thread_start(void *arg) {
    char buf[1];
    int i;
    for (i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        if (write(pipefd[1], buf, sizeof(buf)) == -1)
            break;
    }
    printf("done %d\n", i);
    return NULL;
}

int main() {
    signal(SIGPIPE, SIG_IGN);
    // pretend there are a very large number of make processes running
    for (int i = 0; i < 1025; i++)
    {
        pipe(pipefd);
        write(pipefd[1], "aa", 2);
    }
    printf("%d %d\n", pipefd[0], pipefd[1]);
    printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
    //printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread_start, NULL);
    pthread_create(&thread2, NULL, thread_start, NULL);
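    // give the threads time to hang, then close the pipe to unblock them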
    sleep(3);
    close(pipefd[0]);
    close(pipefd[1]);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
}

You may need to raise your RLIMIT_NOFILE before running the program.

With default settings (fs.pipe-user-pages-soft = 16384), the init buffer 
will be 4096, no mangling required. I think this could probably be 
solved if the kernel reduced over-quota pipes to two pages rather than 
one. If someone explicitly asks for a one-page pipe buffer, they can 
suffer the consequences, but the kernel shouldn't silently hand people 
that footgun.
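
An application that cares could at least detect the degraded case with 
something like the sketch below (untested). Note that trying to grow 
the pipe back with F_SETPIPE_SZ will usually just fail with EPERM here, 
precisely because the one-page fallback only happens once the user is 
already over fs.pipe-user-pages-soft:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

// Warn if the kernel silently gave this pipe a single-page buffer.
static void warn_if_tiny_pipe(int fd) {
    long sz = fcntl(fd, F_GETPIPE_SZ);
    if (sz > 0 && sz <= 4096)
        fprintf(stderr, "pipe fd %d has only a single-page buffer\n", fd);
}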

Regards,
Alex.


* Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
  2021-08-04 19:48     ` Alex Xu (Hello71)
@ 2021-08-04 20:04       ` Linus Torvalds
  0 siblings, 0 replies; 4+ messages in thread
From: Linus Torvalds @ 2021-08-04 20:04 UTC (permalink / raw)
  To: Alex Xu (Hello71)
  Cc: acrichton, Christian Brauner, David Howells, Greg Kroah-Hartman,
	keyrings, Linux API, linux-block, linux-fsdevel,
	Linux Kernel Mailing List, Rasmus Villemoes, LSM List, linux-usb,
	Nicolas Dichtel, Peter Zijlstra, Ian Kent

On Wed, Aug 4, 2021 at 12:48 PM Alex Xu (Hello71) <alex_y_xu@yahoo.ca> wrote:
>
> I agree that if this only affected programs which intentionally adjust
> the pipe buffer size, it would not be a huge issue. The problem,
> admittedly buried very close to the bottom of my email, is that the
> kernel will silently hand out one-page pipe buffers once the user has
> more than 16384 pipe buffer pages (the default fs.pipe-user-pages-soft)
> allocated.

That's a good point.

That "fall back to a single buffer" case is meant to make things
hobble along if the user has exhausted the pipe buffers, but you're
right that we might want to make that minimum be two buffers.

I didn't test this, but the obvious fix seems to be just increasing
the '1' to '2'.

  @@ -781,8 +784,8 @@ struct pipe_inode_info *alloc_pipe_info(void)
          user_bufs = account_pipe_buffers(user, 0, pipe_bufs);

          if (too_many_pipe_buffers_soft(user_bufs) && pipe_is_unprivileged_user()) {
  -               user_bufs = account_pipe_buffers(user, pipe_bufs, 1);
  -               pipe_bufs = 1;
  +               user_bufs = account_pipe_buffers(user, pipe_bufs, 2);
  +               pipe_bufs = 2;
          }

          if (too_many_pipe_buffers_hard(user_bufs) && pipe_is_unprivileged_user())

although a real patch would need a comment about how a single buffer
is problematic, and probably make the '2' be a #define to not just
repeat the same magic constant silently.

IOW, something like

  /*
   * The general pipe use case needs at least two buffers: one
   * for data yet to be read, and one for new data
   */
  #define DEF_MIN_PIPE_BUFFERS 2

to go with the existing PIPE_DEF_BUFFERS (although the
DEF_MIN_PIPE_BUFFERS use would only be in fs/pipe.c, so I guess it
doesn't actually need to be exposed to anybody else in the pipe.h
header file).

I'd take that patch with some proper testing and a commit message.

Hint hint ;)

                Linus

