archive mirror
 help / color / mirror / Atom feed
From: "Alex Xu (Hello71)" <>
To: Linus Torvalds <>
Cc:, Christian Brauner <>,
	David Howells <>,
	Greg Kroah-Hartman <>,, Linux API <>,
	linux-block <>,
	linux-fsdevel <>,
	Linux Kernel Mailing List <>,
	Rasmus Villemoes <>,
	LSM List <>,,
	Nicolas Dichtel <>,
	Peter Zijlstra <>,
	Ian Kent <>
Subject: Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
Date: Wed, 04 Aug 2021 15:48:36 -0400	[thread overview]
Message-ID: <1628105897.vb3ko0vb06.none@localhost> (raw)
In-Reply-To: <>

Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm:
> Your program is buggy.
> On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <> wrote:
>>     pipe(pipefd);
>>     printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>>     printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
> Yeah, what did you expect this to do? You said you want a minimal
> buffer, you get a really small buffer.
> Then you try to write multiple messages to the pipe that you just said
> should have a minimum size.
> Don't do that then.
>> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
>> It looks like not only there needs to be space in the pipe, but also
>> slots.
> Correct. The fullness of a pipe is not about whether it has the
> possibility of merging more bytes into an existing not-full slot, but
> about whether it has empty slots left.
> Part of that is simply the POSIX pipe guarantees - a write of size
> PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
> among buffers.
> So a pipe must not be "writable" unless it has space for at least that
> much (think select/poll, which don't know the size of the write).
> The fact that we might be able to reuse a partially filled buffer for
> smaller writes is simply not relevant to that issue.
> And yes, we could have different measures of "could write" for
> different writes, but we just don't have or want that complexity.
> Please don't mess with F_SETPIPE_SZ unless you have a really good
> reason to do so, and actually understand what you are doing.
> Doing a F_SETPIPE_SZ, 0 basically means "I want the mimimum pipe size
> possible". And that one accepts exactly one write at a time.
> Of course, the exact semantics are much more complicated than that
> "exactly one write". The pipe write code will optimistically merge
> writes into a previous buffer, which means that depending on the
> pattern of your writes, the exact number of bytes you can write will
> be very different.
> But that "merge writes into a previous buffer" only appends to the
> buffer - not _reuse_ it - so when each buffer is one page in size,
> what happens is that you can merge up to 4096 bytes worth of writes,
> but then after that the pipe write will want a new buffer - even if
> the old buffer is now empty because of old reads.
> That's why your test program won't block immediately: both writers
> will actually start out happily doing writes into that one buffer that
> is allocated, but at some point that buffer ends, and it wants to
> allocate a new buffer.
> But you told it not to allocate more buffers, and the old buffer is
> never completely empty because your readers never read _everythign_,
> so it will hang, waiting for you to empty the one minimal buffer it
> allocated. And that will never happen.
> There's a very real reason why we do *not* by default say "pipes can
> only ever use only one buffer".
> I don't think this is a regression, but if you have an actual
> application - not a test program - that does crazy things like this
> and used to work (I'm not sure it has ever worked, though), we can
> look into making it work again.
> That said, I suspect the way to make it work is to just say "the
> minimum pipe size is two slots" rather than change the "we want at
> least one empty slot". Exactly because of that whole "look, we must
> not consider a pipe that doesn't have a slot writable".
> Because clearly people don't understand how subtle F_SETPIPE_SZ is.
> It's not really a "byte count", even though that is how it's
> expressed.
>                    Linus

I agree that if this only affects programs which intentionally adjust 
the pipe buffer size, then it is not a huge issue. The problem, 
admittedly buried very close to the bottom of my email, is that the 
kernel will silently provide one-page pipe buffers if the user has 
exceeded 16384 (by default) pipe buffer pages allocated. Try this 
slightly more complicated program:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

void *thread_start(void *arg) {
    char buf[1];
    int i;
    for (i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        if (write(pipefd[1], buf, sizeof(buf)) == -1)
    printf("done %d\n", i);
    return NULL;

int main() {
    signal(SIGPIPE, SIG_IGN);
    // pretend there are a very large number of make processes running
    for (int i = 0; i < 1025; i++)
        write(pipefd[1], "aa", 2);
    printf("%d %d\n", pipefd[0], pipefd[1]);
    printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
    //printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread_start, NULL);
    pthread_create(&thread2, NULL, thread_start, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

You may need to raise your RLIMIT_NOFILE before running the program.

With default settings (fs.pipe-user-pages-soft = 16384) the init buffer 
will be 4096, no mangling required. I think this could be probably be 
solved if the kernel instead reduced over-quota pipes to two pages 
instead of one page. If someone wants to set a one-page pipe buffer, 
then they can suffer the consequences, but the kernel shouldn't silently 
hand people that footgun.


  reply	other threads:[~2021-08-04 19:49 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1628086770.5rn8p04n6j.none.ref@localhost>
2021-08-04 15:37 ` [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang Alex Xu (Hello71)
2021-08-04 16:31   ` Linus Torvalds
2021-08-04 19:48     ` Alex Xu (Hello71) [this message]
2021-08-04 20:04       ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1628105897.vb3ko0vb06.none@localhost \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).