Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang

From: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: acrichton@mozilla.com, Christian Brauner <christian@brauner.io>,
	David Howells <dhowells@redhat.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	keyrings@vger.kernel.org, Linux API <linux-api@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	LSM List <linux-security-module@vger.kernel.org>,
	linux-usb@vger.kernel.org,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ian Kent <raven@themaw.net>
Subject: Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
Date: Wed, 04 Aug 2021 15:48:36 -0400	[thread overview]
Message-ID: <1628105897.vb3ko0vb06.none@localhost> (raw)
In-Reply-To: <CAHk-=wiLr55zHUWNzmp3DeoO0DUaYp7vAzQB5KUCni5FpwC7Uw@mail.gmail.com>

Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm:
> Your program is buggy.
> 
> On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu@yahoo.ca> wrote:
>>
>>     pipe(pipefd);
>>     printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>>     printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
> 
> Yeah, what did you expect this to do? You said you want a minimal
> buffer, you get a really small buffer.
> 
> Then you try to write multiple messages to the pipe that you just said
> should have a minimum size.
> 
> Don't do that then.
> 
>> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
>> It looks like not only there needs to be space in the pipe, but also
>> slots.
> 
> Correct. The fullness of a pipe is not about whether it has the
> possibility of merging more bytes into an existing not-full slot, but
> about whether it has empty slots left.
> 
> Part of that is simply the POSIX pipe guarantees - a write of size
> PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
> among buffers.
> 
> So a pipe must not be "writable" unless it has space for at least that
> much (think select/poll, which don't know the size of the write).
> 
> The fact that we might be able to reuse a partially filled buffer for
> smaller writes is simply not relevant to that issue.
> 
> And yes, we could have different measures of "could write" for
> different writes, but we just don't have or want that complexity.
> 
> Please don't mess with F_SETPIPE_SZ unless you have a really good
> reason to do so, and actually understand what you are doing.
> 
> Doing a F_SETPIPE_SZ, 0 basically means "I want the mimimum pipe size
> possible". And that one accepts exactly one write at a time.
> 
> Of course, the exact semantics are much more complicated than that
> "exactly one write". The pipe write code will optimistically merge
> writes into a previous buffer, which means that depending on the
> pattern of your writes, the exact number of bytes you can write will
> be very different.
> 
> But that "merge writes into a previous buffer" only appends to the
> buffer - not _reuse_ it - so when each buffer is one page in size,
> what happens is that you can merge up to 4096 bytes worth of writes,
> but then after that the pipe write will want a new buffer - even if
> the old buffer is now empty because of old reads.
> 
> That's why your test program won't block immediately: both writers
> will actually start out happily doing writes into that one buffer that
> is allocated, but at some point that buffer ends, and it wants to
> allocate a new buffer.
> 
> But you told it not to allocate more buffers, and the old buffer is
> never completely empty because your readers never read _everythign_,
> so it will hang, waiting for you to empty the one minimal buffer it
> allocated. And that will never happen.
> 
> There's a very real reason why we do *not* by default say "pipes can
> only ever use only one buffer".
> 
> I don't think this is a regression, but if you have an actual
> application - not a test program - that does crazy things like this
> and used to work (I'm not sure it has ever worked, though), we can
> look into making it work again.
> 
> That said, I suspect the way to make it work is to just say "the
> minimum pipe size is two slots" rather than change the "we want at
> least one empty slot". Exactly because of that whole "look, we must
> not consider a pipe that doesn't have a slot writable".
> 
> Because clearly people don't understand how subtle F_SETPIPE_SZ is.
> It's not really a "byte count", even though that is how it's
> expressed.
> 
>                    Linus
> 

I agree that if this only affects programs which intentionally adjust 
the pipe buffer size, then it is not a huge issue. The problem, 
admittedly buried very close to the bottom of my email, is that the 
kernel will silently provide one-page pipe buffers if the user has 
exceeded 16384 (by default) pipe buffer pages allocated. Try this 
slightly more complicated program:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

void *thread_start(void *arg) {
    char buf[1];
    int i;
    for (i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        if (write(pipefd[1], buf, sizeof(buf)) == -1)
            break;
    }
    printf("done %d\n", i);
    return NULL;
}

int main() {
    signal(SIGPIPE, SIG_IGN);
    // pretend there are a very large number of make processes running
    for (int i = 0; i < 1025; i++)
    {
        pipe(pipefd);
        write(pipefd[1], "aa", 2);
    }
    printf("%d %d\n", pipefd[0], pipefd[1]);
    printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
    //printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread_start, NULL);
    pthread_create(&thread2, NULL, thread_start, NULL);
    sleep(3);
    close(pipefd[0]);
    close(pipefd[1]);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
}

You may need to raise your RLIMIT_NOFILE before running the program.

With default settings (fs.pipe-user-pages-soft = 16384) the init buffer 
will be 4096, no mangling required. I think this could be probably be 
solved if the kernel instead reduced over-quota pipes to two pages 
instead of one page. If someone wants to set a one-page pipe buffer, 
then they can suffer the consequences, but the kernel shouldn't silently 
hand people that footgun.

Regards,
Alex.