From: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: acrichton@mozilla.com, Christian Brauner <christian@brauner.io>,
David Howells <dhowells@redhat.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
keyrings@vger.kernel.org, Linux API <linux-api@vger.kernel.org>,
linux-block <linux-block@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Rasmus Villemoes <linux@rasmusvillemoes.dk>,
LSM List <linux-security-module@vger.kernel.org>,
linux-usb@vger.kernel.org,
Nicolas Dichtel <nicolas.dichtel@6wind.com>,
Peter Zijlstra <peterz@infradead.org>,
Ian Kent <raven@themaw.net>
Subject: Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
Date: Wed, 04 Aug 2021 15:48:36 -0400 [thread overview]
Message-ID: <1628105897.vb3ko0vb06.none@localhost> (raw)
In-Reply-To: <CAHk-=wiLr55zHUWNzmp3DeoO0DUaYp7vAzQB5KUCni5FpwC7Uw@mail.gmail.com>
Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm:
> Your program is buggy.
>
> On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu@yahoo.ca> wrote:
>>
>> pipe(pipefd);
>> printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>> printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
>
> Yeah, what did you expect this to do? You said you want a minimal
> buffer, you get a really small buffer.
>
> Then you try to write multiple messages to the pipe that you just said
> should have a minimum size.
>
> Don't do that then.
>
>> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
>> It looks like not only there needs to be space in the pipe, but also
>> slots.
>
> Correct. The fullness of a pipe is not about whether it has the
> possibility of merging more bytes into an existing not-full slot, but
> about whether it has empty slots left.
>
> Part of that is simply the POSIX pipe guarantees - a write of size
> PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
> among buffers.
>
> So a pipe must not be "writable" unless it has space for at least that
> much (think select/poll, which don't know the size of the write).
>
> The fact that we might be able to reuse a partially filled buffer for
> smaller writes is simply not relevant to that issue.
>
> And yes, we could have different measures of "could write" for
> different writes, but we just don't have or want that complexity.
>
> Please don't mess with F_SETPIPE_SZ unless you have a really good
> reason to do so, and actually understand what you are doing.
>
> Doing a F_SETPIPE_SZ, 0 basically means "I want the mimimum pipe size
> possible". And that one accepts exactly one write at a time.
>
> Of course, the exact semantics are much more complicated than that
> "exactly one write". The pipe write code will optimistically merge
> writes into a previous buffer, which means that depending on the
> pattern of your writes, the exact number of bytes you can write will
> be very different.
>
> But that "merge writes into a previous buffer" only appends to the
> buffer - not _reuse_ it - so when each buffer is one page in size,
> what happens is that you can merge up to 4096 bytes worth of writes,
> but then after that the pipe write will want a new buffer - even if
> the old buffer is now empty because of old reads.
>
> That's why your test program won't block immediately: both writers
> will actually start out happily doing writes into that one buffer that
> is allocated, but at some point that buffer ends, and it wants to
> allocate a new buffer.
>
> But you told it not to allocate more buffers, and the old buffer is
> never completely empty because your readers never read _everythign_,
> so it will hang, waiting for you to empty the one minimal buffer it
> allocated. And that will never happen.
>
> There's a very real reason why we do *not* by default say "pipes can
> only ever use only one buffer".
>
> I don't think this is a regression, but if you have an actual
> application - not a test program - that does crazy things like this
> and used to work (I'm not sure it has ever worked, though), we can
> look into making it work again.
>
> That said, I suspect the way to make it work is to just say "the
> minimum pipe size is two slots" rather than change the "we want at
> least one empty slot". Exactly because of that whole "look, we must
> not consider a pipe that doesn't have a slot writable".
>
> Because clearly people don't understand how subtle F_SETPIPE_SZ is.
> It's not really a "byte count", even though that is how it's
> expressed.
>
> Linus
>
I agree that if this only affects programs which intentionally adjust
the pipe buffer size, then it is not a huge issue. The problem,
admittedly buried very close to the bottom of my email, is that the
kernel will silently provide one-page pipe buffers if the user has
exceeded 16384 (by default) pipe buffer pages allocated. Try this
slightly more complicated program:
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
static int pipefd[2];
void *thread_start(void *arg) {
char buf[1];
int i;
for (i = 0; i < 1000000; i++) {
read(pipefd[0], buf, sizeof(buf));
if (write(pipefd[1], buf, sizeof(buf)) == -1)
break;
}
printf("done %d\n", i);
return NULL;
}
int main() {
signal(SIGPIPE, SIG_IGN);
// pretend there are a very large number of make processes running
for (int i = 0; i < 1025; i++)
{
pipe(pipefd);
write(pipefd[1], "aa", 2);
}
printf("%d %d\n", pipefd[0], pipefd[1]);
printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
//printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
pthread_t thread1, thread2;
pthread_create(&thread1, NULL, thread_start, NULL);
pthread_create(&thread2, NULL, thread_start, NULL);
sleep(3);
close(pipefd[0]);
close(pipefd[1]);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
}
You may need to raise your RLIMIT_NOFILE before running the program.
With default settings (fs.pipe-user-pages-soft = 16384) the init buffer
will be 4096, no mangling required. I think this could be probably be
solved if the kernel instead reduced over-quota pipes to two pages
instead of one page. If someone wants to set a one-page pipe buffer,
then they can suffer the consequences, but the kernel shouldn't silently
hand people that footgun.
Regards,
Alex.
next prev parent reply other threads:[~2021-08-04 19:49 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1628086770.5rn8p04n6j.none.ref@localhost>
2021-08-04 15:37 ` [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang Alex Xu (Hello71)
2021-08-04 16:31 ` Linus Torvalds
2021-08-04 19:48 ` Alex Xu (Hello71) [this message]
2021-08-04 20:04 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1628105897.vb3ko0vb06.none@localhost \
--to=alex_y_xu@yahoo.ca \
--cc=acrichton@mozilla.com \
--cc=christian@brauner.io \
--cc=dhowells@redhat.com \
--cc=gregkh@linuxfoundation.org \
--cc=keyrings@vger.kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-security-module@vger.kernel.org \
--cc=linux-usb@vger.kernel.org \
--cc=linux@rasmusvillemoes.dk \
--cc=nicolas.dichtel@6wind.com \
--cc=peterz@infradead.org \
--cc=raven@themaw.net \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).