[REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang

From: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
To: linux-kernel@vger.kernel.org, dhowells@redhat.com, acrichton@mozilla.com
Cc: torvalds@linux-foundation.org,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Peter Zijlstra <peterz@infradead.org>,
	nicolas.dichtel@6wind.com, raven@themaw.net,
	Christian Brauner <christian@brauner.io>,
	keyrings@vger.kernel.org, linux-usb@vger.kernel.org,
	linux-block@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org
Subject: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
Date: Wed, 04 Aug 2021 11:37:36 -0400	[thread overview]
Message-ID: <1628086770.5rn8p04n6j.none@localhost> (raw)
In-Reply-To: 1628086770.5rn8p04n6j.none.ref@localhost

Hi,

An issue "Jobserver hangs due to full pipe" was recently reported 
against Cargo, the Rust package manager. This was diagnosed as an issue 
with pipe writes hanging in certain circumstances.

Specifically, if two or more threads simultaneously write to a pipe, it 
is possible for all the writers to hang despite there being significant 
space available in the pipe.

I have translated the Rust example to C with some small adjustments:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

void *thread_start(void *arg) {
    char buf[1];
    for (int i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        write(pipefd[1], buf, sizeof(buf));
    }
    puts("done");
    return NULL;
}

int main() {
    pipe(pipefd);
    printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
    printf("new buffer:  %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
    write(pipefd[1], "aa", 2);
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread_start, NULL);
    pthread_create(&thread2, NULL, thread_start, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
}

The expected behavior of this program is to print:

init buffer: 65536
new buffer:  4096
done
done

and then exit.

On Linux 5.14-rc4, compiling this program and running it will print the 
following about half the time:

init buffer: 65536
new buffer:  4096
done

and then hang. This is unexpected behavior, since the pipe is at most 
two bytes full at any given time.

/proc/x/stack shows that the remaining thread is hanging at pipe.c:560. 
It looks like not only there needs to be space in the pipe, but also 
slots. At pipe.c:1306, a one-page pipe has only one slot. this led me to 
test nthreads=2, which also hangs. Checking blame of the pipe_write 
comment, it was added in a194dfe, which says, among other things:

> We just abandon the preallocated slot if we get a copy error.  Future
> writes may continue it and a future read will eventually recycle it.

This matches the observed behavior: in this case, there are no readers 
on the pipe, so the abandoned slot is lost.

In my opinion (as expressed on the issue), the pipe is being misused 
here. As explained in the pipe(7) manual page:

> Applications should not rely on a particular capacity: an application 
> should be designed so that a reading process consumes data as soon as 
> it is available, so that a writing process does not remain blocked.

Despite the misuse, I am reporting this for the following reasons:

1. I am reasonably confident that this is a regression in the kernel, 
   which has a standard of making reasonable efforts to maintain 
   backwards compatibility even with broken programs.

2. Even if this is not a regression, it seems like this situation could 
   be handled somewhat more gracefully. In this case, we are not writing 
   4095 bytes and then expecting a one-byte write to succeed; the pipe 
   is actually almost entirely empty.

3. Pipe sizes dynamically shrink in Linux, so despite the fact that this 
   case is unlikely to occur with two or more slots available, even a 
   program which does not explicitly allocate a one-page pipe buffer may 
   wind up with one if the user has 1024 or more pipes already open. 
   This significantly exacerbates the next point:

4. GNU make's jobserver uses pipes in a similar manner. By my reading of 
   the paper, it is theoretically possible for an N simultaneous writes 
   to occur without any readers, where N is the maximum concurrent jobs 
   permitted.

   Consider the following example with make -j2: two compile jobs are to 
   be performed: one at the top level, and one in a sub-directory. The 
   top-level make invokes one make and one cc, costing two tokens. The 
   sub-make invokes one cc with its free token. The pipe is now empty. 
   Now, suppose the two compilers return at exactly the same time. Both 
   copies of make will attempt to simultaneously write a token to the 
   pipe. This does not yet trigger deadlock: at least one write will 
   always succeed on an empty pipe. Suppose the sub-make's write goes 
   through. It then exits. The top-level make, however, is still blocked 
   on its original write, since it was not successfully merged with the 
   other write. The build is now deadlocked.

   I think this does not happen only by a coincidental design decision: 
   when the sub-make exits, the top-level make receives a SIGCHLD. GNU 
   make registers a SA_RESTART handler for SIGCHLD, so the write will be 
   interrupted and restarted. This is only a coincidence, however: the 
   program does not actually expect writing to the control pipe to ever 
   block; it could just as well de-register the signal handler while 
   performing the write and still be fully correct.

Regards,
Alex.