linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 5.5 539/542] fuse: don't overflow LLONG_MAX with end offset
       [not found] <20200214154854.6746-1-sashal@kernel.org>
@ 2020-02-14 15:48 ` Sasha Levin
  2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin
  1 sibling, 0 replies; 14+ messages in thread
From: Sasha Levin @ 2020-02-14 15:48 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Miklos Szeredi, Xiao Yang, Sasha Levin, linux-fsdevel

From: Miklos Szeredi <mszeredi@redhat.com>

[ Upstream commit 2f1398291bf35fe027914ae7a9610d8e601fbfde ]

Handle the special case of fuse_readpages() wanting to read the last page
of a hugest file possible and overflowing the end offset in the process.

This is basically to unbreak xfstests:generic/525 and prevent filesystems
from doing bad things with an overflowing offset.

Reported-by: Xiao Yang <ice_yangxiao@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/fuse/file.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 695369f46f92d..3dd37a998ea93 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -803,6 +803,10 @@ static int fuse_do_readpage(struct file *file, struct page *page)
 
 	attr_ver = fuse_get_attr_version(fc);
 
+	/* Don't overflow end offset */
+	if (pos + (desc.length - 1) == LLONG_MAX)
+		desc.length--;
+
 	fuse_read_args_fill(&ia, file, pos, desc.length, FUSE_READ);
 	res = fuse_simple_request(fc, &ia.ap.args);
 	if (res < 0)
@@ -888,6 +892,14 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
 	ap->args.out_pages = true;
 	ap->args.page_zeroing = true;
 	ap->args.page_replace = true;
+
+	/* Don't overflow end offset */
+	if (pos + (count - 1) == LLONG_MAX) {
+		count--;
+		ap->descs[ap->num_pages - 1].length--;
+	}
+	WARN_ON((loff_t) (pos + count) < 0);
+
 	fuse_read_args_fill(ia, file, pos, count, FUSE_READ);
 	ia->read.attr_ver = fuse_get_attr_version(fc);
 	if (fc->async_read) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
       [not found] <20200214154854.6746-1-sashal@kernel.org>
  2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 539/542] fuse: don't overflow LLONG_MAX with end offset Sasha Levin
@ 2020-02-14 15:48 ` Sasha Levin
  2020-02-18  9:51   ` Andrei Vagin
  2020-02-18 18:53   ` Linus Torvalds
  1 sibling, 2 replies; 14+ messages in thread
From: Sasha Levin @ 2020-02-14 15:48 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Linus Torvalds, Josh Triplett, Sasha Levin, linux-fsdevel

From: Linus Torvalds <torvalds@linux-foundation.org>

[ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ]

This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.

This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.

A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):

    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd[2], counters[2];

        pipe(fd);
        counters[0] = 0;
        counters[1] = -1;
        write(fd[1], counters, sizeof(counters));

        /* 64 processes */
        fork(); fork(); fork(); fork(); fork(); fork();

        do {
                int i;
                read(fd[0], &i, sizeof(i));
                if (i < 0)
                        continue;
                counters[0] = i+1;
                write(fd[1], counters, (1+(i & 1)) *sizeof(int));
        } while (counters[0] < 1000000);
        return 0;
    }

and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows

        231,852.37 msec task-clock                #   15.857 CPUs utilized
        11,250,961      context-switches          #    0.049 M/sec
           616,304      cpu-migrations            #    0.003 M/sec
             1,648      page-faults               #    0.007 K/sec
 1,097,903,998,514      cycles                    #    4.735 GHz
   120,781,778,352      instructions              #    0.11  insn per cycle
    27,997,056,043      branches                  #  120.754 M/sec
       283,581,233      branch-misses             #    1.01% of all branches

      14.621273891 seconds time elapsed

       0.018243000 seconds user
       3.611468000 seconds sys

before this commit.

After this commit, I get

          5,229.55 msec task-clock                #    3.072 CPUs utilized
         1,212,233      context-switches          #    0.232 M/sec
           103,951      cpu-migrations            #    0.020 M/sec
             1,328      page-faults               #    0.254 K/sec
    21,307,456,166      cycles                    #    4.074 GHz
    12,947,819,999      instructions              #    0.61  insn per cycle
     2,881,985,678      branches                  #  551.096 M/sec
        64,267,015      branch-misses             #    2.23% of all branches

       1.702148350 seconds time elapsed

       0.004868000 seconds user
       0.110786000 seconds sys

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a
  race condition in the make jobserver (in GNU make 4.2.1) for me. It's
  a long known bug that was fixed back in June 2017 by GNU make commit
  b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
  avoid hangs.").

  But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
  so a number of distributions may still have the buggy version. Some
  have backported the fix to their 4.2.1 release, though, and even
  without the fix it's quite timing-dependent whether the bug actually
  is hit. ]

Josh Triplett says:
 "I've been hammering on your pipe fix patch (switching to exclusive
  wait queues) for a month or so, on several different systems, and I've
  run into no issues with it. The patch *substantially* improves
  parallel build times on large (~100 CPU) systems, both with parallel
  make and with other things that use make's pipe-based jobserver.

  All current distributions (including stable and long-term stable
  distributions) have versions of GNU make that no longer have the
  jobserver bug"

Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/coredump.c             |  4 +--
 fs/pipe.c                 | 67 +++++++++++++++++++++++++--------------
 fs/splice.c               |  8 ++---
 include/linux/pipe_fs_i.h |  2 +-
 4 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index b1ea7dfbd1494..f8296a82d01df 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file)
 	pipe_lock(pipe);
 	pipe->readers++;
 	pipe->writers--;
-	wake_up_interruptible_sync(&pipe->wait);
+	wake_up_interruptible_sync(&pipe->rd_wait);
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 	pipe_unlock(pipe);
 
@@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file)
 	 * We actually want wait_event_freezable() but then we need
 	 * to clear TIF_SIGPENDING and improve dump_interrupted().
 	 */
-	wait_event_interruptible(pipe->wait, pipe->readers == 1);
+	wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
 
 	pipe_lock(pipe);
 	pipe->readers--;
diff --git a/fs/pipe.c b/fs/pipe.c
index 57502c3c0fba1..5a34d6c22d4ce 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 /* Drop the inode semaphore and wait for a pipe event, atomically */
 void pipe_wait(struct pipe_inode_info *pipe)
 {
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT(rdwait);
+	DEFINE_WAIT(wrwait);
 
 	/*
 	 * Pipes are system-local resources, so sleeping on them
 	 * is considered a noninteractive wait:
 	 */
-	prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
 	pipe_unlock(pipe);
 	schedule();
-	finish_wait(&pipe->wait, &wait);
+	finish_wait(&pipe->rd_wait, &rdwait);
+	finish_wait(&pipe->wr_wait, &wrwait);
 	pipe_lock(pipe);
 }
 
@@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	size_t total_len = iov_iter_count(to);
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
-	bool was_full;
+	bool was_full, wake_next_reader = false;
 	ssize_t ret;
 
 	/* Null read succeeds. */
@@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 
 			if (!buf->len) {
 				pipe_buf_release(pipe, buf);
-				spin_lock_irq(&pipe->wait.lock);
+				spin_lock_irq(&pipe->rd_wait.lock);
 				tail++;
 				pipe->tail = tail;
-				spin_unlock_irq(&pipe->wait.lock);
+				spin_unlock_irq(&pipe->rd_wait.lock);
 			}
 			total_len -= chars;
 			if (!total_len)
@@ -384,7 +387,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		 * no data.
 		 */
 		if (unlikely(was_full)) {
-			wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
+			wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 			kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 		}
 
@@ -394,18 +397,23 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		 * since we've done any required wakeups and there's no need
 		 * to mark anything accessed. And we've dropped the lock.
 		 */
-		if (wait_event_interruptible(pipe->wait, pipe_readable(pipe)) < 0)
+		if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
 			return -ERESTARTSYS;
 
 		__pipe_lock(pipe);
 		was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
+		wake_next_reader = true;
 	}
+	if (pipe_empty(pipe->head, pipe->tail))
+		wake_next_reader = false;
 	__pipe_unlock(pipe);
 
 	if (was_full) {
-		wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
+		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
+	if (wake_next_reader)
+		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	if (ret > 0)
 		file_accessed(filp);
 	return ret;
@@ -437,6 +445,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	size_t total_len = iov_iter_count(from);
 	ssize_t chars;
 	bool was_empty = false;
+	bool wake_next_writer = false;
 
 	/* Null write succeeds. */
 	if (unlikely(total_len == 0))
@@ -515,16 +524,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			 * it, either the reader will consume it or it'll still
 			 * be there for the next write.
 			 */
-			spin_lock_irq(&pipe->wait.lock);
+			spin_lock_irq(&pipe->rd_wait.lock);
 
 			head = pipe->head;
 			if (pipe_full(head, pipe->tail, pipe->max_usage)) {
-				spin_unlock_irq(&pipe->wait.lock);
+				spin_unlock_irq(&pipe->rd_wait.lock);
 				continue;
 			}
 
 			pipe->head = head + 1;
-			spin_unlock_irq(&pipe->wait.lock);
+			spin_unlock_irq(&pipe->rd_wait.lock);
 
 			/* Insert it into the buffer array */
 			buf = &pipe->bufs[head & mask];
@@ -576,14 +585,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 */
 		__pipe_unlock(pipe);
 		if (was_empty) {
-			wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
+			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		}
-		wait_event_interruptible(pipe->wait, pipe_writable(pipe));
+		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
 		__pipe_lock(pipe);
 		was_empty = pipe_empty(pipe->head, pipe->tail);
+		wake_next_writer = true;
 	}
 out:
+	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+		wake_next_writer = false;
 	__pipe_unlock(pipe);
 
 	/*
@@ -596,9 +608,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * wake up pending jobs
 	 */
 	if (was_empty) {
-		wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
+		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 	}
+	if (wake_next_writer)
+		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
 		int err = file_update_time(filp);
 		if (err)
@@ -642,12 +656,15 @@ pipe_poll(struct file *filp, poll_table *wait)
 	unsigned int head, tail;
 
 	/*
-	 * Reading only -- no need for acquiring the semaphore.
+	 * Reading pipe state only -- no need for acquiring the semaphore.
 	 *
 	 * But because this is racy, the code has to add the
 	 * entry to the poll table _first_ ..
 	 */
-	poll_wait(filp, &pipe->wait, wait);
+	if (filp->f_mode & FMODE_READ)
+		poll_wait(filp, &pipe->rd_wait, wait);
+	if (filp->f_mode & FMODE_WRITE)
+		poll_wait(filp, &pipe->wr_wait, wait);
 
 	/*
 	 * .. and only then can you do the racy tests. That way,
@@ -706,7 +723,8 @@ pipe_release(struct inode *inode, struct file *file)
 		pipe->writers--;
 
 	if (pipe->readers || pipe->writers) {
-		wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
+		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
+		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
@@ -789,7 +807,8 @@ struct pipe_inode_info *alloc_pipe_info(void)
 			     GFP_KERNEL_ACCOUNT);
 
 	if (pipe->bufs) {
-		init_waitqueue_head(&pipe->wait);
+		init_waitqueue_head(&pipe->rd_wait);
+		init_waitqueue_head(&pipe->wr_wait);
 		pipe->r_counter = pipe->w_counter = 1;
 		pipe->max_usage = pipe_bufs;
 		pipe->ring_size = pipe_bufs;
@@ -1007,7 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 
 static void wake_up_partner(struct pipe_inode_info *pipe)
 {
-	wake_up_interruptible(&pipe->wait);
+	wake_up_interruptible(&pipe->rd_wait);
+	wake_up_interruptible(&pipe->wr_wait);
 }
 
 static int fifo_open(struct inode *inode, struct file *filp)
@@ -1118,13 +1138,13 @@ static int fifo_open(struct inode *inode, struct file *filp)
 
 err_rd:
 	if (!--pipe->readers)
-		wake_up_interruptible(&pipe->wait);
+		wake_up_interruptible(&pipe->wr_wait);
 	ret = -ERESTARTSYS;
 	goto err;
 
 err_wr:
 	if (!--pipe->writers)
-		wake_up_interruptible(&pipe->wait);
+		wake_up_interruptible(&pipe->rd_wait);
 	ret = -ERESTARTSYS;
 	goto err;
 
@@ -1251,7 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	pipe->max_usage = nr_slots;
 	pipe->tail = tail;
 	pipe->head = head;
-	wake_up_interruptible_all(&pipe->wait);
+	wake_up_interruptible_all(&pipe->rd_wait);
+	wake_up_interruptible_all(&pipe->wr_wait);
 	return pipe->max_usage * PAGE_SIZE;
 
 out_revert_acct:
diff --git a/fs/splice.c b/fs/splice.c
index 3009652a41c85..d671936d0aad6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = {
 static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
 {
 	smp_mb();
-	if (waitqueue_active(&pipe->wait))
-		wake_up_interruptible(&pipe->wait);
+	if (waitqueue_active(&pipe->rd_wait))
+		wake_up_interruptible(&pipe->rd_wait);
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 }
 
@@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
 {
 	smp_mb();
-	if (waitqueue_active(&pipe->wait))
-		wake_up_interruptible(&pipe->wait);
+	if (waitqueue_active(&pipe->wr_wait))
+		wake_up_interruptible(&pipe->wr_wait);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 }
 
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index dbcfa68923842..d5765039652a5 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -47,7 +47,7 @@ struct pipe_buffer {
  **/
 struct pipe_inode_info {
 	struct mutex mutex;
-	wait_queue_head_t wait;
+	wait_queue_head_t rd_wait, wr_wait;
 	unsigned int head;
 	unsigned int tail;
 	unsigned int max_usage;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin
@ 2020-02-18  9:51   ` Andrei Vagin
  2020-02-18 17:36     ` Linus Torvalds
  2020-02-18 18:53   ` Linus Torvalds
  1 sibling, 1 reply; 14+ messages in thread
From: Andrei Vagin @ 2020-02-18  9:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

Hi Linus,

This patch breaks one of CRIU tests. Here is a small reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main()
{
  int p[2];
  pid_t p1, p2;
  int status;

  if (pipe(p) == -1)
    return 1;

  p1 = fork();
  if (p1 == 0) {
    close(p[1]);
    read(p[0], &status, sizeof(status));
    return 0;
  }
  p2 = fork();
  if (p2 == 0) {
    close(p[1]);
    read(p[0], &status, sizeof(status));
    return 0;
  }
  sleep(1);
  close(p[1]);
  wait(&status);
  wait(&status);

  return 0;
}

Here are two readers which are waiting for data but only one of them
will be woken up after closing the last writer.

The quick fix looks like this:

diff --git a/fs/pipe.c b/fs/pipe.c
index 5a34d6c22d4c..deaf67239a18 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -412,7 +412,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
                wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
                kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
        }
-       if (wake_next_reader)
+       if (!pipe->writers || wake_next_reader)
                wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);
        if (ret > 0)
                file_accessed(filp);

I've checked that it fixes the issue, but It is too late today to read
this code carefully, so I could skip something.

Thanks,
Andrei

On Fri, Feb 14, 2020 at 8:03 AM Sasha Levin <sashal@kernel.org> wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
>
> [ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ]
>
> This makes the pipe code use separate wait-queues and exclusive waiting
> for readers and writers, avoiding a nasty thundering herd problem when
> there are lots of readers waiting for data on a pipe (or, less commonly,
> lots of writers waiting for a pipe to have space).
>
> While this isn't a common occurrence in the traditional "use a pipe as a
> data transport" case, where you typically only have a single reader and
> a single writer process, there is one common special case: using a pipe
> as a source of "locking tokens" rather than for data communication.
>
> In particular, the GNU make jobserver code ends up using a pipe as a way
> to limit parallelism, where each job consumes a token by reading a byte
> from the jobserver pipe, and releases the token by writing a byte back
> to the pipe.
>
> This pattern is fairly traditional on Unix, and works very well, but
> will waste a lot of time waking up a lot of processes when only a single
> reader needs to be woken up when a writer releases a new token.
>
> A simplified test-case of just this pipe interaction is to create 64
> processes, and then pass a single token around between them (this
> test-case also intentionally passes another token that gets ignored to
> test the "wake up next" logic too, in case anybody wonders about it):
>
>     #include <unistd.h>
>
>     int main(int argc, char **argv)
>     {
>         int fd[2], counters[2];
>
>         pipe(fd);
>         counters[0] = 0;
>         counters[1] = -1;
>         write(fd[1], counters, sizeof(counters));
>
>         /* 64 processes */
>         fork(); fork(); fork(); fork(); fork(); fork();
>
>         do {
>                 int i;
>                 read(fd[0], &i, sizeof(i));
>                 if (i < 0)
>                         continue;
>                 counters[0] = i+1;
>                 write(fd[1], counters, (1+(i & 1)) *sizeof(int));
>         } while (counters[0] < 1000000);
>         return 0;
>     }
>
> and in a perfect world, passing that token around should only cause one
> context switch per transfer, when the writer of a token causes a
> directed wakeup of just a single reader.
>
> But with the "writer wakes all readers" model we traditionally had, on
> my test box the above case causes more than an order of magnitude more
> scheduling: instead of the expected ~1M context switches, "perf stat"
> shows
>
>         231,852.37 msec task-clock                #   15.857 CPUs utilized
>         11,250,961      context-switches          #    0.049 M/sec
>            616,304      cpu-migrations            #    0.003 M/sec
>              1,648      page-faults               #    0.007 K/sec
>  1,097,903,998,514      cycles                    #    4.735 GHz
>    120,781,778,352      instructions              #    0.11  insn per cycle
>     27,997,056,043      branches                  #  120.754 M/sec
>        283,581,233      branch-misses             #    1.01% of all branches
>
>       14.621273891 seconds time elapsed
>
>        0.018243000 seconds user
>        3.611468000 seconds sys
>
> before this commit.
>
> After this commit, I get
>
>           5,229.55 msec task-clock                #    3.072 CPUs utilized
>          1,212,233      context-switches          #    0.232 M/sec
>            103,951      cpu-migrations            #    0.020 M/sec
>              1,328      page-faults               #    0.254 K/sec
>     21,307,456,166      cycles                    #    4.074 GHz
>     12,947,819,999      instructions              #    0.61  insn per cycle
>      2,881,985,678      branches                  #  551.096 M/sec
>         64,267,015      branch-misses             #    2.23% of all branches
>
>        1.702148350 seconds time elapsed
>
>        0.004868000 seconds user
>        0.110786000 seconds sys
>
> instead. Much better.
>
> [ Note! This kernel improvement seems to be very good at triggering a
>   race condition in the make jobserver (in GNU make 4.2.1) for me. It's
>   a long known bug that was fixed back in June 2017 by GNU make commit
>   b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
>   avoid hangs.").
>
>   But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
>   so a number of distributions may still have the buggy version. Some
>   have backported the fix to their 4.2.1 release, though, and even
>   without the fix it's quite timing-dependent whether the bug actually
>   is hit. ]
>
> Josh Triplett says:
>  "I've been hammering on your pipe fix patch (switching to exclusive
>   wait queues) for a month or so, on several different systems, and I've
>   run into no issues with it. The patch *substantially* improves
>   parallel build times on large (~100 CPU) systems, both with parallel
>   make and with other things that use make's pipe-based jobserver.
>
>   All current distributions (including stable and long-term stable
>   distributions) have versions of GNU make that no longer have the
>   jobserver bug"
>
> Tested-by: Josh Triplett <josh@joshtriplett.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  fs/coredump.c             |  4 +--
>  fs/pipe.c                 | 67 +++++++++++++++++++++++++--------------
>  fs/splice.c               |  8 ++---
>  include/linux/pipe_fs_i.h |  2 +-
>  4 files changed, 51 insertions(+), 30 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index b1ea7dfbd1494..f8296a82d01df 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file)
>         pipe_lock(pipe);
>         pipe->readers++;
>         pipe->writers--;
> -       wake_up_interruptible_sync(&pipe->wait);
> +       wake_up_interruptible_sync(&pipe->rd_wait);
>         kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>         pipe_unlock(pipe);
>
> @@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file)
>          * We actually want wait_event_freezable() but then we need
>          * to clear TIF_SIGPENDING and improve dump_interrupted().
>          */
> -       wait_event_interruptible(pipe->wait, pipe->readers == 1);
> +       wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>
>         pipe_lock(pipe);
>         pipe->readers--;
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 57502c3c0fba1..5a34d6c22d4ce 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
>  /* Drop the inode semaphore and wait for a pipe event, atomically */
>  void pipe_wait(struct pipe_inode_info *pipe)
>  {
> -       DEFINE_WAIT(wait);
> +       DEFINE_WAIT(rdwait);
> +       DEFINE_WAIT(wrwait);
>
>         /*
>          * Pipes are system-local resources, so sleeping on them
>          * is considered a noninteractive wait:
>          */
> -       prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
> +       prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
> +       prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
>         pipe_unlock(pipe);
>         schedule();
> -       finish_wait(&pipe->wait, &wait);
> +       finish_wait(&pipe->rd_wait, &rdwait);
> +       finish_wait(&pipe->wr_wait, &wrwait);
>         pipe_lock(pipe);
>  }
>
> @@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>         size_t total_len = iov_iter_count(to);
>         struct file *filp = iocb->ki_filp;
>         struct pipe_inode_info *pipe = filp->private_data;
> -       bool was_full;
> +       bool was_full, wake_next_reader = false;
>         ssize_t ret;
>
>         /* Null read succeeds. */
> @@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>
>                         if (!buf->len) {
>                                 pipe_buf_release(pipe, buf);
> -                               spin_lock_irq(&pipe->wait.lock);
> +                               spin_lock_irq(&pipe->rd_wait.lock);
>                                 tail++;
>                                 pipe->tail = tail;
> -                               spin_unlock_irq(&pipe->wait.lock);
> +                               spin_unlock_irq(&pipe->rd_wait.lock);
>                         }
>                         total_len -= chars;
>                         if (!total_len)
> @@ -384,7 +387,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>                  * no data.
>                  */
>                 if (unlikely(was_full)) {
> -                       wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
> +                       wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>                         kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>                 }
>
> @@ -394,18 +397,23 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>                  * since we've done any required wakeups and there's no need
>                  * to mark anything accessed. And we've dropped the lock.
>                  */
> -               if (wait_event_interruptible(pipe->wait, pipe_readable(pipe)) < 0)
> +               if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
>                         return -ERESTARTSYS;
>
>                 __pipe_lock(pipe);
>                 was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
> +               wake_next_reader = true;
>         }
> +       if (pipe_empty(pipe->head, pipe->tail))
> +               wake_next_reader = false;
>         __pipe_unlock(pipe);
>
>         if (was_full) {
> -               wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
> +               wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>                 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>         }
> +       if (wake_next_reader)
> +               wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>         if (ret > 0)
>                 file_accessed(filp);
>         return ret;
> @@ -437,6 +445,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
>         size_t total_len = iov_iter_count(from);
>         ssize_t chars;
>         bool was_empty = false;
> +       bool wake_next_writer = false;
>
>         /* Null write succeeds. */
>         if (unlikely(total_len == 0))
> @@ -515,16 +524,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                          * it, either the reader will consume it or it'll still
>                          * be there for the next write.
>                          */
> -                       spin_lock_irq(&pipe->wait.lock);
> +                       spin_lock_irq(&pipe->rd_wait.lock);
>
>                         head = pipe->head;
>                         if (pipe_full(head, pipe->tail, pipe->max_usage)) {
> -                               spin_unlock_irq(&pipe->wait.lock);
> +                               spin_unlock_irq(&pipe->rd_wait.lock);
>                                 continue;
>                         }
>
>                         pipe->head = head + 1;
> -                       spin_unlock_irq(&pipe->wait.lock);
> +                       spin_unlock_irq(&pipe->rd_wait.lock);
>
>                         /* Insert it into the buffer array */
>                         buf = &pipe->bufs[head & mask];
> @@ -576,14 +585,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
>                  */
>                 __pipe_unlock(pipe);
>                 if (was_empty) {
> -                       wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
> +                       wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>                         kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>                 }
> -               wait_event_interruptible(pipe->wait, pipe_writable(pipe));
> +               wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
>                 __pipe_lock(pipe);
>                 was_empty = pipe_empty(pipe->head, pipe->tail);
> +               wake_next_writer = true;
>         }
>  out:
> +       if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
> +               wake_next_writer = false;
>         __pipe_unlock(pipe);
>
>         /*
> @@ -596,9 +608,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
>          * wake up pending jobs
>          */
>         if (was_empty) {
> -               wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
> +               wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>                 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>         }
> +       if (wake_next_writer)
> +               wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>         if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>                 int err = file_update_time(filp);
>                 if (err)
> @@ -642,12 +656,15 @@ pipe_poll(struct file *filp, poll_table *wait)
>         unsigned int head, tail;
>
>         /*
> -        * Reading only -- no need for acquiring the semaphore.
> +        * Reading pipe state only -- no need for acquiring the semaphore.
>          *
>          * But because this is racy, the code has to add the
>          * entry to the poll table _first_ ..
>          */
> -       poll_wait(filp, &pipe->wait, wait);
> +       if (filp->f_mode & FMODE_READ)
> +               poll_wait(filp, &pipe->rd_wait, wait);
> +       if (filp->f_mode & FMODE_WRITE)
> +               poll_wait(filp, &pipe->wr_wait, wait);
>
>         /*
>          * .. and only then can you do the racy tests. That way,
> @@ -706,7 +723,8 @@ pipe_release(struct inode *inode, struct file *file)
>                 pipe->writers--;
>
>         if (pipe->readers || pipe->writers) {
> -               wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
> +               wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
> +               wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
>                 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>                 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>         }
> @@ -789,7 +807,8 @@ struct pipe_inode_info *alloc_pipe_info(void)
>                              GFP_KERNEL_ACCOUNT);
>
>         if (pipe->bufs) {
> -               init_waitqueue_head(&pipe->wait);
> +               init_waitqueue_head(&pipe->rd_wait);
> +               init_waitqueue_head(&pipe->wr_wait);
>                 pipe->r_counter = pipe->w_counter = 1;
>                 pipe->max_usage = pipe_bufs;
>                 pipe->ring_size = pipe_bufs;
> @@ -1007,7 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
>
>  static void wake_up_partner(struct pipe_inode_info *pipe)
>  {
> -       wake_up_interruptible(&pipe->wait);
> +       wake_up_interruptible(&pipe->rd_wait);
> +       wake_up_interruptible(&pipe->wr_wait);
>  }
>
>  static int fifo_open(struct inode *inode, struct file *filp)
> @@ -1118,13 +1138,13 @@ static int fifo_open(struct inode *inode, struct file *filp)
>
>  err_rd:
>         if (!--pipe->readers)
> -               wake_up_interruptible(&pipe->wait);
> +               wake_up_interruptible(&pipe->wr_wait);
>         ret = -ERESTARTSYS;
>         goto err;
>
>  err_wr:
>         if (!--pipe->writers)
> -               wake_up_interruptible(&pipe->wait);
> +               wake_up_interruptible(&pipe->rd_wait);
>         ret = -ERESTARTSYS;
>         goto err;
>
> @@ -1251,7 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
>         pipe->max_usage = nr_slots;
>         pipe->tail = tail;
>         pipe->head = head;
> -       wake_up_interruptible_all(&pipe->wait);
> +       wake_up_interruptible_all(&pipe->rd_wait);
> +       wake_up_interruptible_all(&pipe->wr_wait);
>         return pipe->max_usage * PAGE_SIZE;
>
>  out_revert_acct:
> diff --git a/fs/splice.c b/fs/splice.c
> index 3009652a41c85..d671936d0aad6 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = {
>  static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
>  {
>         smp_mb();
> -       if (waitqueue_active(&pipe->wait))
> -               wake_up_interruptible(&pipe->wait);
> +       if (waitqueue_active(&pipe->rd_wait))
> +               wake_up_interruptible(&pipe->rd_wait);
>         kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>  }
>
> @@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
>  static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
>  {
>         smp_mb();
> -       if (waitqueue_active(&pipe->wait))
> -               wake_up_interruptible(&pipe->wait);
> +       if (waitqueue_active(&pipe->wr_wait))
> +               wake_up_interruptible(&pipe->wr_wait);
>         kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>  }
>
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index dbcfa68923842..d5765039652a5 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -47,7 +47,7 @@ struct pipe_buffer {
>   **/
>  struct pipe_inode_info {
>         struct mutex mutex;
> -       wait_queue_head_t wait;
> +       wait_queue_head_t rd_wait, wr_wait;
>         unsigned int head;
>         unsigned int tail;
>         unsigned int max_usage;
> --
> 2.20.1
>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18  9:51   ` Andrei Vagin
@ 2020-02-18 17:36     ` Linus Torvalds
  2020-02-18 17:54       ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2020-02-18 17:36 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 680 bytes --]

On Tue, Feb 18, 2020 at 1:51 AM Andrei Vagin <avagin@gmail.com> wrote:
>
> This patch breaks one of CRIU tests. Here is a small reproducer:

Good catch.

> The quick fix looks like this:

That one works, but is not really right.

The things that change the number of readers or writers should simply
use "wake_up_all()".

I thought we did that already, but no - there _was_ one place where we
did it, but that was for the pipe buffer size case, and in that case
it's actually pointless. That case acts just like a "new space or data
was added"

So I think the right fix is the attached patch. Since you had such a
lovely test-case, let me go test it too ;)

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1227 bytes --]

 fs/pipe.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 5a34d6c22d4c..76e7f66fe2fe 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1026,8 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 
 static void wake_up_partner(struct pipe_inode_info *pipe)
 {
-	wake_up_interruptible(&pipe->rd_wait);
-	wake_up_interruptible(&pipe->wr_wait);
+	wake_up_interruptible_all(&pipe->rd_wait);
+	wake_up_interruptible_all(&pipe->wr_wait);
 }
 
 static int fifo_open(struct inode *inode, struct file *filp)
@@ -1144,7 +1144,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 
 err_wr:
 	if (!--pipe->writers)
-		wake_up_interruptible(&pipe->rd_wait);
+		wake_up_interruptible_all(&pipe->rd_wait);
 	ret = -ERESTARTSYS;
 	goto err;
 
@@ -1271,8 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	pipe->max_usage = nr_slots;
 	pipe->tail = tail;
 	pipe->head = head;
-	wake_up_interruptible_all(&pipe->rd_wait);
-	wake_up_interruptible_all(&pipe->wr_wait);
+	wake_up_interruptible(&pipe->rd_wait);
+	wake_up_interruptible(&pipe->wr_wait);
 	return pipe->max_usage * PAGE_SIZE;
 
 out_revert_acct:

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18 17:36     ` Linus Torvalds
@ 2020-02-18 17:54       ` Linus Torvalds
  2020-02-18 18:17         ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2020-02-18 17:54 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1100 bytes --]

On Tue, Feb 18, 2020 at 9:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The things that change the number of readers or writers should simply
> use "wake_up_all()".
>
> So I think the right fix is the attached patch. Since you had such a
> lovely test-case, let me go test it too ;)

Good that I did. I missed the _real_ case of this - pipe_release().
Because that used a different wakeup function.

In fact, that case uses wake_up_interruptible_sync_poll(), which
doesn't have the "all" version.

But it doesn't actually need that fancy thing, since it's only meant
for "let's avoid waking up things that don't need these poll keys",
and the whole point is that now we're closing the pipe so we should
wake up everybody.

And in fact the test for "are there readers or writers" was
nonsensical. We shouldn't wake up readers just because they still
exist. We should wake up readers only if they exist, _and_ there are
no writers left (and vice versa).

Anyway, new patch attached. This hasn't been tested either, but I'll
let you know if it's broken too ;)

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1916 bytes --]

 fs/pipe.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 5a34d6c22d4c..89d54c1911fe 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -722,9 +722,10 @@ pipe_release(struct inode *inode, struct file *file)
 	if (file->f_mode & FMODE_WRITE)
 		pipe->writers--;
 
-	if (pipe->readers || pipe->writers) {
-		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
-		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
+	/* Was that the last reader or writer, but not the other side? */
+	if (!pipe->readers != !pipe->writers) {
+		wake_up_interruptible_all(&pipe->rd_wait);
+		wake_up_interruptible_all(&pipe->wr_wait);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
@@ -1026,8 +1027,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 
 static void wake_up_partner(struct pipe_inode_info *pipe)
 {
-	wake_up_interruptible(&pipe->rd_wait);
-	wake_up_interruptible(&pipe->wr_wait);
+	wake_up_interruptible_all(&pipe->rd_wait);
+	wake_up_interruptible_all(&pipe->wr_wait);
 }
 
 static int fifo_open(struct inode *inode, struct file *filp)
@@ -1144,7 +1145,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 
 err_wr:
 	if (!--pipe->writers)
-		wake_up_interruptible(&pipe->rd_wait);
+		wake_up_interruptible_all(&pipe->rd_wait);
 	ret = -ERESTARTSYS;
 	goto err;
 
@@ -1271,8 +1272,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	pipe->max_usage = nr_slots;
 	pipe->tail = tail;
 	pipe->head = head;
-	wake_up_interruptible_all(&pipe->rd_wait);
-	wake_up_interruptible_all(&pipe->wr_wait);
+	wake_up_interruptible(&pipe->rd_wait);
+	wake_up_interruptible(&pipe->wr_wait);
 	return pipe->max_usage * PAGE_SIZE;
 
 out_revert_acct:

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18 17:54       ` Linus Torvalds
@ 2020-02-18 18:17         ` Linus Torvalds
  2020-02-18 18:20           ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2020-02-18 18:17 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 483 bytes --]

On Tue, Feb 18, 2020 at 9:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, new patch attached. This hasn't been tested either, but I'll
> let you know if it's broken too ;)

That one looks good. Some small cosmetic edits later, and with a
commit log it looks like the appended.

If you're testing this and it works for your full CRIU test case too,
I'll add your tested-by if I get it before I end up pushing things out
later today,

                    Linus

[-- Attachment #2: 0001-pipe-make-sure-to-wake-up-everybody-when-the-last-re.patch --]
[-- Type: text/x-patch, Size: 4776 bytes --]

From 7e2fe96d1b760eb3e7dec771db7b8983d5b6c25c Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 18 Feb 2020 10:12:58 -0800
Subject: [PATCH] pipe: make sure to wake up everybody when the last
 reader/writer closes

Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests.  He even
has a trivial reproducer:

    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main()
    {
            int p[2];
            pid_t p1, p2;
            int status;

            if (pipe(p) == -1)
                    return 1;

            p1 = fork();
            if (p1 == 0) {
                    close(p[1]);
                    read(p[0], &status, sizeof(status));
                    return 0;
            }
            p2 = fork();
            if (p2 == 0) {
                    close(p[1]);
                    read(p[0], &status, sizeof(status));
                    return 0;
            }
            sleep(1);
            close(p[1]);
            wait(&status);
            wait(&status);

            return 0;
    }

and the problem - once he points it out - is obvious.  We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).

In fact, when going through this, we had several small oddities around
how to wake things.  We did in fact wake every reader when we changed
the size of the pipe buffers.  But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.

And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.

On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody.  So don't do that: just
wake up everybody - but only do that if the counts changed to zero.

So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer.  And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Reported-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/pipe.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 5a34d6c22d4c..2144507447c5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -722,9 +722,10 @@ pipe_release(struct inode *inode, struct file *file)
 	if (file->f_mode & FMODE_WRITE)
 		pipe->writers--;
 
-	if (pipe->readers || pipe->writers) {
-		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
-		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
+	/* Was that the last reader or writer, but not the other side? */
+	if (!pipe->readers != !pipe->writers) {
+		wake_up_interruptible_all(&pipe->rd_wait);
+		wake_up_interruptible_all(&pipe->wr_wait);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
@@ -1026,8 +1027,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 
 static void wake_up_partner(struct pipe_inode_info *pipe)
 {
-	wake_up_interruptible(&pipe->rd_wait);
-	wake_up_interruptible(&pipe->wr_wait);
+	wake_up_interruptible_all(&pipe->rd_wait);
+	wake_up_interruptible_all(&pipe->wr_wait);
 }
 
 static int fifo_open(struct inode *inode, struct file *filp)
@@ -1144,7 +1145,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 
 err_wr:
 	if (!--pipe->writers)
-		wake_up_interruptible(&pipe->rd_wait);
+		wake_up_interruptible_all(&pipe->rd_wait);
 	ret = -ERESTARTSYS;
 	goto err;
 
@@ -1271,8 +1272,9 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	pipe->max_usage = nr_slots;
 	pipe->tail = tail;
 	pipe->head = head;
-	wake_up_interruptible_all(&pipe->rd_wait);
-	wake_up_interruptible_all(&pipe->wr_wait);
+
+	/* This might have made more room for writers */
+	wake_up_interruptible(&pipe->wr_wait);
 	return pipe->max_usage * PAGE_SIZE;
 
 out_revert_acct:
-- 
2.24.0.158.g3fed155289


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18 18:17         ` Linus Torvalds
@ 2020-02-18 18:20           ` Matthew Wilcox
  2020-02-18 18:28             ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2020-02-18 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Vagin, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

On Tue, Feb 18, 2020 at 10:17:30AM -0800, Linus Torvalds wrote:
> @@ -722,9 +722,10 @@ pipe_release(struct inode *inode, struct file *file)
>  	if (file->f_mode & FMODE_WRITE)
>  		pipe->writers--;
>  
> -	if (pipe->readers || pipe->writers) {
> -		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
> -		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
> +	/* Was that the last reader or writer, but not the other side? */
> +	if (!pipe->readers != !pipe->writers) {
> +		wake_up_interruptible_all(&pipe->rd_wait);
> +		wake_up_interruptible_all(&pipe->wr_wait);
>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>  		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>  	}
> @@ -1026,8 +1027,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
>  
>  static void wake_up_partner(struct pipe_inode_info *pipe)
>  {
> -	wake_up_interruptible(&pipe->rd_wait);
> -	wake_up_interruptible(&pipe->wr_wait);
> +	wake_up_interruptible_all(&pipe->rd_wait);
> +	wake_up_interruptible_all(&pipe->wr_wait);
>  }

You don't want to move wake_up_partner() up and call it from pipe_release()?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18 18:20           ` Matthew Wilcox
@ 2020-02-18 18:28             ` Linus Torvalds
  2020-02-18 22:33               ` Andrei Vagin
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2020-02-18 18:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrei Vagin, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 575 bytes --]

On Tue, Feb 18, 2020 at 10:20 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> You don't want to move wake_up_partner() up and call it from pipe_release()?

I was actually thinking of going the other way - two of three users of
wake_up_partner() are redundantly waking up the wrong side, and the
third user is pointlessly written too.

So I was _thinking_ of a patch like the appended (which is on top of
the previous patch), but ended up not doing it. Until you brought it
up.

But I won't bother committing this, since it shouldn't really matter.

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1656 bytes --]

 fs/pipe.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 2144507447c5..79ba61430f9c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1025,12 +1025,6 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 	return cur == *cnt ? -ERESTARTSYS : 0;
 }
 
-static void wake_up_partner(struct pipe_inode_info *pipe)
-{
-	wake_up_interruptible_all(&pipe->rd_wait);
-	wake_up_interruptible_all(&pipe->wr_wait);
-}
-
 static int fifo_open(struct inode *inode, struct file *filp)
 {
 	struct pipe_inode_info *pipe;
@@ -1078,7 +1072,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 	 */
 		pipe->r_counter++;
 		if (pipe->readers++ == 0)
-			wake_up_partner(pipe);
+			wake_up_interruptible_all(&pipe->wr_wait);
 
 		if (!is_pipe && !pipe->writers) {
 			if ((filp->f_flags & O_NONBLOCK)) {
@@ -1104,7 +1098,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 
 		pipe->w_counter++;
 		if (!pipe->writers++)
-			wake_up_partner(pipe);
+			wake_up_interruptible_all(&pipe->rd_wait);
 
 		if (!is_pipe && !pipe->readers) {
 			if (wait_for_partner(pipe, &pipe->r_counter))
@@ -1120,12 +1114,12 @@ static int fifo_open(struct inode *inode, struct file *filp)
 	 *  the process can at least talk to itself.
 	 */
 
-		pipe->readers++;
-		pipe->writers++;
+		if (pipe->readers++ == 0)
+			wake_up_interruptible_all(&pipe->wr_wait);
+		if (pipe->writers++ == 0)
+			wake_up_interruptible_all(&pipe->rd_wait);
 		pipe->r_counter++;
 		pipe->w_counter++;
-		if (pipe->readers == 1 || pipe->writers == 1)
-			wake_up_partner(pipe);
 		break;
 
 	default:

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin
  2020-02-18  9:51   ` Andrei Vagin
@ 2020-02-18 18:53   ` Linus Torvalds
  1 sibling, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2020-02-18 18:53 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Linux Kernel Mailing List, stable, Josh Triplett, linux-fsdevel

On Fri, Feb 14, 2020 at 8:00 AM Sasha Levin <sashal@kernel.org> wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
>
> [ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ]
>
> This makes the pipe code use separate wait-queues and exclusive waiting
> for readers and writers, [..]

Oh, and since I didn't react initially, let me react now that Andrei
found a bug here: why was this patch auto-selected for 5.5 stable in
the first place?

It wasn't really a fix, and there's no Fixes: tag or stable tag in
there. Yeah, there's a reference to an old commit, but that one isn't
even a kernel commit.

Yeah, the performance improvements are quite nice if you hit the case
this matters for, but still..

              Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18 18:28             ` Linus Torvalds
@ 2020-02-18 22:33               ` Andrei Vagin
  2020-02-18 23:03                 ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Andrei Vagin @ 2020-02-18 22:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

On Tue, Feb 18, 2020 at 10:28:23AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2020 at 10:20 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > You don't want to move wake_up_partner() up and call it from pipe_release()?
> 
> I was actually thinking of going the other way - two of three users of
> wake_up_partner() are redundantly waking up the wrong side, and the
> third user is pointlessly written too.
> 
> So I was _thinking_ of a patch like the appended (which is on top of
> the previous patch), but ended up not doing it. Until you brought it
> up.
> 
> But I won't bother committing this, since it shouldn't really matter.

I run CRIU tests on the kernel with both these patches. Everything work
as expected. Thank you for the fix.

> 
>                  Linus

>  fs/pipe.c | 18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 2144507447c5..79ba61430f9c 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1025,12 +1025,6 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
>  	return cur == *cnt ? -ERESTARTSYS : 0;
>  }
>  
> -static void wake_up_partner(struct pipe_inode_info *pipe)
> -{
> -	wake_up_interruptible_all(&pipe->rd_wait);
> -	wake_up_interruptible_all(&pipe->wr_wait);
> -}
> -
>  static int fifo_open(struct inode *inode, struct file *filp)
>  {
>  	struct pipe_inode_info *pipe;
> @@ -1078,7 +1072,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
>  	 */
>  		pipe->r_counter++;
>  		if (pipe->readers++ == 0)
> -			wake_up_partner(pipe);
> +			wake_up_interruptible_all(&pipe->wr_wait);
>  
>  		if (!is_pipe && !pipe->writers) {
>  			if ((filp->f_flags & O_NONBLOCK)) {
> @@ -1104,7 +1098,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
>  
>  		pipe->w_counter++;
>  		if (!pipe->writers++)
> -			wake_up_partner(pipe);
> +			wake_up_interruptible_all(&pipe->rd_wait);
>  
>  		if (!is_pipe && !pipe->readers) {
>  			if (wait_for_partner(pipe, &pipe->r_counter))
> @@ -1120,12 +1114,12 @@ static int fifo_open(struct inode *inode, struct file *filp)
>  	 *  the process can at least talk to itself.
>  	 */
>  
> -		pipe->readers++;
> -		pipe->writers++;
> +		if (pipe->readers++ == 0)
> +			wake_up_interruptible_all(&pipe->wr_wait);
> +		if (pipe->writers++ == 0)
> +			wake_up_interruptible_all(&pipe->rd_wait);
>  		pipe->r_counter++;
>  		pipe->w_counter++;
> -		if (pipe->readers == 1 || pipe->writers == 1)
> -			wake_up_partner(pipe);
>  		break;
>  
>  	default:


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18 22:33               ` Andrei Vagin
@ 2020-02-18 23:03                 ` Linus Torvalds
  2020-03-05 18:19                   ` Andrei Vagin
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2020-02-18 23:03 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

On Tue, Feb 18, 2020 at 2:33 PM Andrei Vagin <avagin@gmail.com> wrote:
>
> I run CRIU tests on the kernel with both these patches. Everything work
> as expected.

Thanks. I've added your tested-by and pushed out the fix.

           Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-02-18 23:03                 ` Linus Torvalds
@ 2020-03-05 18:19                   ` Andrei Vagin
  2020-03-05 18:40                     ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Andrei Vagin @ 2020-03-05 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

Hello Linus,

After this change, one more criu test became flaky. This is due to one
of corner cases, so I am not sure that we need to fix something in the
kernel. I have fixed this issue in the test. I am not sure that this
will affect any real applications.

Here is the reproducer:

#include <unistd.h>
#include <stdio.h>

int main()
{
    char buf[1<<20];
    int pid, p[2], ret;

    if (pipe(p) < 0)
        return 1;
    pid = fork();
    if (pid == 0) {
        close(p[1]);

        ret = read(p[0], buf, sizeof(buf));
        if (ret < 0)
            return 1;
        printf("read -> %d\n", ret);
        return 0;
    }
    close(p[0]);
    ret = write(p[1], buf, sizeof(buf));
    if (ret < 0)
        return 1;
    printf("write -> %d\n", ret);
    return 0;
}

Before this change:
[avagin@laptop fifo]$ uname -a
Linux laptop 5.3.7-200.fc30.x86_64 #1 SMP Fri Oct 18 20:13:59 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux
[avagin@laptop fifo]$ strace -e read,write,pipe -f ./pipe_bigbuf
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260r\2\0\0\0\0\0"...,
832) = 832
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0gZ\316<\240z\v\206=\360\37F\32{\t\204"...,
68) = 68
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0gZ\316<\240z\v\206=\360\37F\32{\t\204"...,
68) = 68
pipe([3, 4])                            = 0
strace: Process 622350 attached
[pid 622349] write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576 <unfinished ...>
[pid 622350] read(3,  <unfinished ...>
[pid 622349] <... write resumed> )      = 1048576
[pid 622350] <... read resumed>
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576) = 1048576
[pid 622349] write(1, "write -> 1048576\n", 17write -> 1048576
) = 17
[pid 622350] write(1, "read -> 1048576\n", 16read -> 1048576
) = 16
[pid 622349] +++ exited with 0 +++
+++ exited with 0 +++

After this change:
[root@fc24 ~]# strace -e read,write,pipe -f ./pipe_bigbuf
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260r\2\0\0\0\0\0"...,
832) = 832
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0L\355\265_\4c\17r@ix\305q\26W\242"...,
68) = 68
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0L\355\265_\4c\17r@ix\305q\26W\242"...,
68) = 68
pipe([3, 4])                            = 0
strace: Process 4946 attached
[pid  4945] write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576 <unfinished ...>
[pid  4946] read(3,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576) = 65536
[pid  4946] write(1, "read -> 65536\n", 14read -> 65536
) = 14
[pid  4945] <... write resumed>)        = 131072
[pid  4946] +++ exited with 0 +++
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=4945, si_uid=0} ---
+++ killed by SIGPIPE +++

On Tue, Feb 18, 2020 at 3:03 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Feb 18, 2020 at 2:33 PM Andrei Vagin <avagin@gmail.com> wrote:
> >
> > I run CRIU tests on the kernel with both these patches. Everything work
> > as expected.
>
> Thanks. I've added your tested-by and pushed out the fix.
>
>            Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-03-05 18:19                   ` Andrei Vagin
@ 2020-03-05 18:40                     ` Linus Torvalds
  2020-03-05 19:54                       ` Andrei Vagin
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2020-03-05 18:40 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

On Thu, Mar 5, 2020 at 12:20 PM Andrei Vagin <avagin@gmail.com> wrote:
>
> After this change, one more criu test became flaky. This is due to one
> of corner cases, so I am not sure that we need to fix something in the
> kernel. I have fixed this issue in the test. I am not sure that this
> will affect any real applications.

It's an interesting test-case, but it's really not doing anything you
should rely on.

The code basically expects a pipe write() to be "atomic" for something
bigger than PIPE_BUF. We've never really guaranteed that (and POSIX
doesn't), but we had this special case where readers would continue to
read as long as there was an active writer, which kind of approximated
that for some cases (and your test-case in particular).

A reader that wants to read everything should do multiple read() calls
until it gets an EOF (or gets the expected buffer size). A regular
read() on a pipe can simply always return a partial buffer (it will
always do so in the case of signals, but what you're seeing is that it
will also now do it if the kernel buffers emptied even when there was
a writer that was ready to fill them again).

And I suspect it wasn't actually the commit you point to that changes
the behavior, I think it's actually the "pipe: remove
'waiting_writers' merging logic" that changed behavior for your test

But because it's then timing-dependent on whether the reader gets all
the data or not, it might bisect to any commit after that point.
Particularly since some of the other commits change timing too..

We could re-introduce the "continue reading while there are active
writers" logic, but if this is the only test that triggers that, I'd
prefer to wait until some real user notices...

But if CRIU itself depends on this behavior (rather than just a test),
then I guess we need to.

So is it just a test-case, or does CRIU itself depend on that "reads
get full buffers"? As mentioned, that really _is_ fundamentally broken
if there is any chance of signals..

                Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing
  2020-03-05 18:40                     ` Linus Torvalds
@ 2020-03-05 19:54                       ` Andrei Vagin
  0 siblings, 0 replies; 14+ messages in thread
From: Andrei Vagin @ 2020-03-05 19:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel

On Thu, Mar 5, 2020 at 10:41 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Mar 5, 2020 at 12:20 PM Andrei Vagin <avagin@gmail.com> wrote:
> >
> > After this change, one more criu test became flaky. This is due to one
> > of corner cases, so I am not sure that we need to fix something in the
> > kernel. I have fixed this issue in the test. I am not sure that this
> > will affect any real applications.
>
> It's an interesting test-case, but it's really not doing anything you
> should rely on.

I'm agree with this.

> But if CRIU itself depends on this behavior (rather than just a test),
> then I guess we need to.
>
> So is it just a test-case, or does CRIU itself depend on that "reads
> get full buffers"? As mentioned, that really _is_ fundamentally broken
> if there is any chance of signals..

No, it doesn't. I'm agree that we can wait  an report from a real app.

Thanks!
Andrei

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-03-05 19:54 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200214154854.6746-1-sashal@kernel.org>
2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 539/542] fuse: don't overflow LLONG_MAX with end offset Sasha Levin
2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin
2020-02-18  9:51   ` Andrei Vagin
2020-02-18 17:36     ` Linus Torvalds
2020-02-18 17:54       ` Linus Torvalds
2020-02-18 18:17         ` Linus Torvalds
2020-02-18 18:20           ` Matthew Wilcox
2020-02-18 18:28             ` Linus Torvalds
2020-02-18 22:33               ` Andrei Vagin
2020-02-18 23:03                 ` Linus Torvalds
2020-03-05 18:19                   ` Andrei Vagin
2020-03-05 18:40                     ` Linus Torvalds
2020-03-05 19:54                       ` Andrei Vagin
2020-02-18 18:53   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).