* [PATCH AUTOSEL 5.5 539/542] fuse: don't overflow LLONG_MAX with end offset [not found] <20200214154854.6746-1-sashal@kernel.org> @ 2020-02-14 15:48 ` Sasha Levin 2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin 1 sibling, 0 replies; 14+ messages in thread From: Sasha Levin @ 2020-02-14 15:48 UTC (permalink / raw) To: linux-kernel, stable Cc: Miklos Szeredi, Xiao Yang, Sasha Levin, linux-fsdevel From: Miklos Szeredi <mszeredi@redhat.com> [ Upstream commit 2f1398291bf35fe027914ae7a9610d8e601fbfde ] Handle the special case of fuse_readpages() wanting to read the last page of a hugest file possible and overflowing the end offset in the process. This is basically to unbreak xfstests:generic/525 and prevent filesystems from doing bad things with an overflowing offset. Reported-by: Xiao Yang <ice_yangxiao@163.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> --- fs/fuse/file.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 695369f46f92d..3dd37a998ea93 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -803,6 +803,10 @@ static int fuse_do_readpage(struct file *file, struct page *page) attr_ver = fuse_get_attr_version(fc); + /* Don't overflow end offset */ + if (pos + (desc.length - 1) == LLONG_MAX) + desc.length--; + fuse_read_args_fill(&ia, file, pos, desc.length, FUSE_READ); res = fuse_simple_request(fc, &ia.ap.args); if (res < 0) @@ -888,6 +892,14 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file) ap->args.out_pages = true; ap->args.page_zeroing = true; ap->args.page_replace = true; + + /* Don't overflow end offset */ + if (pos + (count - 1) == LLONG_MAX) { + count--; + ap->descs[ap->num_pages - 1].length--; + } + WARN_ON((loff_t) (pos + count) < 0); + fuse_read_args_fill(ia, file, pos, count, FUSE_READ); ia->read.attr_ver = fuse_get_attr_version(fc); if (fc->async_read) { -- 2.20.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing [not found] <20200214154854.6746-1-sashal@kernel.org> 2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 539/542] fuse: don't overflow LLONG_MAX with end offset Sasha Levin @ 2020-02-14 15:48 ` Sasha Levin 2020-02-18 9:51 ` Andrei Vagin 2020-02-18 18:53 ` Linus Torvalds 1 sibling, 2 replies; 14+ messages in thread From: Sasha Levin @ 2020-02-14 15:48 UTC (permalink / raw) To: linux-kernel, stable Cc: Linus Torvalds, Josh Triplett, Sasha Levin, linux-fsdevel From: Linus Torvalds <torvalds@linux-foundation.org> [ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ] This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there are lots of readers waiting for data on a pipe (or, less commonly, lots of writers waiting for a pipe to have space). While this isn't a common occurrence in the traditional "use a pipe as a data transport" case, where you typically only have a single reader and a single writer process, there is one common special case: using a pipe as a source of "locking tokens" rather than for data communication. In particular, the GNU make jobserver code ends up using a pipe as a way to limit parallelism, where each job consumes a token by reading a byte from the jobserver pipe, and releases the token by writing a byte back to the pipe. This pattern is fairly traditional on Unix, and works very well, but will waste a lot of time waking up a lot of processes when only a single reader needs to be woken up when a writer releases a new token. A simplified test-case of just this pipe interaction is to create 64 processes, and then pass a single token around between them (this test-case also intentionally passes another token that gets ignored to test the "wake up next" logic too, in case anybody wonders about it): #include <unistd.h> int main(int argc, char **argv) { int fd[2], counters[2]; pipe(fd); counters[0] = 0; counters[1] = -1; write(fd[1], counters, sizeof(counters)); /* 64 processes */ fork(); fork(); fork(); fork(); fork(); fork(); do { int i; read(fd[0], &i, sizeof(i)); if (i < 0) continue; counters[0] = i+1; write(fd[1], counters, (1+(i & 1)) *sizeof(int)); } while (counters[0] < 1000000); return 0; } and in a perfect world, passing that token around should only cause one context switch per transfer, when the writer of a token causes a directed wakeup of just a single reader. But with the "writer wakes all readers" model we traditionally had, on my test box the above case causes more than an order of magnitude more scheduling: instead of the expected ~1M context switches, "perf stat" shows 231,852.37 msec task-clock # 15.857 CPUs utilized 11,250,961 context-switches # 0.049 M/sec 616,304 cpu-migrations # 0.003 M/sec 1,648 page-faults # 0.007 K/sec 1,097,903,998,514 cycles # 4.735 GHz 120,781,778,352 instructions # 0.11 insn per cycle 27,997,056,043 branches # 120.754 M/sec 283,581,233 branch-misses # 1.01% of all branches 14.621273891 seconds time elapsed 0.018243000 seconds user 3.611468000 seconds sys before this commit. After this commit, I get 5,229.55 msec task-clock # 3.072 CPUs utilized 1,212,233 context-switches # 0.232 M/sec 103,951 cpu-migrations # 0.020 M/sec 1,328 page-faults # 0.254 K/sec 21,307,456,166 cycles # 4.074 GHz 12,947,819,999 instructions # 0.61 insn per cycle 2,881,985,678 branches # 551.096 M/sec 64,267,015 branch-misses # 2.23% of all branches 1.702148350 seconds time elapsed 0.004868000 seconds user 0.110786000 seconds sys instead. Much better. [ Note! This kernel improvement seems to be very good at triggering a race condition in the make jobserver (in GNU make 4.2.1) for me. It's a long known bug that was fixed back in June 2017 by GNU make commit b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs."). But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, so a number of distributions may still have the buggy version. Some have backported the fix to their 4.2.1 release, though, and even without the fix it's quite timing-dependent whether the bug actually is hit. ] Josh Triplett says: "I've been hammering on your pipe fix patch (switching to exclusive wait queues) for a month or so, on several different systems, and I've run into no issues with it. The patch *substantially* improves parallel build times on large (~100 CPU) systems, both with parallel make and with other things that use make's pipe-based jobserver. All current distributions (including stable and long-term stable distributions) have versions of GNU make that no longer have the jobserver bug" Tested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> --- fs/coredump.c | 4 +-- fs/pipe.c | 67 +++++++++++++++++++++++++-------------- fs/splice.c | 8 ++--- include/linux/pipe_fs_i.h | 2 +- 4 files changed, 51 insertions(+), 30 deletions(-) diff --git a/fs/coredump.c b/fs/coredump.c index b1ea7dfbd1494..f8296a82d01df 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file) pipe_lock(pipe); pipe->readers++; pipe->writers--; - wake_up_interruptible_sync(&pipe->wait); + wake_up_interruptible_sync(&pipe->rd_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); pipe_unlock(pipe); @@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file) * We actually want wait_event_freezable() but then we need * to clear TIF_SIGPENDING and improve dump_interrupted(). */ - wait_event_interruptible(pipe->wait, pipe->readers == 1); + wait_event_interruptible(pipe->rd_wait, pipe->readers == 1); pipe_lock(pipe); pipe->readers--; diff --git a/fs/pipe.c b/fs/pipe.c index 57502c3c0fba1..5a34d6c22d4ce 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, /* Drop the inode semaphore and wait for a pipe event, atomically */ void pipe_wait(struct pipe_inode_info *pipe) { - DEFINE_WAIT(wait); + DEFINE_WAIT(rdwait); + DEFINE_WAIT(wrwait); /* * Pipes are system-local resources, so sleeping on them * is considered a noninteractive wait: */ - prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); + prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE); + prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE); pipe_unlock(pipe); schedule(); - finish_wait(&pipe->wait, &wait); + finish_wait(&pipe->rd_wait, &rdwait); + finish_wait(&pipe->wr_wait, &wrwait); pipe_lock(pipe); } @@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) size_t total_len = iov_iter_count(to); struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; - bool was_full; + bool was_full, wake_next_reader = false; ssize_t ret; /* Null read succeeds. */ @@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (!buf->len) { pipe_buf_release(pipe, buf); - spin_lock_irq(&pipe->wait.lock); + spin_lock_irq(&pipe->rd_wait.lock); tail++; pipe->tail = tail; - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); } total_len -= chars; if (!total_len) @@ -384,7 +387,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) * no data. */ if (unlikely(was_full)) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -394,18 +397,23 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) * since we've done any required wakeups and there's no need * to mark anything accessed. And we've dropped the lock. */ - if (wait_event_interruptible(pipe->wait, pipe_readable(pipe)) < 0) + if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) return -ERESTARTSYS; __pipe_lock(pipe); was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); + wake_next_reader = true; } + if (pipe_empty(pipe->head, pipe->tail)) + wake_next_reader = false; __pipe_unlock(pipe); if (was_full) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } + if (wake_next_reader) + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); if (ret > 0) file_accessed(filp); return ret; @@ -437,6 +445,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) size_t total_len = iov_iter_count(from); ssize_t chars; bool was_empty = false; + bool wake_next_writer = false; /* Null write succeeds. */ if (unlikely(total_len == 0)) @@ -515,16 +524,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * it, either the reader will consume it or it'll still * be there for the next write. */ - spin_lock_irq(&pipe->wait.lock); + spin_lock_irq(&pipe->rd_wait.lock); head = pipe->head; if (pipe_full(head, pipe->tail, pipe->max_usage)) { - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); continue; } pipe->head = head + 1; - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); /* Insert it into the buffer array */ buf = &pipe->bufs[head & mask]; @@ -576,14 +585,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) */ __pipe_unlock(pipe); if (was_empty) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } - wait_event_interruptible(pipe->wait, pipe_writable(pipe)); + wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); __pipe_lock(pipe); was_empty = pipe_empty(pipe->head, pipe->tail); + wake_next_writer = true; } out: + if (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) + wake_next_writer = false; __pipe_unlock(pipe); /* @@ -596,9 +608,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * wake up pending jobs */ if (was_empty) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } + if (wake_next_writer) + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) { int err = file_update_time(filp); if (err) @@ -642,12 +656,15 @@ pipe_poll(struct file *filp, poll_table *wait) unsigned int head, tail; /* - * Reading only -- no need for acquiring the semaphore. + * Reading pipe state only -- no need for acquiring the semaphore. * * But because this is racy, the code has to add the * entry to the poll table _first_ .. */ - poll_wait(filp, &pipe->wait, wait); + if (filp->f_mode & FMODE_READ) + poll_wait(filp, &pipe->rd_wait, wait); + if (filp->f_mode & FMODE_WRITE) + poll_wait(filp, &pipe->wr_wait, wait); /* * .. and only then can you do the racy tests. That way, @@ -706,7 +723,8 @@ pipe_release(struct inode *inode, struct file *file) pipe->writers--; if (pipe->readers || pipe->writers) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -789,7 +807,8 @@ struct pipe_inode_info *alloc_pipe_info(void) GFP_KERNEL_ACCOUNT); if (pipe->bufs) { - init_waitqueue_head(&pipe->wait); + init_waitqueue_head(&pipe->rd_wait); + init_waitqueue_head(&pipe->wr_wait); pipe->r_counter = pipe->w_counter = 1; pipe->max_usage = pipe_bufs; pipe->ring_size = pipe_bufs; @@ -1007,7 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) static void wake_up_partner(struct pipe_inode_info *pipe) { - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible(&pipe->wr_wait); } static int fifo_open(struct inode *inode, struct file *filp) @@ -1118,13 +1138,13 @@ static int fifo_open(struct inode *inode, struct file *filp) err_rd: if (!--pipe->readers) - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->wr_wait); ret = -ERESTARTSYS; goto err; err_wr: if (!--pipe->writers) - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->rd_wait); ret = -ERESTARTSYS; goto err; @@ -1251,7 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) pipe->max_usage = nr_slots; pipe->tail = tail; pipe->head = head; - wake_up_interruptible_all(&pipe->wait); + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); return pipe->max_usage * PAGE_SIZE; out_revert_acct: diff --git a/fs/splice.c b/fs/splice.c index 3009652a41c85..d671936d0aad6 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = { static void wakeup_pipe_readers(struct pipe_inode_info *pipe) { smp_mb(); - if (waitqueue_active(&pipe->wait)) - wake_up_interruptible(&pipe->wait); + if (waitqueue_active(&pipe->rd_wait)) + wake_up_interruptible(&pipe->rd_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, static void wakeup_pipe_writers(struct pipe_inode_info *pipe) { smp_mb(); - if (waitqueue_active(&pipe->wait)) - wake_up_interruptible(&pipe->wait); + if (waitqueue_active(&pipe->wr_wait)) + wake_up_interruptible(&pipe->wr_wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index dbcfa68923842..d5765039652a5 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -47,7 +47,7 @@ struct pipe_buffer { **/ struct pipe_inode_info { struct mutex mutex; - wait_queue_head_t wait; + wait_queue_head_t rd_wait, wr_wait; unsigned int head; unsigned int tail; unsigned int max_usage; -- 2.20.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin @ 2020-02-18 9:51 ` Andrei Vagin 2020-02-18 17:36 ` Linus Torvalds 2020-02-18 18:53 ` Linus Torvalds 1 sibling, 1 reply; 14+ messages in thread From: Andrei Vagin @ 2020-02-18 9:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel Hi Linus, This patch breaks one of CRIU tests. Here is a small reproducer: #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> int main() { int p[2]; pid_t p1, p2; int status; if (pipe(p) == -1) return 1; p1 = fork(); if (p1 == 0) { close(p[1]); read(p[0], &status, sizeof(status)); return 0; } p2 = fork(); if (p2 == 0) { close(p[1]); read(p[0], &status, sizeof(status)); return 0; } sleep(1); close(p[1]); wait(&status); wait(&status); return 0; } Here are two readers which are waiting for data but only one of them will be woken up after closing the last writer. The quick fix looks like this: diff --git a/fs/pipe.c b/fs/pipe.c index 5a34d6c22d4c..deaf67239a18 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -412,7 +412,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } - if (wake_next_reader) + if (!pipe->writers || wake_next_reader) wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); if (ret > 0) file_accessed(filp); I've checked that it fixes the issue, but It is too late today to read this code carefully, so I could skip something. Thanks, Andrei On Fri, Feb 14, 2020 at 8:03 AM Sasha Levin <sashal@kernel.org> wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > > [ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ] > > This makes the pipe code use separate wait-queues and exclusive waiting > for readers and writers, avoiding a nasty thundering herd problem when > there are lots of readers waiting for data on a pipe (or, less commonly, > lots of writers waiting for a pipe to have space). > > While this isn't a common occurrence in the traditional "use a pipe as a > data transport" case, where you typically only have a single reader and > a single writer process, there is one common special case: using a pipe > as a source of "locking tokens" rather than for data communication. > > In particular, the GNU make jobserver code ends up using a pipe as a way > to limit parallelism, where each job consumes a token by reading a byte > from the jobserver pipe, and releases the token by writing a byte back > to the pipe. > > This pattern is fairly traditional on Unix, and works very well, but > will waste a lot of time waking up a lot of processes when only a single > reader needs to be woken up when a writer releases a new token. > > A simplified test-case of just this pipe interaction is to create 64 > processes, and then pass a single token around between them (this > test-case also intentionally passes another token that gets ignored to > test the "wake up next" logic too, in case anybody wonders about it): > > #include <unistd.h> > > int main(int argc, char **argv) > { > int fd[2], counters[2]; > > pipe(fd); > counters[0] = 0; > counters[1] = -1; > write(fd[1], counters, sizeof(counters)); > > /* 64 processes */ > fork(); fork(); fork(); fork(); fork(); fork(); > > do { > int i; > read(fd[0], &i, sizeof(i)); > if (i < 0) > continue; > counters[0] = i+1; > write(fd[1], counters, (1+(i & 1)) *sizeof(int)); > } while (counters[0] < 1000000); > return 0; > } > > and in a perfect world, passing that token around should only cause one > context switch per transfer, when the writer of a token causes a > directed wakeup of just a single reader. > > But with the "writer wakes all readers" model we traditionally had, on > my test box the above case causes more than an order of magnitude more > scheduling: instead of the expected ~1M context switches, "perf stat" > shows > > 231,852.37 msec task-clock # 15.857 CPUs utilized > 11,250,961 context-switches # 0.049 M/sec > 616,304 cpu-migrations # 0.003 M/sec > 1,648 page-faults # 0.007 K/sec > 1,097,903,998,514 cycles # 4.735 GHz > 120,781,778,352 instructions # 0.11 insn per cycle > 27,997,056,043 branches # 120.754 M/sec > 283,581,233 branch-misses # 1.01% of all branches > > 14.621273891 seconds time elapsed > > 0.018243000 seconds user > 3.611468000 seconds sys > > before this commit. > > After this commit, I get > > 5,229.55 msec task-clock # 3.072 CPUs utilized > 1,212,233 context-switches # 0.232 M/sec > 103,951 cpu-migrations # 0.020 M/sec > 1,328 page-faults # 0.254 K/sec > 21,307,456,166 cycles # 4.074 GHz > 12,947,819,999 instructions # 0.61 insn per cycle > 2,881,985,678 branches # 551.096 M/sec > 64,267,015 branch-misses # 2.23% of all branches > > 1.702148350 seconds time elapsed > > 0.004868000 seconds user > 0.110786000 seconds sys > > instead. Much better. > > [ Note! This kernel improvement seems to be very good at triggering a > race condition in the make jobserver (in GNU make 4.2.1) for me. It's > a long known bug that was fixed back in June 2017 by GNU make commit > b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to > avoid hangs."). > > But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, > so a number of distributions may still have the buggy version. Some > have backported the fix to their 4.2.1 release, though, and even > without the fix it's quite timing-dependent whether the bug actually > is hit. ] > > Josh Triplett says: > "I've been hammering on your pipe fix patch (switching to exclusive > wait queues) for a month or so, on several different systems, and I've > run into no issues with it. The patch *substantially* improves > parallel build times on large (~100 CPU) systems, both with parallel > make and with other things that use make's pipe-based jobserver. > > All current distributions (including stable and long-term stable > distributions) have versions of GNU make that no longer have the > jobserver bug" > > Tested-by: Josh Triplett <josh@joshtriplett.org> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: Sasha Levin <sashal@kernel.org> > --- > fs/coredump.c | 4 +-- > fs/pipe.c | 67 +++++++++++++++++++++++++-------------- > fs/splice.c | 8 ++--- > include/linux/pipe_fs_i.h | 2 +- > 4 files changed, 51 insertions(+), 30 deletions(-) > > diff --git a/fs/coredump.c b/fs/coredump.c > index b1ea7dfbd1494..f8296a82d01df 100644 > --- a/fs/coredump.c > +++ b/fs/coredump.c > @@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file) > pipe_lock(pipe); > pipe->readers++; > pipe->writers--; > - wake_up_interruptible_sync(&pipe->wait); > + wake_up_interruptible_sync(&pipe->rd_wait); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > pipe_unlock(pipe); > > @@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file) > * We actually want wait_event_freezable() but then we need > * to clear TIF_SIGPENDING and improve dump_interrupted(). > */ > - wait_event_interruptible(pipe->wait, pipe->readers == 1); > + wait_event_interruptible(pipe->rd_wait, pipe->readers == 1); > > pipe_lock(pipe); > pipe->readers--; > diff --git a/fs/pipe.c b/fs/pipe.c > index 57502c3c0fba1..5a34d6c22d4ce 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, > /* Drop the inode semaphore and wait for a pipe event, atomically */ > void pipe_wait(struct pipe_inode_info *pipe) > { > - DEFINE_WAIT(wait); > + DEFINE_WAIT(rdwait); > + DEFINE_WAIT(wrwait); > > /* > * Pipes are system-local resources, so sleeping on them > * is considered a noninteractive wait: > */ > - prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); > + prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE); > + prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE); > pipe_unlock(pipe); > schedule(); > - finish_wait(&pipe->wait, &wait); > + finish_wait(&pipe->rd_wait, &rdwait); > + finish_wait(&pipe->wr_wait, &wrwait); > pipe_lock(pipe); > } > > @@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > size_t total_len = iov_iter_count(to); > struct file *filp = iocb->ki_filp; > struct pipe_inode_info *pipe = filp->private_data; > - bool was_full; > + bool was_full, wake_next_reader = false; > ssize_t ret; > > /* Null read succeeds. */ > @@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > > if (!buf->len) { > pipe_buf_release(pipe, buf); > - spin_lock_irq(&pipe->wait.lock); > + spin_lock_irq(&pipe->rd_wait.lock); > tail++; > pipe->tail = tail; > - spin_unlock_irq(&pipe->wait.lock); > + spin_unlock_irq(&pipe->rd_wait.lock); > } > total_len -= chars; > if (!total_len) > @@ -384,7 +387,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > * no data. > */ > if (unlikely(was_full)) { > - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); > + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > } > > @@ -394,18 +397,23 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > * since we've done any required wakeups and there's no need > * to mark anything accessed. And we've dropped the lock. > */ > - if (wait_event_interruptible(pipe->wait, pipe_readable(pipe)) < 0) > + if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > return -ERESTARTSYS; > > __pipe_lock(pipe); > was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); > + wake_next_reader = true; > } > + if (pipe_empty(pipe->head, pipe->tail)) > + wake_next_reader = false; > __pipe_unlock(pipe); > > if (was_full) { > - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); > + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > } > + if (wake_next_reader) > + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > if (ret > 0) > file_accessed(filp); > return ret; > @@ -437,6 +445,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) > size_t total_len = iov_iter_count(from); > ssize_t chars; > bool was_empty = false; > + bool wake_next_writer = false; > > /* Null write succeeds. */ > if (unlikely(total_len == 0)) > @@ -515,16 +524,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) > * it, either the reader will consume it or it'll still > * be there for the next write. > */ > - spin_lock_irq(&pipe->wait.lock); > + spin_lock_irq(&pipe->rd_wait.lock); > > head = pipe->head; > if (pipe_full(head, pipe->tail, pipe->max_usage)) { > - spin_unlock_irq(&pipe->wait.lock); > + spin_unlock_irq(&pipe->rd_wait.lock); > continue; > } > > pipe->head = head + 1; > - spin_unlock_irq(&pipe->wait.lock); > + spin_unlock_irq(&pipe->rd_wait.lock); > > /* Insert it into the buffer array */ > buf = &pipe->bufs[head & mask]; > @@ -576,14 +585,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) > */ > __pipe_unlock(pipe); > if (was_empty) { > - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); > + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > } > - wait_event_interruptible(pipe->wait, pipe_writable(pipe)); > + wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); > __pipe_lock(pipe); > was_empty = pipe_empty(pipe->head, pipe->tail); > + wake_next_writer = true; > } > out: > + if (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) > + wake_next_writer = false; > __pipe_unlock(pipe); > > /* > @@ -596,9 +608,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) > * wake up pending jobs > */ > if (was_empty) { > - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); > + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > } > + if (wake_next_writer) > + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) { > int err = file_update_time(filp); > if (err) > @@ -642,12 +656,15 @@ pipe_poll(struct file *filp, poll_table *wait) > unsigned int head, tail; > > /* > - * Reading only -- no need for acquiring the semaphore. > + * Reading pipe state only -- no need for acquiring the semaphore. > * > * But because this is racy, the code has to add the > * entry to the poll table _first_ .. > */ > - poll_wait(filp, &pipe->wait, wait); > + if (filp->f_mode & FMODE_READ) > + poll_wait(filp, &pipe->rd_wait, wait); > + if (filp->f_mode & FMODE_WRITE) > + poll_wait(filp, &pipe->wr_wait, wait); > > /* > * .. and only then can you do the racy tests. That way, > @@ -706,7 +723,8 @@ pipe_release(struct inode *inode, struct file *file) > pipe->writers--; > > if (pipe->readers || pipe->writers) { > - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP); > + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP); > + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > } > @@ -789,7 +807,8 @@ struct pipe_inode_info *alloc_pipe_info(void) > GFP_KERNEL_ACCOUNT); > > if (pipe->bufs) { > - init_waitqueue_head(&pipe->wait); > + init_waitqueue_head(&pipe->rd_wait); > + init_waitqueue_head(&pipe->wr_wait); > pipe->r_counter = pipe->w_counter = 1; > pipe->max_usage = pipe_bufs; > pipe->ring_size = pipe_bufs; > @@ -1007,7 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) > > static void wake_up_partner(struct pipe_inode_info *pipe) > { > - wake_up_interruptible(&pipe->wait); > + wake_up_interruptible(&pipe->rd_wait); > + wake_up_interruptible(&pipe->wr_wait); > } > > static int fifo_open(struct inode *inode, struct file *filp) > @@ -1118,13 +1138,13 @@ static int fifo_open(struct inode *inode, struct file *filp) > > err_rd: > if (!--pipe->readers) > - wake_up_interruptible(&pipe->wait); > + wake_up_interruptible(&pipe->wr_wait); > ret = -ERESTARTSYS; > goto err; > > err_wr: > if (!--pipe->writers) > - wake_up_interruptible(&pipe->wait); > + wake_up_interruptible(&pipe->rd_wait); > ret = -ERESTARTSYS; > goto err; > > @@ -1251,7 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > pipe->max_usage = nr_slots; > pipe->tail = tail; > pipe->head = head; > - wake_up_interruptible_all(&pipe->wait); > + wake_up_interruptible_all(&pipe->rd_wait); > + wake_up_interruptible_all(&pipe->wr_wait); > return pipe->max_usage * PAGE_SIZE; > > out_revert_acct: > diff --git a/fs/splice.c b/fs/splice.c > index 3009652a41c85..d671936d0aad6 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = { > static void wakeup_pipe_readers(struct pipe_inode_info *pipe) > { > smp_mb(); > - if (waitqueue_active(&pipe->wait)) > - wake_up_interruptible(&pipe->wait); > + if (waitqueue_active(&pipe->rd_wait)) > + wake_up_interruptible(&pipe->rd_wait); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > } > > @@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, > static void wakeup_pipe_writers(struct pipe_inode_info *pipe) > { > smp_mb(); > - if (waitqueue_active(&pipe->wait)) > - wake_up_interruptible(&pipe->wait); > + if (waitqueue_active(&pipe->wr_wait)) > + wake_up_interruptible(&pipe->wr_wait); > kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > } > > diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h > index dbcfa68923842..d5765039652a5 100644 > --- a/include/linux/pipe_fs_i.h > +++ b/include/linux/pipe_fs_i.h > @@ -47,7 +47,7 @@ struct pipe_buffer { > **/ > struct pipe_inode_info { > struct mutex mutex; > - wait_queue_head_t wait; > + wait_queue_head_t rd_wait, wr_wait; > unsigned int head; > unsigned int tail; > unsigned int max_usage; > -- > 2.20.1 > ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 9:51 ` Andrei Vagin @ 2020-02-18 17:36 ` Linus Torvalds 2020-02-18 17:54 ` Linus Torvalds 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2020-02-18 17:36 UTC (permalink / raw) To: Andrei Vagin; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 680 bytes --] On Tue, Feb 18, 2020 at 1:51 AM Andrei Vagin <avagin@gmail.com> wrote: > > This patch breaks one of CRIU tests. Here is a small reproducer: Good catch. > The quick fix looks like this: That one works, but is not really right. The things that change the number of readers or writers should simply use "wake_up_all()". I thought we did that already, but no - there _was_ one place where we did it, but that was for the pipe buffer size case, and in that case it's actually pointless. That case acts just like a "new space or data was added" So I think the right fix is the attached patch. Since you had such a lovely test-case, let me go test it too ;) Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1227 bytes --] fs/pipe.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 5a34d6c22d4c..76e7f66fe2fe 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1026,8 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) static void wake_up_partner(struct pipe_inode_info *pipe) { - wake_up_interruptible(&pipe->rd_wait); - wake_up_interruptible(&pipe->wr_wait); + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); } static int fifo_open(struct inode *inode, struct file *filp) @@ -1144,7 +1144,7 @@ static int fifo_open(struct inode *inode, struct file *filp) err_wr: if (!--pipe->writers) - wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->rd_wait); ret = -ERESTARTSYS; goto err; @@ -1271,8 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) pipe->max_usage = nr_slots; pipe->tail = tail; pipe->head = head; - wake_up_interruptible_all(&pipe->rd_wait); - wake_up_interruptible_all(&pipe->wr_wait); + wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible(&pipe->wr_wait); return pipe->max_usage * PAGE_SIZE; out_revert_acct: ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 17:36 ` Linus Torvalds @ 2020-02-18 17:54 ` Linus Torvalds 2020-02-18 18:17 ` Linus Torvalds 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2020-02-18 17:54 UTC (permalink / raw) To: Andrei Vagin; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 1100 bytes --] On Tue, Feb 18, 2020 at 9:36 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The things that change the number of readers or writers should simply > use "wake_up_all()". > > So I think the right fix is the attached patch. Since you had such a > lovely test-case, let me go test it too ;) Good that I did. I missed the _real_ case of this - pipe_release(). Because that used a different wakeup function. In fact, that case uses wake_up_interruptible_sync_poll(), which doesn't have the "all" version. But it doesn't actually need that fancy thing, since it's only meant for "let's avoid waking up things that don't need these poll keys", and the whole point is that now we're closing the pipe so we should wake up everybody. And in fact the test for "are there readers or writers" was nonsensical. We shouldn't wake up readers just because they still exist. We should wake up readers only if they exist, _and_ there are no writers left (and vice versa). Anyway, new patch attached. This hasn't been tested either, but I'll let you know if it's broken too ;) Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1916 bytes --] fs/pipe.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 5a34d6c22d4c..89d54c1911fe 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -722,9 +722,10 @@ pipe_release(struct inode *inode, struct file *file) if (file->f_mode & FMODE_WRITE) pipe->writers--; - if (pipe->readers || pipe->writers) { - wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP); - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP); + /* Was that the last reader or writer, but not the other side? */ + if (!pipe->readers != !pipe->writers) { + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -1026,8 +1027,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) static void wake_up_partner(struct pipe_inode_info *pipe) { - wake_up_interruptible(&pipe->rd_wait); - wake_up_interruptible(&pipe->wr_wait); + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); } static int fifo_open(struct inode *inode, struct file *filp) @@ -1144,7 +1145,7 @@ static int fifo_open(struct inode *inode, struct file *filp) err_wr: if (!--pipe->writers) - wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->rd_wait); ret = -ERESTARTSYS; goto err; @@ -1271,8 +1272,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) pipe->max_usage = nr_slots; pipe->tail = tail; pipe->head = head; - wake_up_interruptible_all(&pipe->rd_wait); - wake_up_interruptible_all(&pipe->wr_wait); + wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible(&pipe->wr_wait); return pipe->max_usage * PAGE_SIZE; out_revert_acct: ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 17:54 ` Linus Torvalds @ 2020-02-18 18:17 ` Linus Torvalds 2020-02-18 18:20 ` Matthew Wilcox 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2020-02-18 18:17 UTC (permalink / raw) To: Andrei Vagin; +Cc: LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 483 bytes --] On Tue, Feb 18, 2020 at 9:54 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Anyway, new patch attached. This hasn't been tested either, but I'll > let you know if it's broken too ;) That one looks good. Some small cosmetic edits later, and with a commit log it looks like the appended. If you're testing this and it works for your full CRIU test case too, I'll add your tested-by if I get it before I end up pushing things out later today, Linus [-- Attachment #2: 0001-pipe-make-sure-to-wake-up-everybody-when-the-last-re.patch --] [-- Type: text/x-patch, Size: 4776 bytes --] From 7e2fe96d1b760eb3e7dec771db7b8983d5b6c25c Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, 18 Feb 2020 10:12:58 -0800 Subject: [PATCH] pipe: make sure to wake up everybody when the last reader/writer closes Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") broke one of the CRIU tests. He even has a trivial reproducer: #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> int main() { int p[2]; pid_t p1, p2; int status; if (pipe(p) == -1) return 1; p1 = fork(); if (p1 == 0) { close(p[1]); read(p[0], &status, sizeof(status)); return 0; } p2 = fork(); if (p2 == 0) { close(p[1]); read(p[0], &status, sizeof(status)); return 0; } sleep(1); close(p[1]); wait(&status); wait(&status); return 0; } and the problem - once he points it out - is obvious. We use these nice exclusive waits, but when the last writer goes away, it then needs to wake up _every_ reader (and conversely, the last reader disappearing needs to wake every writer, of course). In fact, when going through this, we had several small oddities around how to wake things. We did in fact wake every reader when we changed the size of the pipe buffers. But that's entirely pointless, since that just acts as a possible source of new space - no new data to read. And when we change the size of the buffer, we don't need to wake all writers even when we add space - that case acts just as if somebody made space by reading, and any writer that finds itself not filling it up entirely will wake the next one. On the other hand, on the exit path, we tried to limit the wakeups with the proper poll keys etc, which is entirely pointless, because at that point we obviously need to wake up everybody. So don't do that: just wake up everybody - but only do that if the counts changed to zero. So fix those non-IO wakeups to be more proper: space change doesn't add any new data, but it might make room for writers, so it wakes up a writer. And the actual changes to reader/writer counts should wake up everybody, since everybody is affected (ie readers will all see EOF if the writers have gone away, and writers will all get EPIPE if all readers have gone away). Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing") Reported-by: Andrei Vagin <avagin@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- fs/pipe.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 5a34d6c22d4c..2144507447c5 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -722,9 +722,10 @@ pipe_release(struct inode *inode, struct file *file) if (file->f_mode & FMODE_WRITE) pipe->writers--; - if (pipe->readers || pipe->writers) { - wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP); - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP); + /* Was that the last reader or writer, but not the other side? */ + if (!pipe->readers != !pipe->writers) { + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -1026,8 +1027,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) static void wake_up_partner(struct pipe_inode_info *pipe) { - wake_up_interruptible(&pipe->rd_wait); - wake_up_interruptible(&pipe->wr_wait); + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); } static int fifo_open(struct inode *inode, struct file *filp) @@ -1144,7 +1145,7 @@ static int fifo_open(struct inode *inode, struct file *filp) err_wr: if (!--pipe->writers) - wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->rd_wait); ret = -ERESTARTSYS; goto err; @@ -1271,8 +1272,9 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) pipe->max_usage = nr_slots; pipe->tail = tail; pipe->head = head; - wake_up_interruptible_all(&pipe->rd_wait); - wake_up_interruptible_all(&pipe->wr_wait); + + /* This might have made more room for writers */ + wake_up_interruptible(&pipe->wr_wait); return pipe->max_usage * PAGE_SIZE; out_revert_acct: -- 2.24.0.158.g3fed155289 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 18:17 ` Linus Torvalds @ 2020-02-18 18:20 ` Matthew Wilcox 2020-02-18 18:28 ` Linus Torvalds 0 siblings, 1 reply; 14+ messages in thread From: Matthew Wilcox @ 2020-02-18 18:20 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Vagin, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel On Tue, Feb 18, 2020 at 10:17:30AM -0800, Linus Torvalds wrote: > @@ -722,9 +722,10 @@ pipe_release(struct inode *inode, struct file *file) > if (file->f_mode & FMODE_WRITE) > pipe->writers--; > > - if (pipe->readers || pipe->writers) { > - wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP); > - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP); > + /* Was that the last reader or writer, but not the other side? */ > + if (!pipe->readers != !pipe->writers) { > + wake_up_interruptible_all(&pipe->rd_wait); > + wake_up_interruptible_all(&pipe->wr_wait); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > } > @@ -1026,8 +1027,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) > > static void wake_up_partner(struct pipe_inode_info *pipe) > { > - wake_up_interruptible(&pipe->rd_wait); > - wake_up_interruptible(&pipe->wr_wait); > + wake_up_interruptible_all(&pipe->rd_wait); > + wake_up_interruptible_all(&pipe->wr_wait); > } You don't want to move wake_up_partner() up and call it from pipe_release()? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 18:20 ` Matthew Wilcox @ 2020-02-18 18:28 ` Linus Torvalds 2020-02-18 22:33 ` Andrei Vagin 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2020-02-18 18:28 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrei Vagin, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 575 bytes --] On Tue, Feb 18, 2020 at 10:20 AM Matthew Wilcox <willy@infradead.org> wrote: > > You don't want to move wake_up_partner() up and call it from pipe_release()? I was actually thinking of going the other way - two of three users of wake_up_partner() are redundantly waking up the wrong side, and the third user is pointlessly written too. So I was _thinking_ of a patch like the appended (which is on top of the previous patch), but ended up not doing it. Until you brought it up. But I won't bother committing this, since it shouldn't really matter. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1656 bytes --] fs/pipe.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 2144507447c5..79ba61430f9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1025,12 +1025,6 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) return cur == *cnt ? -ERESTARTSYS : 0; } -static void wake_up_partner(struct pipe_inode_info *pipe) -{ - wake_up_interruptible_all(&pipe->rd_wait); - wake_up_interruptible_all(&pipe->wr_wait); -} - static int fifo_open(struct inode *inode, struct file *filp) { struct pipe_inode_info *pipe; @@ -1078,7 +1072,7 @@ static int fifo_open(struct inode *inode, struct file *filp) */ pipe->r_counter++; if (pipe->readers++ == 0) - wake_up_partner(pipe); + wake_up_interruptible_all(&pipe->wr_wait); if (!is_pipe && !pipe->writers) { if ((filp->f_flags & O_NONBLOCK)) { @@ -1104,7 +1098,7 @@ static int fifo_open(struct inode *inode, struct file *filp) pipe->w_counter++; if (!pipe->writers++) - wake_up_partner(pipe); + wake_up_interruptible_all(&pipe->rd_wait); if (!is_pipe && !pipe->readers) { if (wait_for_partner(pipe, &pipe->r_counter)) @@ -1120,12 +1114,12 @@ static int fifo_open(struct inode *inode, struct file *filp) * the process can at least talk to itself. */ - pipe->readers++; - pipe->writers++; + if (pipe->readers++ == 0) + wake_up_interruptible_all(&pipe->wr_wait); + if (pipe->writers++ == 0) + wake_up_interruptible_all(&pipe->rd_wait); pipe->r_counter++; pipe->w_counter++; - if (pipe->readers == 1 || pipe->writers == 1) - wake_up_partner(pipe); break; default: ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 18:28 ` Linus Torvalds @ 2020-02-18 22:33 ` Andrei Vagin 2020-02-18 23:03 ` Linus Torvalds 0 siblings, 1 reply; 14+ messages in thread From: Andrei Vagin @ 2020-02-18 22:33 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel On Tue, Feb 18, 2020 at 10:28:23AM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2020 at 10:20 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > You don't want to move wake_up_partner() up and call it from pipe_release()? > > I was actually thinking of going the other way - two of three users of > wake_up_partner() are redundantly waking up the wrong side, and the > third user is pointlessly written too. > > So I was _thinking_ of a patch like the appended (which is on top of > the previous patch), but ended up not doing it. Until you brought it > up. > > But I won't bother committing this, since it shouldn't really matter. I run CRIU tests on the kernel with both these patches. Everything work as expected. Thank you for the fix. > > Linus > fs/pipe.c | 18 ++++++------------ > 1 file changed, 6 insertions(+), 12 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index 2144507447c5..79ba61430f9c 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -1025,12 +1025,6 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) > return cur == *cnt ? -ERESTARTSYS : 0; > } > > -static void wake_up_partner(struct pipe_inode_info *pipe) > -{ > - wake_up_interruptible_all(&pipe->rd_wait); > - wake_up_interruptible_all(&pipe->wr_wait); > -} > - > static int fifo_open(struct inode *inode, struct file *filp) > { > struct pipe_inode_info *pipe; > @@ -1078,7 +1072,7 @@ static int fifo_open(struct inode *inode, struct file *filp) > */ > pipe->r_counter++; > if (pipe->readers++ == 0) > - wake_up_partner(pipe); > + wake_up_interruptible_all(&pipe->wr_wait); > > if (!is_pipe && !pipe->writers) { > if ((filp->f_flags & O_NONBLOCK)) { > @@ -1104,7 +1098,7 @@ static int fifo_open(struct inode *inode, struct file *filp) > > pipe->w_counter++; > if (!pipe->writers++) > - wake_up_partner(pipe); > + wake_up_interruptible_all(&pipe->rd_wait); > > if (!is_pipe && !pipe->readers) { > if (wait_for_partner(pipe, &pipe->r_counter)) > @@ -1120,12 +1114,12 @@ static int fifo_open(struct inode *inode, struct file *filp) > * the process can at least talk to itself. > */ > > - pipe->readers++; > - pipe->writers++; > + if (pipe->readers++ == 0) > + wake_up_interruptible_all(&pipe->wr_wait); > + if (pipe->writers++ == 0) > + wake_up_interruptible_all(&pipe->rd_wait); > pipe->r_counter++; > pipe->w_counter++; > - if (pipe->readers == 1 || pipe->writers == 1) > - wake_up_partner(pipe); > break; > > default: ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 22:33 ` Andrei Vagin @ 2020-02-18 23:03 ` Linus Torvalds 2020-03-05 18:19 ` Andrei Vagin 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2020-02-18 23:03 UTC (permalink / raw) To: Andrei Vagin Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel On Tue, Feb 18, 2020 at 2:33 PM Andrei Vagin <avagin@gmail.com> wrote: > > I run CRIU tests on the kernel with both these patches. Everything work > as expected. Thanks. I've added your tested-by and pushed out the fix. Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-18 23:03 ` Linus Torvalds @ 2020-03-05 18:19 ` Andrei Vagin 2020-03-05 18:40 ` Linus Torvalds 0 siblings, 1 reply; 14+ messages in thread From: Andrei Vagin @ 2020-03-05 18:19 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel Hello Linus, After this change, one more criu test became flaky. This is due to one of corner cases, so I am not sure that we need to fix something in the kernel. I have fixed this issue in the test. I am not sure that this will affect any real applications. Here is the reproducer: #include <unistd.h> #include <stdio.h> int main() { char buf[1<<20]; int pid, p[2], ret; if (pipe(p) < 0) return 1; pid = fork(); if (pid == 0) { close(p[1]); ret = read(p[0], buf, sizeof(buf)); if (ret < 0) return 1; printf("read -> %d\n", ret); return 0; } close(p[0]); ret = write(p[1], buf, sizeof(buf)); if (ret < 0) return 1; printf("write -> %d\n", ret); return 0; } Before this change: [avagin@laptop fifo]$ uname -a Linux laptop 5.3.7-200.fc30.x86_64 #1 SMP Fri Oct 18 20:13:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux [avagin@laptop fifo]$ strace -e read,write,pipe -f ./pipe_bigbuf read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260r\2\0\0\0\0\0"..., 832) = 832 read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784) = 784 read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32 read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0gZ\316<\240z\v\206=\360\37F\32{\t\204"..., 68) = 68 read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784) = 784 read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32 read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0gZ\316<\240z\v\206=\360\37F\32{\t\204"..., 68) = 68 pipe([3, 4]) = 0 strace: Process 622350 attached [pid 622349] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576 <unfinished ...> [pid 622350] read(3, <unfinished ...> [pid 622349] <... write resumed> ) = 1048576 [pid 622350] <... read resumed> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576 [pid 622349] write(1, "write -> 1048576\n", 17write -> 1048576 ) = 17 [pid 622350] write(1, "read -> 1048576\n", 16read -> 1048576 ) = 16 [pid 622349] +++ exited with 0 +++ +++ exited with 0 +++ After this change: [root@fc24 ~]# strace -e read,write,pipe -f ./pipe_bigbuf read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260r\2\0\0\0\0\0"..., 832) = 832 read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784) = 784 read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32 read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0L\355\265_\4c\17r@ix\305q\26W\242"..., 68) = 68 read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784) = 784 read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32) = 32 read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0L\355\265_\4c\17r@ix\305q\26W\242"..., 68) = 68 pipe([3, 4]) = 0 strace: Process 4946 attached [pid 4945] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576 <unfinished ...> [pid 4946] read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 65536 [pid 4946] write(1, "read -> 65536\n", 14read -> 65536 ) = 14 [pid 4945] <... write resumed>) = 131072 [pid 4946] +++ exited with 0 +++ --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=4945, si_uid=0} --- +++ killed by SIGPIPE +++ On Tue, Feb 18, 2020 at 3:03 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, Feb 18, 2020 at 2:33 PM Andrei Vagin <avagin@gmail.com> wrote: > > > > I run CRIU tests on the kernel with both these patches. Everything work > > as expected. > > Thanks. I've added your tested-by and pushed out the fix. > > Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-03-05 18:19 ` Andrei Vagin @ 2020-03-05 18:40 ` Linus Torvalds 2020-03-05 19:54 ` Andrei Vagin 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2020-03-05 18:40 UTC (permalink / raw) To: Andrei Vagin Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel On Thu, Mar 5, 2020 at 12:20 PM Andrei Vagin <avagin@gmail.com> wrote: > > After this change, one more criu test became flaky. This is due to one > of corner cases, so I am not sure that we need to fix something in the > kernel. I have fixed this issue in the test. I am not sure that this > will affect any real applications. It's an interesting test-case, but it's really not doing anything you should rely on. The code basically expects a pipe write() to be "atomic" for something bigger than PIPE_BUF. We've never really guaranteed that (and POSIX doesn't), but we had this special case where readers would continue to read as long as there was an active writer, which kind of approximated that for some cases (and your test-case in particular). A reader that wants to read everything should do multiple read() calls until it gets an EOF (or gets the expected buffer size). A regular read() on a pipe can simply always return a partial buffer (it will always do so in the case of signals, but what you're seeing is that it will also now do it if the kernel buffers emptied even when there was a writer that was ready to fill them again). And I suspect it wasn't actually the commit you point to that changes the behavior, I think it's actually the "pipe: remove 'waiting_writers' merging logic" that changed behavior for your test But because it's then timing-dependent on whether the reader gets all the data or not, it might bisect to any commit after that point. Particularly since some of the other commits change timing too.. We could re-introduce the "continue reading while there are active writers" logic, but if this is the only test that triggers that, I'd prefer to wait until some real user notices... But if CRIU itself depends on this behavior (rather than just a test), then I guess we need to. So is it just a test-case, or does CRIU itself depend on that "reads get full buffers"? As mentioned, that really _is_ fundamentally broken if there is any chance of signals.. Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-03-05 18:40 ` Linus Torvalds @ 2020-03-05 19:54 ` Andrei Vagin 0 siblings, 0 replies; 14+ messages in thread From: Andrei Vagin @ 2020-03-05 19:54 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, LKML, Sasha Levin, stable, Josh Triplett, linux-fsdevel On Thu, Mar 5, 2020 at 10:41 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Mar 5, 2020 at 12:20 PM Andrei Vagin <avagin@gmail.com> wrote: > > > > After this change, one more criu test became flaky. This is due to one > > of corner cases, so I am not sure that we need to fix something in the > > kernel. I have fixed this issue in the test. I am not sure that this > > will affect any real applications. > > It's an interesting test-case, but it's really not doing anything you > should rely on. I'm agree with this. > But if CRIU itself depends on this behavior (rather than just a test), > then I guess we need to. > > So is it just a test-case, or does CRIU itself depend on that "reads > get full buffers"? As mentioned, that really _is_ fundamentally broken > if there is any chance of signals.. No, it doesn't. I'm agree that we can wait an report from a real app. Thanks! Andrei ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing 2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin 2020-02-18 9:51 ` Andrei Vagin @ 2020-02-18 18:53 ` Linus Torvalds 1 sibling, 0 replies; 14+ messages in thread From: Linus Torvalds @ 2020-02-18 18:53 UTC (permalink / raw) To: Sasha Levin Cc: Linux Kernel Mailing List, stable, Josh Triplett, linux-fsdevel On Fri, Feb 14, 2020 at 8:00 AM Sasha Levin <sashal@kernel.org> wrote: > > From: Linus Torvalds <torvalds@linux-foundation.org> > > [ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ] > > This makes the pipe code use separate wait-queues and exclusive waiting > for readers and writers, [..] Oh, and since I didn't react initially, let me react now that Andrei found a bug here: why was this patch auto-selected for 5.5 stable in the first place? It wasn't really a fix, and there's no Fixes: tag or stable tag in there. Yeah, there's a reference to an old commit, but that one isn't even a kernel commit. Yeah, the performance improvements are quite nice if you hit the case this matters for, but still.. Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2020-03-05 19:54 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20200214154854.6746-1-sashal@kernel.org> 2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 539/542] fuse: don't overflow LLONG_MAX with end offset Sasha Levin 2020-02-14 15:48 ` [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing Sasha Levin 2020-02-18 9:51 ` Andrei Vagin 2020-02-18 17:36 ` Linus Torvalds 2020-02-18 17:54 ` Linus Torvalds 2020-02-18 18:17 ` Linus Torvalds 2020-02-18 18:20 ` Matthew Wilcox 2020-02-18 18:28 ` Linus Torvalds 2020-02-18 22:33 ` Andrei Vagin 2020-02-18 23:03 ` Linus Torvalds 2020-03-05 18:19 ` Andrei Vagin 2020-03-05 18:40 ` Linus Torvalds 2020-03-05 19:54 ` Andrei Vagin 2020-02-18 18:53 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).