* [PATCH 0/2] pipe: Fixes [ver #2] @ 2019-12-05 22:30 David Howells 2019-12-05 22:30 ` [PATCH 1/2] pipe: Remove assertion from pipe_poll() " David Howells ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: David Howells @ 2019-12-05 22:30 UTC (permalink / raw) To: torvalds, ebiggers; +Cc: dhowells, viro, linux-fsdevel, linux-kernel Hi Linus, Eric, Here are a couple of patches to fix bugs syzbot found in the pipe changes: (1) An assertion check will sometimes trip when polling a pipe because the ring size and indices used are approximate and may be being changed simultaneously. An equivalent approximate calculation was done previously, but without the assertion check, so I've just dropped the check. To make it accurate, the pipe mutex would need to be taken or the spin lock could be used - but usage of the spinlock would need to be rolled out into splice, iov_iter and other places for that. (2) The index mask and the max_usage values cannot be cached across pipe_wait() as F_SETPIPE_SZ could have been called during the wait. This can cause pipe_write() to break. David --- David Howells (2): pipe: Remove assertion from pipe_poll() pipe: Fix missing mask update after pipe_wait() fs/pipe.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 1/2] pipe: Remove assertion from pipe_poll() [ver #2] 2019-12-05 22:30 [PATCH 0/2] pipe: Fixes [ver #2] David Howells @ 2019-12-05 22:30 ` David Howells 2019-12-05 22:30 ` [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() " David Howells 2019-12-06 13:56 ` [PATCH 0/2] pipe: Fixes " David Sterba 2 siblings, 0 replies; 41+ messages in thread From: David Howells @ 2019-12-05 22:30 UTC (permalink / raw) To: torvalds, ebiggers; +Cc: dhowells, viro, linux-fsdevel, linux-kernel An assertion check was added to pipe_poll() to make sure that the ring occupancy isn't seen to overflow the ring size. However, since no locks are held when the three values are read, it is possible for F_SETPIPE_SZ to intervene and muck up the calculation, thereby causing the oops. Fix this by simply removing the assertion and accepting that the calculation might be approximate. Note that the previous code also had a similar issue, though there was no assertion check, since the occupancy counter and the ring size were not read with a lock held, so it's possible that the poll check might have malfunctioned then too. Also wake up all the waiters so that they can reissue their checks if there was a competing read or write. Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Reported-by: syzbot+d37abaade33a934f16f2@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Biggers <ebiggers@kernel.org> --- fs/pipe.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 648ce440ca85..da782ee251d2 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -579,8 +579,6 @@ pipe_poll(struct file *filp, poll_table *wait) poll_wait(filp, &pipe->wait, wait); - BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size); - /* Reading only -- no need for acquiring the semaphore. */ mask = 0; if (filp->f_mode & FMODE_READ) { @@ -1176,6 +1174,7 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) pipe->max_usage = nr_slots; pipe->tail = tail; pipe->head = head; + wake_up_interruptible_all(&pipe->wait); return pipe->max_usage * PAGE_SIZE; out_revert_acct: ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() [ver #2] 2019-12-05 22:30 [PATCH 0/2] pipe: Fixes [ver #2] David Howells 2019-12-05 22:30 ` [PATCH 1/2] pipe: Remove assertion from pipe_poll() " David Howells @ 2019-12-05 22:30 ` David Howells 2019-12-05 23:58 ` Linus Torvalds 2019-12-06 13:56 ` [PATCH 0/2] pipe: Fixes " David Sterba 2 siblings, 1 reply; 41+ messages in thread From: David Howells @ 2019-12-05 22:30 UTC (permalink / raw) To: torvalds, ebiggers; +Cc: dhowells, viro, linux-fsdevel, linux-kernel Fix pipe_write() to not cache the ring index mask and max_usage as their values are invalidated by calling pipe_wait() because the latter function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them. Without this, pipe_write() may subsequently miscalculate the array indices and pipe fullness, leading to an oops like the following: BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481 Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987 ... CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0 ... Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x113/0x167 lib/dump_stack.c:113 print_address_description.constprop.8.cold.10+0x9/0x31d mm/kasan/report.c:374 __kasan_report.cold.11+0x1b/0x3a mm/kasan/report.c:506 kasan_report+0x12/0x20 mm/kasan/common.c:634 __asan_report_store8_noabort+0x17/0x20 mm/kasan/generic_report.c:137 pipe_write+0xc25/0xe10 fs/pipe.c:481 call_write_iter include/linux/fs.h:1895 [inline] new_sync_write+0x3fd/0x7e0 fs/read_write.c:483 __vfs_write+0x94/0x110 fs/read_write.c:496 vfs_write+0x18a/0x520 fs/read_write.c:558 ksys_write+0x105/0x220 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __x64_sys_write+0x6e/0xb0 fs/read_write.c:620 do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is not a problem for pipe_read() as the mask is recalculated on each pass of the loop, after pipe_wait() has been called. Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Reported-by: syzbot+838eb0878ffd51f27c41@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Biggers <ebiggers@kernel.org> --- fs/pipe.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index da782ee251d2..8061b093140d 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -389,7 +389,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) { struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; - unsigned int head, max_usage, mask; + unsigned int head; ssize_t ret = 0; int do_wakeup = 0; size_t total_len = iov_iter_count(from); @@ -408,13 +408,12 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) } head = pipe->head; - max_usage = pipe->max_usage; - mask = pipe->ring_size - 1; /* We try to merge small writes */ chars = total_len & (PAGE_SIZE-1); /* size of the last buffer */ if (!pipe_empty(head, pipe->tail) && chars != 0) { - struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask]; + struct pipe_buffer *buf = + &pipe->bufs[(head - 1) & (pipe->ring_size - 1)]; int offset = buf->offset + buf->len; if (pipe_buf_can_merge(buf) && offset + chars <= PAGE_SIZE) { @@ -443,8 +442,9 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) } head = pipe->head; - if (!pipe_full(head, pipe->tail, max_usage)) { - struct pipe_buffer *buf = &pipe->bufs[head & mask]; + if (!pipe_full(head, pipe->tail, pipe->max_usage)) { + struct pipe_buffer *buf = + &pipe->bufs[head & (pipe->ring_size - 1)]; struct page *page = pipe->tmp_page; int copied; @@ -465,7 +465,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) spin_lock_irq(&pipe->wait.lock); head = pipe->head; - if (pipe_full(head, pipe->tail, max_usage)) { + if (pipe_full(head, pipe->tail, pipe->max_usage)) { spin_unlock_irq(&pipe->wait.lock); continue; } @@ -484,7 +484,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); /* Insert it into the buffer array */ - buf = &pipe->bufs[head & mask]; + buf = &pipe->bufs[head & (pipe->ring_size - 1)]; buf->page = page; buf->ops = &anon_pipe_buf_ops; buf->offset = 0; @@ -510,7 +510,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) break; } - if (!pipe_full(head, pipe->tail, max_usage)) + if (!pipe_full(head, pipe->tail, pipe->max_usage)) continue; /* Wait for buffer space to become available. */ ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() [ver #2] 2019-12-05 22:30 ` [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() " David Howells @ 2019-12-05 23:58 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-05 23:58 UTC (permalink / raw) To: David Howells Cc: Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Thu, Dec 5, 2019 at 2:30 PM David Howells <dhowells@redhat.com> wrote: > > - struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask]; > + struct pipe_buffer *buf = > + &pipe->bufs[(head - 1) & (pipe->ring_size - 1)]; I changed the two occurrences of this to use a local temporary "mask" variable, to avoid the long lines. It's no longer _caching_ the value, but it makes the code more legible. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-05 22:30 [PATCH 0/2] pipe: Fixes [ver #2] David Howells 2019-12-05 22:30 ` [PATCH 1/2] pipe: Remove assertion from pipe_poll() " David Howells 2019-12-05 22:30 ` [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() " David Howells @ 2019-12-06 13:56 ` David Sterba 2019-12-06 17:09 ` Linus Torvalds ` (2 more replies) 2 siblings, 3 replies; 41+ messages in thread From: David Sterba @ 2019-12-06 13:56 UTC (permalink / raw) To: David Howells; +Cc: torvalds, ebiggers, viro, linux-fsdevel, linux-kernel On Thu, Dec 05, 2019 at 10:30:22PM +0000, David Howells wrote: > David Howells (2): > pipe: Remove assertion from pipe_poll() > pipe: Fix missing mask update after pipe_wait() For reference, I've retested current master (b0d4beaa5a4b7d), that incldes the 2 pipe fixes, the test still hangs. The stack now points to pipe_wait but otherwise seems to be the same: [<0>] pipe_wait+0x72/0xc0 [<0>] pipe_write+0x217/0x4b0 [<0>] new_sync_write+0x11e/0x1c0 [<0>] vfs_write+0xc1/0x1d0 [<0>] kernel_write+0x2c/0x40 [<0>] send_cmd+0x78/0xf0 [btrfs] [<0>] send_extent_data+0x4af/0x52a [btrfs] [<0>] process_extent+0xe5d/0xeb4 [btrfs] [<0>] changed_cb+0xcf5/0xd2f [btrfs] [<0>] send_subvol+0x8af/0xc0d [btrfs] [<0>] btrfs_ioctl_send+0xe2b/0xef0 [btrfs] [<0>] _btrfs_ioctl_send+0x80/0x110 [btrfs] [<0>] btrfs_ioctl+0x14b8/0x3120 [btrfs] [<0>] do_vfs_ioctl+0xa1/0x750 [<0>] ksys_ioctl+0x70/0x80 [<0>] __x64_sys_ioctl+0x16/0x20 [<0>] do_syscall_64+0x56/0x240 [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe (gdb) l *(pipe_wait+0x72) 0x622 is in pipe_wait (fs/pipe.c:120). 115 * is considered a noninteractive wait: 116 */ 117 prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); 118 pipe_unlock(pipe); 119 schedule(); 120 finish_wait(&pipe->wait, &wait); 121 pipe_lock(pipe); 122 } 123 124 static void anon_pipe_buf_release(struct pipe_inode_info *pipe, ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 13:56 ` [PATCH 0/2] pipe: Fixes " David Sterba @ 2019-12-06 17:09 ` Linus Torvalds 2019-12-06 17:42 ` Linus Torvalds 2019-12-06 20:28 ` Linus Torvalds 2019-12-06 21:26 ` David Howells 2 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2019-12-06 17:09 UTC (permalink / raw) To: David Sterba, David Howells, Linus Torvalds, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Fri, Dec 6, 2019 at 5:56 AM David Sterba <dsterba@suse.cz> wrote: > > For reference, I've retested current master (b0d4beaa5a4b7d), that > incldes the 2 pipe fixes, the test still hangs. Yeah, they are the same fixes that you already tested, just slightly reworded (and with a WARN_ON actually removed, but you never hit that one). Does the btrfs test that fails actually require a btrfs filesystem? Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 17:09 ` Linus Torvalds @ 2019-12-06 17:42 ` Linus Torvalds 2019-12-06 18:59 ` Linus Torvalds 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2019-12-06 17:42 UTC (permalink / raw) To: David Sterba, David Howells, Linus Torvalds, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Fri, Dec 6, 2019 at 9:09 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Does the btrfs test that fails actually require a btrfs filesystem? Nope. DavidH, you can do this: git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git cd btrfs-progs ./autogen.sh make and it looks like that make TEST=016\* test-misc should work - all you need is btrfs and loop built as modules. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 17:42 ` Linus Torvalds @ 2019-12-06 18:59 ` Linus Torvalds 2019-12-07 21:31 ` Akemi Yagi 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2019-12-06 18:59 UTC (permalink / raw) To: David Sterba, David Howells, Linus Torvalds, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List Hmm. I think I just saw this same bug with a plain kernel compile. My "make -j32" suddenly came to a crawl, and seems to be entirely single-threaded. And that's almost certainly because the way 'make' handles load distribution is with a network of pipes that has a token passed to the sub-makes. So there's most definitely something wrong with the new pipe rework. Well, I can't _guarantee_ the pipes are the cause of this, but it does smell like it. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 18:59 ` Linus Torvalds @ 2019-12-07 21:31 ` Akemi Yagi 2019-12-08 16:45 ` Akemi Yagi 0 siblings, 1 reply; 41+ messages in thread From: Akemi Yagi @ 2019-12-07 21:31 UTC (permalink / raw) To: linux-kernel; +Cc: linux-fsdevel On 12/6/19 10:59 AM, Linus Torvalds wrote: > Hmm. I think I just saw this same bug with a plain kernel compile. > > My "make -j32" suddenly came to a crawl, and seems to be entirely > single-threaded. > > And that's almost certainly because the way 'make' handles load > distribution is with a network of pipes that has a token passed to the > sub-makes. > > So there's most definitely something wrong with the new pipe rework. > Well, I can't _guarantee_ the pipes are the cause of this, but it does > smell like it. > > Linus > We encountered what seems to be the same problem when building a kernel under RHEL 8. As it turned out, this was due to a bug in make (make-4.2.1-9.el8) and a patch is available. Details are found in this Red Hat bugzilla report: https://bugzilla.redhat.com/show_bug.cgi?id=1774790 Akemi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-07 21:31 ` Akemi Yagi @ 2019-12-08 16:45 ` Akemi Yagi 2019-12-08 18:04 ` Linus Torvalds 0 siblings, 1 reply; 41+ messages in thread From: Akemi Yagi @ 2019-12-08 16:45 UTC (permalink / raw) To: linux-kernel; +Cc: Linus Torvalds, David Howells, toracat On Sat, Dec 7, 2019 at 1:31 PM Akemi Yagi <toracat@elrepo.org> wrote: > > On 12/6/19 10:59 AM, Linus Torvalds wrote: > > Hmm. I think I just saw this same bug with a plain kernel compile. > > > > My "make -j32" suddenly came to a crawl, and seems to be entirely > > single-threaded. > > We encountered what seems to be the same problem when building a kernel > under RHEL 8. > > As it turned out, this was due to a bug in make (make-4.2.1-9.el8) and > a patch is available. Details are found in this Red Hat bugzilla report: > > https://bugzilla.redhat.com/show_bug.cgi?id=1774790 I forgot to mention that the aforementioned bug in make was originally reported for Fedora. They now have a patched version of make (make-4.2.1-15.fc32) in Rawhide. Akemi ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-08 16:45 ` Akemi Yagi @ 2019-12-08 18:04 ` Linus Torvalds 2019-12-09 3:07 ` Linus Torvalds 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2019-12-08 18:04 UTC (permalink / raw) To: Akemi Yagi; +Cc: Linux Kernel Mailing List, David Howells On Sun, Dec 8, 2019 at 8:46 AM Akemi Yagi <toracat@elrepo.org> wrote: > > I forgot to mention that the aforementioned bug in make was originally > reported for Fedora. They now have a patched version of make > (make-4.2.1-15.fc32) in Rawhide. Yes, as mentioned I do expect it's some jobserver interaction - I've seen bad jobserver behavior before. But my 'make' hasn't been upgraded since May, as far as I can tell from my logs, so the huge performance regression was new. So the huge slowdown is real, and was triggered by the kernel changes. But yes, it's entirely possible - even likely - that it's due to a jobserver bug that ends up being sensitive to scheduling. We do treat even "user space is buggy and shouldn't have done that" as a regression. It's annoying. Timing-related issues are some of the worst to debug. Particularly when (at least I) primarily debug by looking at the code and trying to see what's wrong that way. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-08 18:04 ` Linus Torvalds @ 2019-12-09 3:07 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-09 3:07 UTC (permalink / raw) To: Akemi Yagi, Peter Zijlstra, Ingo Molnar Cc: Linux Kernel Mailing List, David Howells [-- Attachment #1: Type: text/plain, Size: 1665 bytes --] On Sun, Dec 8, 2019 at 10:04 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sun, Dec 8, 2019 at 8:46 AM Akemi Yagi <toracat@elrepo.org> wrote: > > > > I forgot to mention that the aforementioned bug in make was originally > > reported for Fedora. They now have a patched version of make > > (make-4.2.1-15.fc32) in Rawhide. > > Yes, as mentioned I do expect it's some jobserver interaction - I've > seen bad jobserver behavior before. But my 'make' hasn't been upgraded > since May, as far as I can tell from my logs, so the huge performance > regression was new. Interestingly, when I'm trying to do the non-thundering-herd pipe reads and writes, a side effect of that is that the read target is "fair" (ie when there are multiple concurrent waiting read() calls, it's always the first one that gets it). And that seems to really mess up the jobserver bug, and triggers it every single time. Fairness is often bad for locking throughput, but this is on another level entirely, and yeah, I see a lot of "defunct" shell and make processes, so it does seem to be the make bug. (Adding scheduler people to the participants list, because this patch is the "avoid using the return value of wait_event_interruptible_exclusive()" version because I think wait_event_interruptible_exclusive() is itself buggy). I built a new version of 'make' from source, and that one seems to work fine with the attached patch. But the standard Fedora 'make' binary completely hates this patch. Sad. The patch does seem to be the RightThing(tm) to do, but this make bug makes it inadvisable to apply it unless you want to play with things. Linus [-- Attachment #2: patch.diff --] [-- Type: application/x-patch, Size: 9612 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 13:56 ` [PATCH 0/2] pipe: Fixes " David Sterba 2019-12-06 17:09 ` Linus Torvalds @ 2019-12-06 20:28 ` Linus Torvalds 2019-12-06 21:04 ` Linus Torvalds 2019-12-06 21:26 ` David Howells 2 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2019-12-06 20:28 UTC (permalink / raw) To: David Sterba, David Howells, Linus Torvalds, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Fri, Dec 6, 2019 at 5:56 AM David Sterba <dsterba@suse.cz> wrote: > > For reference, I've retested current master (b0d4beaa5a4b7d), that > incldes the 2 pipe fixes, the test still hangs. I think I found it. TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be completely wrong. And might not fix it. The first hunk is purely syntactic sugar - use the normal head/tail order. The second/third hunk is what I think fixes the problem: iter_file_splice_write() had the same buggy "let's cache head/tail/mask" pattern as pipe_write() had. You can't cache them over a 'pipe_wait()' that drops the pipe lock, and there's one in splice_from_pipe_next(). Linus --- snip snip -- diff --git a/fs/splice.c b/fs/splice.c index f2400ce7d528..fa1f3773c8cd 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -495,7 +495,7 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des unsigned int mask = pipe->ring_size - 1; int ret; - while (!pipe_empty(tail, head)) { + while (!pipe_empty(head, tail)) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; sd->len = buf->len; @@ -711,9 +711,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, splice_from_pipe_begin(&sd); while (sd.total_len) { struct iov_iter from; - unsigned int head = pipe->head; - unsigned int tail = pipe->tail; - unsigned int mask = pipe->ring_size - 1; + unsigned int head, tail, mask; size_t left; int n; @@ -732,6 +730,10 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, } } + head = pipe->head; + tail = pipe->tail; + mask = pipe->ring_size - 1; + /* build the vector */ left = sd.total_len; for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++, n++) { ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 20:28 ` Linus Torvalds @ 2019-12-06 21:04 ` Linus Torvalds 2019-12-07 3:50 ` Linus Torvalds 2019-12-09 14:55 ` David Sterba 0 siblings, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-06 21:04 UTC (permalink / raw) To: David Sterba, David Howells, Linus Torvalds, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Fri, Dec 6, 2019 at 12:28 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I think I found it. > > TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be > completely wrong. And might not fix it. Well, it compiles, and it's obviously correct, so I've committed it. It doesn't fix my "kernel compiles go single-threaded" issue. Which is not surprising - make doesn't use splice(), it just reads and writes single characters (the main make server writes a "+" character for each available job, and parallel sub-makes will read one for each job they start and write one when done - or something very close to that). I think that is related to the other pipe changes, though - there were some wakeup changes in there too. The btrfs problem was bisected to the original commit, which is what I think my patch fixes. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 21:04 ` Linus Torvalds @ 2019-12-07 3:50 ` Linus Torvalds 2019-12-07 4:01 ` Linus Torvalds ` (2 more replies) 2019-12-09 14:55 ` David Sterba 1 sibling, 3 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-07 3:50 UTC (permalink / raw) To: David Sterba, David Howells, Linus Torvalds, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Fri, Dec 6, 2019 at 1:04 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It doesn't fix my "kernel compiles go single-threaded" issue. I ended up just bisecting it. The "make goes slow" problem bisects down to b667b8673443 ("pipe: Advance tail pointer inside of wait spinlock in pipe_read()"). I assume it's the added "do_wakeup = 0" (not the spinlock) that ends up having some subtle issue. Very strange, but there it is. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-07 3:50 ` Linus Torvalds @ 2019-12-07 4:01 ` Linus Torvalds 2019-12-07 22:47 ` Linus Torvalds 2019-12-11 20:55 ` David Howells 2 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-07 4:01 UTC (permalink / raw) To: David Sterba, David Howells, Linus Torvalds, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I assume it's the added "do_wakeup = 0" (not the spinlock) that ends > up having some subtle issue. Ahh, and then later that is removed, but when it is removed it also remote the wakeup before the pipe_wait(). So whatever issue that commit introduces ends up remaining. I wonder if the extra wakeups ended up hiding some other bug. We do extra wakeups on the write side too, with a "FIXME! Is this really true?" comment.. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-07 3:50 ` Linus Torvalds 2019-12-07 4:01 ` Linus Torvalds @ 2019-12-07 22:47 ` Linus Torvalds 2019-12-09 9:53 ` Vincent Guittot 2019-12-12 10:18 ` Konstantin Khlebnikov 2019-12-11 20:55 ` David Howells 2 siblings, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-07 22:47 UTC (permalink / raw) To: David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The "make goes slow" problem bisects down to b667b8673443 ("pipe: > Advance tail pointer inside of wait spinlock in pipe_read()"). I'm not entirely sure that ends up being 100% true. It did bisect to that, but the behavior wasn't entirely stable. There definitely is some nasty timing trigger. But I did finally figure out what seems to have been going on with at least the biggest part of the build performance regression. It's seems to be a nasty interaction with the scheduler and the GNU make jobserver, and in particular the pipe wakeups really _really_ do seem to want to be synchronous both for the readers and the writers. When a writer wakes up a reader, we want the reader to react quickly and vice versa. The most obvious case was for the GNU make jobserver, where sub-makes would do a single-byte write to the jobserver pipe, and we want to wake up the reader *immediatly*, because the reader is actually a lot more important than the writer. The reader is what gets the next job going, the writer just got done with the last one. And when a reader empties a full pipe, it's because the writer is generating data, and you want to just get the writer going again asap. Anyway, I've spent way too much time looking at this and wondering about odd performance patterns. It seems to be mostly back up to normal. I say "mostly", because I still see times of "not as many concurrent compiles going as I'd expect". It might be a kbuild problem, it might be an issue with GNU make (I've seen problems with the make jobserver wanting many more tokens than expected before and the kernel makefiles - it migth be about deep subdirectories etc), and it might be some remaining pipe issue. But my allmodconfig builds aren't _enormously_ slower than they used to be. But there's definitely some unhappy interaction with the jobserver. I have 16 threads (8 cores with HT), and I generally use "make -j32" to keep them busy because the jobserver isn't great. The pipe rework made even that 2x slop not work all that well. Something held on to tokens too long, and there was definitely some interaction with the pipe wakeup code. Using "-j64" hid the problem, but it was a problem. It might be the new scheduler balancing changes that are interacting with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc, because I hadn't realized just how important the sync wakeup seems to be for pipe performance even at a big level. I've pushed out my pipe changes. I really didn't want to do that kind of stuff at the end of the merge window, but I spent a lot more time than I wanted looking at this code, because I was getting to the point where the alternative was to just revert it all. DavidH, give these a look: 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic f467a6a66419 pipe: fix and clarify pipe read wakeup logic 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic ad910e36da4c pipe: fix poll/select race introduced by the pipe rework the top two of which are purely "I'm fed up looking at this code, this needs to go" kind of changes. In particular, that last change is because I think the GNU jobserver problem is partly a thundering herd issue: when a job token becomes free (ie somebody does a one-byte write to an empty jobserver pipe), it wakes up *everybody* who is waiting for a token. One of them will get it, and the others will go to sleep again. And then it repeats all over. I didn't fix it, but it _could_ be fixed with exclusive waits for readers/writers, but that means more smarts than pipe_wait() can do. And because the jobserver isn't great at keeping everybody happy, I'm using a much bigger "make -jX" value than the number of CPU's I have, which makes the herd bigger. And I suspect none of this helps the scheduler pick the _right_ process to run, which just makes scheduling an even bigger problem. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-07 22:47 ` Linus Torvalds @ 2019-12-09 9:53 ` Vincent Guittot 2019-12-09 17:48 ` Linus Torvalds 2019-12-12 10:18 ` Konstantin Khlebnikov 1 sibling, 1 reply; 41+ messages in thread From: Vincent Guittot @ 2019-12-09 9:53 UTC (permalink / raw) To: Linus Torvalds Cc: David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Sat, 7 Dec 2019 at 23:48, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The "make goes slow" problem bisects down to b667b8673443 ("pipe: > > Advance tail pointer inside of wait spinlock in pipe_read()"). > > I'm not entirely sure that ends up being 100% true. It did bisect to > that, but the behavior wasn't entirely stable. There definitely is > some nasty timing trigger. > > But I did finally figure out what seems to have been going on with at > least the biggest part of the build performance regression. It's seems > to be a nasty interaction with the scheduler and the GNU make > jobserver, and in particular the pipe wakeups really _really_ do seem > to want to be synchronous both for the readers and the writers. > > When a writer wakes up a reader, we want the reader to react quickly > and vice versa. The most obvious case was for the GNU make jobserver, > where sub-makes would do a single-byte write to the jobserver pipe, > and we want to wake up the reader *immediatly*, because the reader is > actually a lot more important than the writer. The reader is what gets > the next job going, the writer just got done with the last one. > > And when a reader empties a full pipe, it's because the writer is > generating data, and you want to just get the writer going again asap. > > Anyway, I've spent way too much time looking at this and wondering > about odd performance patterns. It seems to be mostly back up to > normal. > > I say "mostly", because I still see times of "not as many concurrent > compiles going as I'd expect". It might be a kbuild problem, it might > be an issue with GNU make (I've seen problems with the make jobserver > wanting many more tokens than expected before and the kernel makefiles > - it migth be about deep subdirectories etc), and it might be some > remaining pipe issue. But my allmodconfig builds aren't _enormously_ > slower than they used to be. > > But there's definitely some unhappy interaction with the jobserver. I > have 16 threads (8 cores with HT), and I generally use "make -j32" to > keep them busy because the jobserver isn't great. The pipe rework made > even that 2x slop not work all that well. Something held on to tokens > too long, and there was definitely some interaction with the pipe > wakeup code. Using "-j64" hid the problem, but it was a problem. > > It might be the new scheduler balancing changes that are interacting > with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc, > because I hadn't realized just how important the sync wakeup seems to > be for pipe performance even at a big level. Which version of make should I use to reproduce the problem ? My setup is not the same and my make is a bit old but I haven't been able to reproduce the problem described above on my arm64 octa cores system and v5.5-rc1. All cores are busy with -j16. And even -j8 keeps the cores almost always busy > > I've pushed out my pipe changes. I really didn't want to do that kind > of stuff at the end of the merge window, but I spent a lot more time > than I wanted looking at this code, because I was getting to the point > where the alternative was to just revert it all. > > DavidH, give these a look: > > 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO > a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic > f467a6a66419 pipe: fix and clarify pipe read wakeup logic > 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic > ad910e36da4c pipe: fix poll/select race introduced by the pipe rework > > the top two of which are purely "I'm fed up looking at this code, this > needs to go" kind of changes. > > In particular, that last change is because I think the GNU jobserver > problem is partly a thundering herd issue: when a job token becomes > free (ie somebody does a one-byte write to an empty jobserver pipe), > it wakes up *everybody* who is waiting for a token. One of them will > get it, and the others will go to sleep again. And then it repeats all > over. I didn't fix it, but it _could_ be fixed with exclusive waits > for readers/writers, but that means more smarts than pipe_wait() can > do. And because the jobserver isn't great at keeping everybody happy, > I'm using a much bigger "make -jX" value than the number of CPU's I > have, which makes the herd bigger. And I suspect none of this helps > the scheduler pick the _right_ process to run, which just makes > scheduling an even bigger problem. > > Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 9:53 ` Vincent Guittot @ 2019-12-09 17:48 ` Linus Torvalds 2019-12-09 17:57 ` Akemi Yagi ` (3 more replies) 0 siblings, 4 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-09 17:48 UTC (permalink / raw) To: Vincent Guittot, DJ Delorie Cc: David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 3222 bytes --] [ Added DJ to the participants, since he seems to be the Fedora make maintainer - DJ, any chance that this absolutely horrid 'make' buf can be fixed in older versions too, not just rawhide? The bugfix is two and a half years old by now, and the bug looks real and very serious ] On Mon, Dec 9, 2019 at 1:54 AM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > Which version of make should I use to reproduce the problem ? So the problematic one is "make-4.2.1-13.fc30.x86_64" in Fedora 30. I'm assuming it's fairly plain 4.2.1, but I didn't try to look into the source rpm or anything like that. The working one for me was just the top of -git from https://git.savannah.gnu.org/git/make.git which is 4.2.92 right now. The fix is presumably commit b552b05 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs") as per Akemi. That is indeed after 4.2.1, and it looks real. Before that commit the buggy jobserver code basically does (1) use pselect() to wait for readable and see child deaths atomically (2) use blocking read to get the token and while (1) is atomic, if the child death happens between the two, it goes into the blocking read and has SIGCHLD blocked, so it will try to read the token from the token pipe, but it will never react to the child death - and the child death is what is going to _release_ a token. So what seems to happen is that when the right timing triggers, you end up with a lot of sub-makes waiting for a token, but they are also all supposed to _release_ a token. So you don't have enough tokens to go around. In the worst case, _everybody_ who has a token is also not releasing it, and then you end up triggering the timeout code (after one second), which will make things really go into a crawl. And by a crawl I mean that worst-case you really end up with just one job per second per sub-make. It will take _hours_ to compile the kernel at that speed, when it normally finishes in 15 minutes on my machine even when I do a from-scratch allmodconfig build. It does seem to be a major bug in the jobserver code. In particular with the trial fair and exclusive wakeup patch that I sent out in the other thread, it seems to be _reliably_ much worse and triggers 100% of the time for me. It's possible that my trial patch is buggy, but everything else looks fine, and with a fixed make the trial patch works for me. I'll include the trial patch here too, I think I cc'd you on the other thread too, but hey.. Anyway, it looks like the sync wakeup thing is more of a "get timing right by luck" thing than anything else. Possibly it actually causes the reverse order of reader wakeups more often (ie the most _recent_ reader is most likely to get woken up synchronously) and that may be what really ends up masking the jobserver problem, since apparently doing wakeups in the fair and proper order makes things much worse.. What a horrible pain that pipe rework ended up being. But I think we're in better shape now than we used to be, it just had very unfortunate timing issues and several real bugs. But sadly, there's no way I can push that fair pipe wakeup thing as long as this horribly buggy version of make is widespread. Linus [-- Attachment #2: patch.diff --] [-- Type: application/x-patch, Size: 9612 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 17:48 ` Linus Torvalds @ 2019-12-09 17:57 ` Akemi Yagi 2019-12-09 18:18 ` Linus Torvalds 2019-12-10 2:58 ` DJ Delorie ` (2 subsequent siblings) 3 siblings, 1 reply; 41+ messages in thread From: Akemi Yagi @ 2019-12-09 17:57 UTC (permalink / raw) To: Linus Torvalds Cc: Vincent Guittot, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Mon, Dec 9, 2019 at 9:49 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > [ Added DJ to the participants, since he seems to be the Fedora make > maintainer - DJ, any chance that this absolutely horrid 'make' buf can > be fixed in older versions too, not just rawhide? The bugfix is two > and a half years old by now, and the bug looks real and very serious ] (snip) > But sadly, there's no way I can push that fair pipe wakeup thing as > long as this horribly buggy version of make is widespread. > > Linus In addition to the Fedora make-4.2.1-4.fc27 (1) mentioned by Linus, RHEL 8 make-4.2.1-9.el8 (2) is affected. The patch applied to Fedora make (3) has been confirmed to fix the issue in RHEL's make. Those are the only real-world examples I know of. I have no idea how widespread this thing is... Akemi (1) https://bugzilla.redhat.com/show_bug.cgi?id=1556839 (2) https://bugzilla.redhat.com/show_bug.cgi?id=1774790 (3) https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 17:57 ` Akemi Yagi @ 2019-12-09 18:18 ` Linus Torvalds 2019-12-09 18:24 ` Linus Torvalds 2019-12-18 20:59 ` Josh Triplett 0 siblings, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-09 18:18 UTC (permalink / raw) To: Akemi Yagi Cc: Vincent Guittot, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Mon, Dec 9, 2019 at 9:57 AM Akemi Yagi <toracat@elrepo.org> wrote: > > In addition to the Fedora make-4.2.1-4.fc27 (1) mentioned by Linus, > RHEL 8 make-4.2.1-9.el8 (2) is affected. The patch applied to Fedora > make (3) has been confirmed to fix the issue in RHEL's make. > > Those are the only real-world examples I know of. I have no idea how > widespread this thing is... Looks like opensuse and ubuntu are also on 4.2.1 according to https://software.opensuse.org/package/make https://packages.ubuntu.com/cosmic/make so apparently the bug is almost universal with the big three sharing this buggy version. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 18:18 ` Linus Torvalds @ 2019-12-09 18:24 ` Linus Torvalds 2019-12-18 20:59 ` Josh Triplett 1 sibling, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-09 18:24 UTC (permalink / raw) To: Akemi Yagi Cc: Vincent Guittot, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Mon, Dec 9, 2019 at 10:18 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Looks like opensuse and ubuntu are also on 4.2.1 according to > > https://software.opensuse.org/package/make > https://packages.ubuntu.com/cosmic/make > > so apparently the bug is almost universal with the big three sharing > this buggy version. And the reason seems to be is that it's considered the latest "release" version. In the git tree, I see 4.2.92, but looking at https://ftp.gnu.org/gnu/make/ it looks like 4.2.1 is the latest actual "release". Oh well. I can't find a workaround for the bug, other than perhaps using "make -LX" instead of "make -jX". Which is not the same thing at all, of course. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 18:18 ` Linus Torvalds 2019-12-09 18:24 ` Linus Torvalds @ 2019-12-18 20:59 ` Josh Triplett 1 sibling, 0 replies; 41+ messages in thread From: Josh Triplett @ 2019-12-18 20:59 UTC (permalink / raw) To: Linus Torvalds Cc: Akemi Yagi, Vincent Guittot, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Mon, Dec 09, 2019 at 10:18:57AM -0800, Linus Torvalds wrote: > On Mon, Dec 9, 2019 at 9:57 AM Akemi Yagi <toracat@elrepo.org> wrote: > > > > In addition to the Fedora make-4.2.1-4.fc27 (1) mentioned by Linus, > > RHEL 8 make-4.2.1-9.el8 (2) is affected. The patch applied to Fedora > > make (3) has been confirmed to fix the issue in RHEL's make. > > > > Those are the only real-world examples I know of. I have no idea how > > widespread this thing is... > > Looks like opensuse and ubuntu are also on 4.2.1 according to > > https://software.opensuse.org/package/make > https://packages.ubuntu.com/cosmic/make > > so apparently the bug is almost universal with the big three sharing > this buggy version. Debian and Ubuntu have make 4.2.1-1.2, which includes "[SV 51159] Use a non-blocking read with pselect to avoid hangs." and various other fixes. https://metadata.ftp-master.debian.org/changelogs/main/m/make-dfsg/make-dfsg_4.2.1-1.2_changelog So, both Debian and Ubuntu should be fine with the pipe improvements. (I'm testing that now.) Is the version of your non-thundering-herd pipe wakeup patch attached to https://lore.kernel.org/lkml/CAHk-=wicgTacrHUJmSBbW9MYAdMPdrXzULPNqQ3G7+HkLeNf1Q@mail.gmail.com/ still the best version to test performance with? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 17:48 ` Linus Torvalds 2019-12-09 17:57 ` Akemi Yagi @ 2019-12-10 2:58 ` DJ Delorie 2019-12-10 14:38 ` Vincent Guittot 2019-12-11 18:09 ` DJ Delorie 3 siblings, 0 replies; 41+ messages in thread From: DJ Delorie @ 2019-12-10 2:58 UTC (permalink / raw) To: Linus Torvalds Cc: vincent.guittot, dsterba, dhowells, ebiggers, viro, linux-fsdevel, linux-kernel, peterz, mingo Linus Torvalds <torvalds@linux-foundation.org> writes: > [ Added DJ to the participants, since he seems to be the Fedora make > maintainer - DJ, any chance that this absolutely horrid 'make' buf can > be fixed in older versions too, not just rawhide? The bugfix is two > and a half years old by now, and the bug looks real and very serious ] I've got builds ready for F30 and F31 but my local testing shows no difference in kernel build times with/without. I was hoping someone would test the rawhide build[*] and say "yup, that fixes it" in case some other even older patch is also needed. F29 just went EOL so I can't fix that one. [*] https://koji.fedoraproject.org/koji/buildinfo?buildID=1420394 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 17:48 ` Linus Torvalds 2019-12-09 17:57 ` Akemi Yagi 2019-12-10 2:58 ` DJ Delorie @ 2019-12-10 14:38 ` Vincent Guittot 2019-12-10 17:39 ` Linus Torvalds 2019-12-11 18:09 ` DJ Delorie 3 siblings, 1 reply; 41+ messages in thread From: Vincent Guittot @ 2019-12-10 14:38 UTC (permalink / raw) To: Linus Torvalds Cc: DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Mon, 9 Dec 2019 at 18:48, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > [ Added DJ to the participants, since he seems to be the Fedora make > maintainer - DJ, any chance that this absolutely horrid 'make' buf can > be fixed in older versions too, not just rawhide? The bugfix is two > and a half years old by now, and the bug looks real and very serious ] > > On Mon, Dec 9, 2019 at 1:54 AM Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > Which version of make should I use to reproduce the problem ? > > So the problematic one is "make-4.2.1-13.fc30.x86_64" in Fedora 30. > I'm assuming it's fairly plain 4.2.1, but I didn't try to look into > the source rpm or anything like that. I'm using Debian buster and the make package is version: 4.2.1-1.2 for arm64. It doesn't have the commit you mentioned below but I don't see the problem on my platform and all 8 cpus are used with -j 16 or even -j 9 > > The working one for me was just the top of -git from > > https://git.savannah.gnu.org/git/make.git > > which is 4.2.92 right now. > > The fix is presumably commit b552b05 ("[SV 51159] Use a non-blocking > read with pselect to avoid hangs") as per Akemi. That is indeed after > 4.2.1, and it looks real. > > Before that commit the buggy jobserver code basically does > > (1) use pselect() to wait for readable and see child deaths atomically > (2) use blocking read to get the token > > and while (1) is atomic, if the child death happens between the two, > it goes into the blocking read and has SIGCHLD blocked, so it will try > to read the token from the token pipe, but it will never react to the > child death - and the child death is what is going to _release_ a > token. > > So what seems to happen is that when the right timing triggers, you That can explain why I can't see the problem on my platform > end up with a lot of sub-makes waiting for a token, but they are also > all supposed to _release_ a token. So you don't have enough tokens to > go around. In the worst case, _everybody_ who has a token is also not > releasing it, and then you end up triggering the timeout code (after > one second), which will make things really go into a crawl. > > And by a crawl I mean that worst-case you really end up with just one > job per second per sub-make. It will take _hours_ to compile the > kernel at that speed, when it normally finishes in 15 minutes on my > machine even when I do a from-scratch allmodconfig build. > > It does seem to be a major bug in the jobserver code. In particular > with the trial fair and exclusive wakeup patch that I sent out in the > other thread, it seems to be _reliably_ much worse and triggers 100% > of the time for me. > > It's possible that my trial patch is buggy, but everything else looks > fine, and with a fixed make the trial patch works for me. > > I'll include the trial patch here too, I think I cc'd you on the other > thread too, but hey.. > > Anyway, it looks like the sync wakeup thing is more of a "get timing > right by luck" thing than anything else. Possibly it actually causes > the reverse order of reader wakeups more often (ie the most _recent_ > reader is most likely to get woken up synchronously) and that may be > what really ends up masking the jobserver problem, since apparently > doing wakeups in the fair and proper order makes things much worse.. > > What a horrible pain that pipe rework ended up being. But I think > we're in better shape now than we used to be, it just had very > unfortunate timing issues and several real bugs. > > But sadly, there's no way I can push that fair pipe wakeup thing as > long as this horribly buggy version of make is widespread. > > Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-10 14:38 ` Vincent Guittot @ 2019-12-10 17:39 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-10 17:39 UTC (permalink / raw) To: Vincent Guittot Cc: DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Tue, Dec 10, 2019 at 6:38 AM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Mon, 9 Dec 2019 at 18:48, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Before that commit the buggy jobserver code basically does > > > > (1) use pselect() to wait for readable and see child deaths atomically > > (2) use blocking read to get the token > > > > and while (1) is atomic, if the child death happens between the two, > > it goes into the blocking read and has SIGCHLD blocked, so it will try > > to read the token from the token pipe, but it will never react to the > > child death - and the child death is what is going to _release_ a > > token. > > > > So what seems to happen is that when the right timing triggers, you > > That can explain why I can't see the problem on my platform Note that the above is kind of simplified. It actually needs a bit more to trigger.. To lose _one_ token, you need to have a sub-make basically hit this race: - the pselect() needs to say that the pipe is readable, so there is at least one token - another sub-make comes along and steals the very last token - but because pselect returned "readable" (no SIGCHLD yet), the read() starts and now blocks because all the jobserver tokens are gone again due to the other sub-make stealing the last one. - before a new token comes in, the child exits, and now because the sub-make is blocking for reads (and because the jobserver blocks SIGCHILD in general outside of the pselect), it doesn't react, so it won't release the token that the child holds. but notice how any _other_ sub-make then releasing a token will get things going again, so the _common_situation is that the jobserver bug only causes a slight dip in concurrency. Hitting it _once_ is trivial. Losing several tokens at once is also not that hard: you don't need to hit the race many times, it's enough to hit the exact same race just once - just with several sub-makes at the same time. And that "lose several tokens at once" isn't as unlikely as you'd think: they are all doing the same thing, and they all saw the free token with "pselect()", they all did a "read()". And since it's common for the tokens to be gone, the common case is that _one_ of the waiting sub-makes got the read, and the N other sub-makes did not, and went into the blocking read(). And they all had children that were about to finish, and finished before the next token got available. So losing quite a few tokens is easy. This has actually gone on for a long time, and I just never debugged it. My solution has been "I have 16 threads (8 core with HT), but I'll use -j32, and it is all good". I bet you can see it too - the buggy jobserver just means that the load isn't as high as you'd expect. Just run 'top' while the make is going. With the _fixed_ jobserver, if I do "make -j32", I will actually see a load that is over 32 (I do nothing but kernel compiles and occasional reboots during the merge window, so I have the kernel in my cache, so there's no IO, I have a fast SSD so writeback doesn't cause any delays either etc etc, and I have my browser and a few other things going). With the buggy one, even before the pipe rework, I would see a load that tended to fluctuate around 16. Because due to the bug you have a few locked-up tokens at times, so instead of getting a load of 32 when you use "make -j32", you get a load of maybe 20. Or maybe less. The pipe re-work made it much easier to trigger the "almost all the tokens are gone" for some reason. And the "fair readers()" patch I have seems to make it _really_ easy to trigger the case where absolutely all the tokens were gone and it goes into a single-thread mode. I'm not sure I really ever saw the 1-second timeout trigger, but it was slow. But it _is_ dependent on timing, so somebody else with a different load - or a different machine - might not see it to nearly the same degree. I bet you see the load value difference, though, even if you don't necessarily see enough idle CPU time to see much of a difference in compile times. After all, once you have all CPU's busy, it doesn't matter if you have a load of 16 or a load of 32 - the higher load won't make the compile go any faster. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-09 17:48 ` Linus Torvalds ` (2 preceding siblings ...) 2019-12-10 14:38 ` Vincent Guittot @ 2019-12-11 18:09 ` DJ Delorie 2019-12-11 18:59 ` Linus Torvalds 3 siblings, 1 reply; 41+ messages in thread From: DJ Delorie @ 2019-12-11 18:09 UTC (permalink / raw) To: Linus Torvalds Cc: vincent.guittot, dsterba, dhowells, ebiggers, viro, linux-fsdevel, linux-kernel, peterz, mingo Linus Torvalds <torvalds@linux-foundation.org> writes: > [ Added DJ to the participants, since he seems to be the Fedora make > maintainer - DJ, any chance that this absolutely horrid 'make' buf can > be fixed in older versions too, not just rawhide? The bugfix is two > and a half years old by now, and the bug looks real and very serious ] Builds for F30 and F31 are in bodhi, waiting on testing and karma... https://bodhi.fedoraproject.org/updates/FEDORA-2019-bd81ed62bf https://bodhi.fedoraproject.org/updates/FEDORA-2019-a056aa61d4 F29 is already EOL. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-11 18:09 ` DJ Delorie @ 2019-12-11 18:59 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-11 18:59 UTC (permalink / raw) To: DJ Delorie Cc: Vincent Guittot, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar On Wed, Dec 11, 2019 at 10:09 AM DJ Delorie <dj@redhat.com> wrote: > > Builds for F30 and F31 are in bodhi, waiting on testing and karma... Thanks DJ, Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-07 22:47 ` Linus Torvalds 2019-12-09 9:53 ` Vincent Guittot @ 2019-12-12 10:18 ` Konstantin Khlebnikov 2019-12-18 22:51 ` Linus Torvalds 1 sibling, 1 reply; 41+ messages in thread From: Konstantin Khlebnikov @ 2019-12-12 10:18 UTC (permalink / raw) To: Linus Torvalds, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot On 08/12/2019 01.47, Linus Torvalds wrote: > On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> The "make goes slow" problem bisects down to b667b8673443 ("pipe: >> Advance tail pointer inside of wait spinlock in pipe_read()"). > > I'm not entirely sure that ends up being 100% true. It did bisect to > that, but the behavior wasn't entirely stable. There definitely is > some nasty timing trigger. > > But I did finally figure out what seems to have been going on with at > least the biggest part of the build performance regression. It's seems > to be a nasty interaction with the scheduler and the GNU make > jobserver, and in particular the pipe wakeups really _really_ do seem > to want to be synchronous both for the readers and the writers. > > When a writer wakes up a reader, we want the reader to react quickly > and vice versa. The most obvious case was for the GNU make jobserver, > where sub-makes would do a single-byte write to the jobserver pipe, > and we want to wake up the reader *immediatly*, because the reader is > actually a lot more important than the writer. The reader is what gets > the next job going, the writer just got done with the last one. > > And when a reader empties a full pipe, it's because the writer is > generating data, and you want to just get the writer going again asap. > > Anyway, I've spent way too much time looking at this and wondering > about odd performance patterns. It seems to be mostly back up to > normal. > > I say "mostly", because I still see times of "not as many concurrent > compiles going as I'd expect". It might be a kbuild problem, it might > be an issue with GNU make (I've seen problems with the make jobserver > wanting many more tokens than expected before and the kernel makefiles > - it migth be about deep subdirectories etc), and it might be some > remaining pipe issue. But my allmodconfig builds aren't _enormously_ > slower than they used to be. > > But there's definitely some unhappy interaction with the jobserver. I > have 16 threads (8 cores with HT), and I generally use "make -j32" to > keep them busy because the jobserver isn't great. The pipe rework made > even that 2x slop not work all that well. Something held on to tokens > too long, and there was definitely some interaction with the pipe > wakeup code. Using "-j64" hid the problem, but it was a problem. > > It might be the new scheduler balancing changes that are interacting > with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc, > because I hadn't realized just how important the sync wakeup seems to > be for pipe performance even at a big level. > > I've pushed out my pipe changes. I really didn't want to do that kind > of stuff at the end of the merge window, but I spent a lot more time > than I wanted looking at this code, because I was getting to the point > where the alternative was to just revert it all. > > DavidH, give these a look: > b> 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO > a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic > f467a6a66419 pipe: fix and clarify pipe read wakeup logic > 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic > ad910e36da4c pipe: fix poll/select race introduced by the pipe rework commit f467a6a66419 pipe: fix and clarify pipe read wakeup logic killed "wake writer when buffer becomes half empty" part added by commit cefa80ced57a ("pipe: Increase the writer-wakeup threshold to reduce context-switch count"). I suppose that was unintentional. Jobserver juggles with few bytes and should never reach half/full buffer thresholds. Also reader should wake writer with sync wakeup only if buffer is empty. Otherwise sync wakeup adds couple unneeded context switches. > > the top two of which are purely "I'm fed up looking at this code, this > needs to go" kind of changes. > > In particular, that last change is because I think the GNU jobserver > problem is partly a thundering herd issue: when a job token becomes > free (ie somebody does a one-byte write to an empty jobserver pipe), > it wakes up *everybody* who is waiting for a token. One of them will > get it, and the others will go to sleep again. And then it repeats all > over. I didn't fix it, but it _could_ be fixed with exclusive waits > for readers/writers, but that means more smarts than pipe_wait() can > do. And because the jobserver isn't great at keeping everybody happy, > I'm using a much bigger "make -jX" value than the number of CPU's I > have, which makes the herd bigger. And I suspect none of this helps > the scheduler pick the _right_ process to run, which just makes > scheduling an even bigger problem. > > Linus > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-12 10:18 ` Konstantin Khlebnikov @ 2019-12-18 22:51 ` Linus Torvalds 2019-12-19 0:03 ` Josh Triplett 0 siblings, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2019-12-18 22:51 UTC (permalink / raw) To: Konstantin Khlebnikov, Josh Triplett, Akemi Yagi, DJ Delorie Cc: David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot On Thu, Dec 12, 2019 at 2:18 AM Konstantin Khlebnikov <khlebnikov@yandex-team.ru> wrote: > > commit f467a6a66419 pipe: fix and clarify pipe read wakeup logic > killed "wake writer when buffer becomes half empty" part added by > commit cefa80ced57a ("pipe: Increase the writer-wakeup threshold to reduce context-switch count"). > > I suppose that was unintentional. Jobserver juggles with few bytes and > should never reach half/full buffer thresholds. It wasn'tunintentional - the rest of the cleanups should mean that we never wake things up unnecessarily - we now only wake up if it _used_ to be 100% full, and we only do it once per read. And it did it without the watermark, which I was worried about might break something that "knew" the size of the pipe. But performance testing would be good. Both for the "no unnecessary wakeups" case, but also for the thundering herd issue. To answer Josh's email in this same thread: On Wed, Dec 18, 2019 at 12:59 PM Josh Triplett <josh@joshtriplett.org> wrote: > > Debian and Ubuntu have make 4.2.1-1.2, which includes "[SV 51159] Use a > non-blocking read with pselect to avoid hangs." and various other fixes. > https://metadata.ftp-master.debian.org/changelogs/main/m/make-dfsg/make-dfsg_4.2.1-1.2_changelog > So, both Debian and Ubuntu should be fine with the pipe improvements. > (I'm testing that now.) > > Is the version of your non-thundering-herd pipe wakeup patch attached to > https://lore.kernel.org/lkml/CAHk-=wicgTacrHUJmSBbW9MYAdMPdrXzULPNqQ3G7+HkLeNf1Q@mail.gmail.com/ > still the best version to test performance with? That's my latest version, but you'll have to tweak it a tiny bit because of d1c6a2aa02af ("pipe: simplify signal handling in pipe_read() and add comments") which I did after that patch. The easiest way to resolve it is likely to revert that d1c6a2aa02af, then apply the non-thundering-herd patch and then apply d1c6a2aa02af again by hand - it's fairly straightforward (and you can return -ERESTARTSYS directly if wait_event_interruptible_exclusive() fails, because of all the same reasons why it coul dhappen without the thundering-herd patch. I can look at re-creating that patch if you find it to be too annoying. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-18 22:51 ` Linus Torvalds @ 2019-12-19 0:03 ` Josh Triplett 2019-12-19 0:14 ` Josh Triplett 0 siblings, 1 reply; 41+ messages in thread From: Josh Triplett @ 2019-12-19 0:03 UTC (permalink / raw) To: Linus Torvalds Cc: Konstantin Khlebnikov, Akemi Yagi, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot [-- Attachment #1: Type: text/plain, Size: 1820 bytes --] On Wed, Dec 18, 2019 at 02:51:27PM -0800, Linus Torvalds wrote: > On Wed, Dec 18, 2019 at 12:59 PM Josh Triplett <josh@joshtriplett.org> wrote: > > Debian and Ubuntu have make 4.2.1-1.2, which includes "[SV 51159] Use a > > non-blocking read with pselect to avoid hangs." and various other fixes. > > https://metadata.ftp-master.debian.org/changelogs/main/m/make-dfsg/make-dfsg_4.2.1-1.2_changelog > > So, both Debian and Ubuntu should be fine with the pipe improvements. > > (I'm testing that now.) I've now tested this, and 4.2.1-1.2 doesn't seem to have the same problem that affects upstream 4.2.1. With Debian and Ubuntu already fixed, and Fedora now in the process of getting fixed, that should cover many users who would have the buggy make. > > Is the version of your non-thundering-herd pipe wakeup patch attached to > > https://lore.kernel.org/lkml/CAHk-=wicgTacrHUJmSBbW9MYAdMPdrXzULPNqQ3G7+HkLeNf1Q@mail.gmail.com/ > > still the best version to test performance with? > > That's my latest version, but you'll have to tweak it a tiny bit > because of d1c6a2aa02af ("pipe: simplify signal handling in > pipe_read() and add comments") which I did after that patch. That's what I encountered, and I ended up manually fixing it up, resulting in the attached patch. Does that look reasonable? > The easiest way to resolve it is likely to revert that d1c6a2aa02af, > then apply the non-thundering-herd patch and then apply d1c6a2aa02af > again by hand - it's fairly straightforward (and you can return > -ERESTARTSYS directly if wait_event_interruptible_exclusive() fails, > because of all the same reasons why it coul dhappen without the > thundering-herd patch. I can try adding that additional return -ERESTARTSYS. Does that seem likely to have a noticeable additional performance impact? - Josh Triplett [-- Attachment #2: linux-pipe-fix.patch --] [-- Type: text/x-diff, Size: 9612 bytes --] fs/coredump.c | 4 +-- fs/pipe.c | 67 +++++++++++++++++++++++++++++++---------------- fs/splice.c | 8 +++--- include/linux/pipe_fs_i.h | 2 +- 4 files changed, 51 insertions(+), 30 deletions(-) diff --git a/fs/coredump.c b/fs/coredump.c index b1ea7dfbd149..f8296a82d01d 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file) pipe_lock(pipe); pipe->readers++; pipe->writers--; - wake_up_interruptible_sync(&pipe->wait); + wake_up_interruptible_sync(&pipe->rd_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); pipe_unlock(pipe); @@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file) * We actually want wait_event_freezable() but then we need * to clear TIF_SIGPENDING and improve dump_interrupted(). */ - wait_event_interruptible(pipe->wait, pipe->readers == 1); + wait_event_interruptible(pipe->rd_wait, pipe->readers == 1); pipe_lock(pipe); pipe->readers--; diff --git a/fs/pipe.c b/fs/pipe.c index 87109e761fa5..1eb64cab63b5 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, /* Drop the inode semaphore and wait for a pipe event, atomically */ void pipe_wait(struct pipe_inode_info *pipe) { - DEFINE_WAIT(wait); + DEFINE_WAIT(rdwait); + DEFINE_WAIT(wrwait); /* * Pipes are system-local resources, so sleeping on them * is considered a noninteractive wait: */ - prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); + prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE); + prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE); pipe_unlock(pipe); schedule(); - finish_wait(&pipe->wait, &wait); + finish_wait(&pipe->rd_wait, &rdwait); + finish_wait(&pipe->wr_wait, &wrwait); pipe_lock(pipe); } @@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) size_t total_len = iov_iter_count(to); struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; - bool was_full; + bool was_full, wake_next_reader = false; ssize_t ret; /* Null read succeeds. */ @@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (!buf->len) { pipe_buf_release(pipe, buf); - spin_lock_irq(&pipe->wait.lock); + spin_lock_irq(&pipe->rd_wait.lock); tail++; pipe->tail = tail; - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); } total_len -= chars; if (!total_len) @@ -371,19 +374,24 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) } __pipe_unlock(pipe); if (was_full) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } - wait_event_interruptible(pipe->wait, pipe_readable(pipe)); + wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)); __pipe_lock(pipe); was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); + wake_next_reader = true; } + if (pipe_empty(pipe->head, pipe->tail)) + wake_next_reader = false; __pipe_unlock(pipe); if (was_full) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } + if (wake_next_reader) + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); if (ret > 0) file_accessed(filp); return ret; @@ -415,6 +423,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) size_t total_len = iov_iter_count(from); ssize_t chars; bool was_empty = false; + bool wake_next_writer = false; /* Null write succeeds. */ if (unlikely(total_len == 0)) @@ -493,16 +502,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * it, either the reader will consume it or it'll still * be there for the next write. */ - spin_lock_irq(&pipe->wait.lock); + spin_lock_irq(&pipe->rd_wait.lock); head = pipe->head; if (pipe_full(head, pipe->tail, pipe->max_usage)) { - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); continue; } pipe->head = head + 1; - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); /* Insert it into the buffer array */ buf = &pipe->bufs[head & mask]; @@ -554,14 +563,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) */ __pipe_unlock(pipe); if (was_empty) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } - wait_event_interruptible(pipe->wait, pipe_writable(pipe)); + wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); __pipe_lock(pipe); was_empty = pipe_empty(head, pipe->tail); + wake_next_writer = true; } out: + if (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) + wake_next_writer = false; __pipe_unlock(pipe); /* @@ -574,9 +586,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * wake up pending jobs */ if (was_empty) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } + if (wake_next_writer) + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) { int err = file_update_time(filp); if (err) @@ -620,12 +634,15 @@ pipe_poll(struct file *filp, poll_table *wait) unsigned int head, tail; /* - * Reading only -- no need for acquiring the semaphore. + * Reading pipe state only -- no need for acquiring the semaphore. * * But because this is racy, the code has to add the * entry to the poll table _first_ .. */ - poll_wait(filp, &pipe->wait, wait); + if (filp->f_mode & FMODE_READ) + poll_wait(filp, &pipe->rd_wait, wait); + if (filp->f_mode & FMODE_WRITE) + poll_wait(filp, &pipe->wr_wait, wait); /* * .. and only then can you do the racy tests. That way, @@ -684,7 +701,8 @@ pipe_release(struct inode *inode, struct file *file) pipe->writers--; if (pipe->readers || pipe->writers) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -767,7 +785,8 @@ struct pipe_inode_info *alloc_pipe_info(void) GFP_KERNEL_ACCOUNT); if (pipe->bufs) { - init_waitqueue_head(&pipe->wait); + init_waitqueue_head(&pipe->rd_wait); + init_waitqueue_head(&pipe->wr_wait); pipe->r_counter = pipe->w_counter = 1; pipe->max_usage = pipe_bufs; pipe->ring_size = pipe_bufs; @@ -985,7 +1004,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) static void wake_up_partner(struct pipe_inode_info *pipe) { - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible(&pipe->wr_wait); } static int fifo_open(struct inode *inode, struct file *filp) @@ -1096,13 +1116,13 @@ static int fifo_open(struct inode *inode, struct file *filp) err_rd: if (!--pipe->readers) - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->wr_wait); ret = -ERESTARTSYS; goto err; err_wr: if (!--pipe->writers) - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->rd_wait); ret = -ERESTARTSYS; goto err; @@ -1229,7 +1249,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) pipe->max_usage = nr_slots; pipe->tail = tail; pipe->head = head; - wake_up_interruptible_all(&pipe->wait); + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); return pipe->max_usage * PAGE_SIZE; out_revert_acct: diff --git a/fs/splice.c b/fs/splice.c index 3009652a41c8..d671936d0aad 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = { static void wakeup_pipe_readers(struct pipe_inode_info *pipe) { smp_mb(); - if (waitqueue_active(&pipe->wait)) - wake_up_interruptible(&pipe->wait); + if (waitqueue_active(&pipe->rd_wait)) + wake_up_interruptible(&pipe->rd_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, static void wakeup_pipe_writers(struct pipe_inode_info *pipe) { smp_mb(); - if (waitqueue_active(&pipe->wait)) - wake_up_interruptible(&pipe->wait); + if (waitqueue_active(&pipe->wr_wait)) + wake_up_interruptible(&pipe->wr_wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index dbcfa6892384..d5765039652a 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -47,7 +47,7 @@ struct pipe_buffer { **/ struct pipe_inode_info { struct mutex mutex; - wait_queue_head_t wait; + wait_queue_head_t rd_wait, wr_wait; unsigned int head; unsigned int tail; unsigned int max_usage; ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-19 0:03 ` Josh Triplett @ 2019-12-19 0:14 ` Josh Triplett 2019-12-19 0:51 ` Linus Torvalds 2019-12-19 7:56 ` David Howells 0 siblings, 2 replies; 41+ messages in thread From: Josh Triplett @ 2019-12-19 0:14 UTC (permalink / raw) To: Linus Torvalds Cc: Konstantin Khlebnikov, Akemi Yagi, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot [-- Attachment #1: Type: text/plain, Size: 540 bytes --] On Wed, Dec 18, 2019 at 04:03:18PM -0800, Josh Triplett wrote: > On Wed, Dec 18, 2019 at 02:51:27PM -0800, Linus Torvalds wrote: > > That's my latest version, but you'll have to tweak it a tiny bit > > because of d1c6a2aa02af ("pipe: simplify signal handling in > > pipe_read() and add comments") which I did after that patch. > > That's what I encountered, and I ended up manually fixing it up, > resulting in the attached patch. Does that look reasonable? Er, wrong file. That's the original patch; the attached patch is the right one. [-- Attachment #2: linux-pipe-fix.patch --] [-- Type: text/x-diff, Size: 9621 bytes --] diff --git a/fs/coredump.c b/fs/coredump.c index b1ea7dfbd149..f8296a82d01d 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file) pipe_lock(pipe); pipe->readers++; pipe->writers--; - wake_up_interruptible_sync(&pipe->wait); + wake_up_interruptible_sync(&pipe->rd_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); pipe_unlock(pipe); @@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file) * We actually want wait_event_freezable() but then we need * to clear TIF_SIGPENDING and improve dump_interrupted(). */ - wait_event_interruptible(pipe->wait, pipe->readers == 1); + wait_event_interruptible(pipe->rd_wait, pipe->readers == 1); pipe_lock(pipe); pipe->readers--; diff --git a/fs/pipe.c b/fs/pipe.c index 04d004ee2e8c..5ca6446ca2b1 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1, /* Drop the inode semaphore and wait for a pipe event, atomically */ void pipe_wait(struct pipe_inode_info *pipe) { - DEFINE_WAIT(wait); + DEFINE_WAIT(rdwait); + DEFINE_WAIT(wrwait); /* * Pipes are system-local resources, so sleeping on them * is considered a noninteractive wait: */ - prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE); + prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE); + prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE); pipe_unlock(pipe); schedule(); - finish_wait(&pipe->wait, &wait); + finish_wait(&pipe->rd_wait, &rdwait); + finish_wait(&pipe->wr_wait, &wrwait); pipe_lock(pipe); } @@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) size_t total_len = iov_iter_count(to); struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; - bool was_full; + bool was_full, wake_next_reader = false; ssize_t ret; /* Null read succeeds. */ @@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (!buf->len) { pipe_buf_release(pipe, buf); - spin_lock_irq(&pipe->wait.lock); + spin_lock_irq(&pipe->rd_wait.lock); tail++; pipe->tail = tail; - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); } total_len -= chars; if (!total_len) @@ -384,7 +387,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) * no data. */ if (unlikely(was_full)) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -394,18 +397,23 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) * since we've done any required wakeups and there's no need * to mark anything accessed. And we've dropped the lock. */ - if (wait_event_interruptible(pipe->wait, pipe_readable(pipe)) < 0) + if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) return -ERESTARTSYS; __pipe_lock(pipe); was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); + wake_next_reader = true; } + if (pipe_empty(pipe->head, pipe->tail)) + wake_next_reader = false; __pipe_unlock(pipe); if (was_full) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } + if (wake_next_reader) + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); if (ret > 0) file_accessed(filp); return ret; @@ -437,6 +445,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) size_t total_len = iov_iter_count(from); ssize_t chars; bool was_empty = false; + bool wake_next_writer = false; /* Null write succeeds. */ if (unlikely(total_len == 0)) @@ -515,16 +524,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * it, either the reader will consume it or it'll still * be there for the next write. */ - spin_lock_irq(&pipe->wait.lock); + spin_lock_irq(&pipe->rd_wait.lock); head = pipe->head; if (pipe_full(head, pipe->tail, pipe->max_usage)) { - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); continue; } pipe->head = head + 1; - spin_unlock_irq(&pipe->wait.lock); + spin_unlock_irq(&pipe->rd_wait.lock); /* Insert it into the buffer array */ buf = &pipe->bufs[head & mask]; @@ -576,14 +585,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) */ __pipe_unlock(pipe); if (was_empty) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } - wait_event_interruptible(pipe->wait, pipe_writable(pipe)); + wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); __pipe_lock(pipe); was_empty = pipe_empty(head, pipe->tail); + wake_next_writer = true; } out: + if (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) + wake_next_writer = false; __pipe_unlock(pipe); /* @@ -596,9 +608,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) * wake up pending jobs */ if (was_empty) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } + if (wake_next_writer) + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) { int err = file_update_time(filp); if (err) @@ -642,12 +656,15 @@ pipe_poll(struct file *filp, poll_table *wait) unsigned int head, tail; /* - * Reading only -- no need for acquiring the semaphore. + * Reading pipe state only -- no need for acquiring the semaphore. * * But because this is racy, the code has to add the * entry to the poll table _first_ .. */ - poll_wait(filp, &pipe->wait, wait); + if (filp->f_mode & FMODE_READ) + poll_wait(filp, &pipe->rd_wait, wait); + if (filp->f_mode & FMODE_WRITE) + poll_wait(filp, &pipe->wr_wait, wait); /* * .. and only then can you do the racy tests. That way, @@ -706,7 +723,8 @@ pipe_release(struct inode *inode, struct file *file) pipe->writers--; if (pipe->readers || pipe->writers) { - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP); + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP); + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -789,7 +807,8 @@ struct pipe_inode_info *alloc_pipe_info(void) GFP_KERNEL_ACCOUNT); if (pipe->bufs) { - init_waitqueue_head(&pipe->wait); + init_waitqueue_head(&pipe->rd_wait); + init_waitqueue_head(&pipe->wr_wait); pipe->r_counter = pipe->w_counter = 1; pipe->max_usage = pipe_bufs; pipe->ring_size = pipe_bufs; @@ -1007,7 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt) static void wake_up_partner(struct pipe_inode_info *pipe) { - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->rd_wait); + wake_up_interruptible(&pipe->wr_wait); } static int fifo_open(struct inode *inode, struct file *filp) @@ -1118,13 +1138,13 @@ static int fifo_open(struct inode *inode, struct file *filp) err_rd: if (!--pipe->readers) - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->wr_wait); ret = -ERESTARTSYS; goto err; err_wr: if (!--pipe->writers) - wake_up_interruptible(&pipe->wait); + wake_up_interruptible(&pipe->rd_wait); ret = -ERESTARTSYS; goto err; @@ -1251,7 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) pipe->max_usage = nr_slots; pipe->tail = tail; pipe->head = head; - wake_up_interruptible_all(&pipe->wait); + wake_up_interruptible_all(&pipe->rd_wait); + wake_up_interruptible_all(&pipe->wr_wait); return pipe->max_usage * PAGE_SIZE; out_revert_acct: diff --git a/fs/splice.c b/fs/splice.c index 3009652a41c8..d671936d0aad 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = { static void wakeup_pipe_readers(struct pipe_inode_info *pipe) { smp_mb(); - if (waitqueue_active(&pipe->wait)) - wake_up_interruptible(&pipe->wait); + if (waitqueue_active(&pipe->rd_wait)) + wake_up_interruptible(&pipe->rd_wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, static void wakeup_pipe_writers(struct pipe_inode_info *pipe) { smp_mb(); - if (waitqueue_active(&pipe->wait)) - wake_up_interruptible(&pipe->wait); + if (waitqueue_active(&pipe->wr_wait)) + wake_up_interruptible(&pipe->wr_wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index dbcfa6892384..d5765039652a 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -47,7 +47,7 @@ struct pipe_buffer { **/ struct pipe_inode_info { struct mutex mutex; - wait_queue_head_t wait; + wait_queue_head_t rd_wait, wr_wait; unsigned int head; unsigned int tail; unsigned int max_usage; ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-19 0:14 ` Josh Triplett @ 2019-12-19 0:51 ` Linus Torvalds 2019-12-19 0:54 ` Linus Torvalds 2019-12-19 7:56 ` David Howells 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2019-12-19 0:51 UTC (permalink / raw) To: Josh Triplett Cc: Konstantin Khlebnikov, Akemi Yagi, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot On Wed, Dec 18, 2019 at 4:14 PM Josh Triplett <josh@joshtriplett.org> wrote: > > Er, wrong file. That's the original patch; the attached patch is the > right one. This looks correct to me. If I were to actually commit it, the "split into two waitqueues" would be a separate patch from the "use wait_event_interruptible_exclusive() and add "wake_next_reader/writer logic", but for testing purposes the unified patch was simpler, and your forward port looks good to me. I ran the original patch for a couple of days, and didn't see any other issues than the 'make' thing in F30. It was all good with my self-build make. But that "ran for a couple of days" wasn't all that stress-full. I did do the "verify that the thundering herd is gone" test - including that silly test-case here again: #include <unistd.h> int main(int argc, char **argv) { int fd[2], counters[2]; pipe(fd); counters[0] = 0; counters[1] = -1; write(fd[1], counters, sizeof(counters)); /* 64 processes */ fork(); fork(); fork(); fork(); fork(); fork(); do { int i; read(fd[0], &i, sizeof(i)); if (i < 0) continue; counters[0] = i+1; write(fd[1], counters, (1+(i & 1)) *sizeof(int)); } while (counters[0] < 1000000); return 0; } where you can tweak the numbers - add another fork() or two to create even more pipe waiters, and maybe change the final count exit value to match whatever hw performance you have. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-19 0:51 ` Linus Torvalds @ 2019-12-19 0:54 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-19 0:54 UTC (permalink / raw) To: Josh Triplett Cc: Konstantin Khlebnikov, Akemi Yagi, DJ Delorie, David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot On Wed, Dec 18, 2019 at 4:51 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > This looks correct to me. Note the "looks". I checked that it applied on current -git, and I read through the patch, and checked that it almost matched my original, but I didn't _test_ it in any way. Caveat emptor. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-19 0:14 ` Josh Triplett 2019-12-19 0:51 ` Linus Torvalds @ 2019-12-19 7:56 ` David Howells 2019-12-19 16:35 ` Linus Torvalds 1 sibling, 1 reply; 41+ messages in thread From: David Howells @ 2019-12-19 7:56 UTC (permalink / raw) To: Linus Torvalds Cc: dhowells, Josh Triplett, Konstantin Khlebnikov, Akemi Yagi, DJ Delorie, David Sterba, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot Linus Torvalds <torvalds@linux-foundation.org> wrote: > If I were to actually commit it, the "split into two waitqueues" would > be a separate patch from the "use wait_event_interruptible_exclusive() > and add "wake_next_reader/writer logic", but for testing purposes the > unified patch was simpler, and your forward port looks good to me. I looked at splitting the waitqueue in to two, but it makes poll tricky. David ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-19 7:56 ` David Howells @ 2019-12-19 16:35 ` Linus Torvalds 0 siblings, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-19 16:35 UTC (permalink / raw) To: David Howells Cc: Josh Triplett, Konstantin Khlebnikov, Akemi Yagi, DJ Delorie, David Sterba, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot On Wed, Dec 18, 2019 at 11:56 PM David Howells <dhowells@redhat.com> wrote: > > I looked at splitting the waitqueue in to two, but it makes poll tricky. No, it's actually trivial for poll. The thing is, poll can happily just add two wait-queues to the poll_table. In my conversion, I just did - poll_wait(filp, &pipe->wait, wait); + if (filp->f_mode & FMODE_READ) + poll_wait(filp, &pipe->rd_wait, wait); + if (filp->f_mode & FMODE_WRITE) + poll_wait(filp, &pipe->wr_wait, wait); which changes the unconditional "add one" to two conditional adds. They _could_ have been unconditional too, but why add unnecessary wakeups? So it only really does it twice on named pipes (if opened for reading and writing). It's _unusual_ to add two wait-queues for a single poll, but it's not wrong. The tty layer has always done it - exactly because it has separate wait queues for reading and writing. Some other drivers do too. Sometimes there's a separate wait queue for errors, sometimes there are multiple wait-queues because there are events from the "subsystem" and there are other events from the "device". I think sound does the latter, for example. And no, I don't particularly like the FMODE_READ/WRITE testing above - it would be better to pass in the polling mask and ask "are we waiting for polling for reading or writing?" instead of asking whether the file descriptor was opened for read or write, but hey, it is what it is. Sadly, "poll()" doesn't really get passed the bitmask of what is being waited on (it's there in the poll_tabvle "_key" field, but I don't want to have the pipe code look into those kinds of details. So the named pipe case could be improved, but it's not like anybody really cares. Nobody uses named pipes any more (and few people ever did). So I didn't worry about it. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-07 3:50 ` Linus Torvalds 2019-12-07 4:01 ` Linus Torvalds 2019-12-07 22:47 ` Linus Torvalds @ 2019-12-11 20:55 ` David Howells 2019-12-12 1:28 ` Linus Torvalds 2019-12-12 7:34 ` David Howells 2 siblings, 2 replies; 41+ messages in thread From: David Howells @ 2019-12-11 20:55 UTC (permalink / raw) To: Linus Torvalds Cc: dhowells, David Sterba, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot Linus Torvalds <torvalds@linux-foundation.org> wrote: > DavidH, give these a look: > > 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO > a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic > f467a6a66419 pipe: fix and clarify pipe read wakeup logic > 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic > ad910e36da4c pipe: fix poll/select race introduced by the pipe rework > > the top two of which are purely "I'm fed up looking at this code, this > needs to go" kind of changes. They look reasonable. Is it worth reverting: commit f94df9890e98f2090c6a8d70c795134863b70201 Add wake_up_interruptible_sync_poll_locked() since you changed the code that was calling that new function and so it's no longer called? David ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-11 20:55 ` David Howells @ 2019-12-12 1:28 ` Linus Torvalds 2019-12-12 7:34 ` David Howells 1 sibling, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2019-12-12 1:28 UTC (permalink / raw) To: David Howells Cc: David Sterba, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot On Wed, Dec 11, 2019 at 12:55 PM David Howells <dhowells@redhat.com> wrote: > > Is it worth reverting: > > commit f94df9890e98f2090c6a8d70c795134863b70201 > Add wake_up_interruptible_sync_poll_locked() > > since you changed the code that was calling that new function and so it's no > longer called? You are sure you won't want that for the notification queue cases? I guess they'll never want to "sync" part.. Looking at the regular pipe read/write code, maybe we'll want to try it again - do the wakeup while we already have the spinlock, rather than later. But I have this suspicion that that might just then push things into mutex contention, so who knows.. Regardless, it's not going to happen for 5.5, so I guess we could revert it and if we ever end up trying it again we can resurrect it. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-11 20:55 ` David Howells 2019-12-12 1:28 ` Linus Torvalds @ 2019-12-12 7:34 ` David Howells 1 sibling, 0 replies; 41+ messages in thread From: David Howells @ 2019-12-12 7:34 UTC (permalink / raw) To: Linus Torvalds Cc: dhowells, David Sterba, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar, Vincent Guittot Linus Torvalds <torvalds@linux-foundation.org> wrote: > You are sure you won't want that for the notification queue cases? I > guess they'll never want to "sync" part.. Actually, that's a good point, I do call it in post_one_notification() since I have to hold the lock anyway. But, as you say, I'm not sure whether I need the sync variant. David ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 21:04 ` Linus Torvalds 2019-12-07 3:50 ` Linus Torvalds @ 2019-12-09 14:55 ` David Sterba 1 sibling, 0 replies; 41+ messages in thread From: David Sterba @ 2019-12-09 14:55 UTC (permalink / raw) To: Linus Torvalds Cc: David Sterba, David Howells, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List On Fri, Dec 06, 2019 at 01:04:01PM -0800, Linus Torvalds wrote: > On Fri, Dec 6, 2019 at 12:28 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I think I found it. > > > > TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be > > completely wrong. And might not fix it. > > Well, it compiles, and it's obviously correct, so I've committed it. Thanks for the fix, works for me too. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/2] pipe: Fixes [ver #2] 2019-12-06 13:56 ` [PATCH 0/2] pipe: Fixes " David Sterba 2019-12-06 17:09 ` Linus Torvalds 2019-12-06 20:28 ` Linus Torvalds @ 2019-12-06 21:26 ` David Howells 2 siblings, 0 replies; 41+ messages in thread From: David Howells @ 2019-12-06 21:26 UTC (permalink / raw) To: Linus Torvalds Cc: dhowells, David Sterba, Eric Biggers, Al Viro, linux-fsdevel, Linux Kernel Mailing List Linus Torvalds <torvalds@linux-foundation.org> wrote: > I think I found it. > > TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be > completely wrong. And might not fix it. > > The first hunk is purely syntactic sugar - use the normal head/tail > order. The second/third hunk is what I think fixes the problem: > iter_file_splice_write() had the same buggy "let's cache > head/tail/mask" pattern as pipe_write() had. > > You can't cache them over a 'pipe_wait()' that drops the pipe lock, > and there's one in splice_from_pipe_next(). That seems to fix it. Tested-by: David Howells <dhowells@redhat.com> ^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2019-12-19 16:36 UTC | newest] Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-12-05 22:30 [PATCH 0/2] pipe: Fixes [ver #2] David Howells 2019-12-05 22:30 ` [PATCH 1/2] pipe: Remove assertion from pipe_poll() " David Howells 2019-12-05 22:30 ` [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() " David Howells 2019-12-05 23:58 ` Linus Torvalds 2019-12-06 13:56 ` [PATCH 0/2] pipe: Fixes " David Sterba 2019-12-06 17:09 ` Linus Torvalds 2019-12-06 17:42 ` Linus Torvalds 2019-12-06 18:59 ` Linus Torvalds 2019-12-07 21:31 ` Akemi Yagi 2019-12-08 16:45 ` Akemi Yagi 2019-12-08 18:04 ` Linus Torvalds 2019-12-09 3:07 ` Linus Torvalds 2019-12-06 20:28 ` Linus Torvalds 2019-12-06 21:04 ` Linus Torvalds 2019-12-07 3:50 ` Linus Torvalds 2019-12-07 4:01 ` Linus Torvalds 2019-12-07 22:47 ` Linus Torvalds 2019-12-09 9:53 ` Vincent Guittot 2019-12-09 17:48 ` Linus Torvalds 2019-12-09 17:57 ` Akemi Yagi 2019-12-09 18:18 ` Linus Torvalds 2019-12-09 18:24 ` Linus Torvalds 2019-12-18 20:59 ` Josh Triplett 2019-12-10 2:58 ` DJ Delorie 2019-12-10 14:38 ` Vincent Guittot 2019-12-10 17:39 ` Linus Torvalds 2019-12-11 18:09 ` DJ Delorie 2019-12-11 18:59 ` Linus Torvalds 2019-12-12 10:18 ` Konstantin Khlebnikov 2019-12-18 22:51 ` Linus Torvalds 2019-12-19 0:03 ` Josh Triplett 2019-12-19 0:14 ` Josh Triplett 2019-12-19 0:51 ` Linus Torvalds 2019-12-19 0:54 ` Linus Torvalds 2019-12-19 7:56 ` David Howells 2019-12-19 16:35 ` Linus Torvalds 2019-12-11 20:55 ` David Howells 2019-12-12 1:28 ` Linus Torvalds 2019-12-12 7:34 ` David Howells 2019-12-09 14:55 ` David Sterba 2019-12-06 21:26 ` David Howells
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).