* [PATCH] pipe: fix hang when racing with a wakeup
From: Josef Bacik <josef@toxicpanda.com> @ 2020-10-01 15:58 UTC
To: viro, akpm, linux-fsdevel, torvalds

I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null.  This works by creating a pipe: the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.

The box that was hung had the write side stuck here

[<0>] pipe_write+0x38b/0x600
[<0>] new_sync_write+0x108/0x180
[<0>] __kernel_write+0xd4/0x160
[<0>] kernel_write+0x74/0x110
[<0>] btrfs_ioctl_send+0xb51/0x1867
[<0>] _btrfs_ioctl_send+0xbf/0x100
[<0>] btrfs_ioctl+0x1d4c/0x3090
[<0>] __x64_sys_ioctl+0x83/0xb0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

and the read side stuck here

[<0>] pipe_wait+0xa4/0x100
[<0>] splice_from_pipe_next.part.0+0x33/0xe0
[<0>] __splice_from_pipe+0x6a/0x200
[<0>] splice_from_pipe+0x50/0x70
[<0>] do_splice+0x35c/0x7e0
[<0>] __x64_sys_splice+0x92/0x110
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using drgn to debug, I found that the pipe's rd_wait was empty, but the
wr_wait was not

>>> pipe.rd_wait.head
(struct list_head){
        .next = (struct list_head *)0xffff9d2e5a3ad8d0,
        .prev = (struct list_head *)0xffff9d2e5a3ad8d0,
}
>>> pipe.wr_wait.head
(struct list_head){
        .next = (struct list_head *)0xffffa8934111bd88,
        .prev = (struct list_head *)0xffffa89341143ba8,
}
>>> for e in list_for_each_entry('struct wait_queue_entry', pipe.wr_wait.head.address_of_(), 'entry'):
...     task = Object(prog, 'struct task_struct', address=e.private.value_())
...     print("pid {} state {}".format(task.pid, task.state))
...
pid (pid_t)3080640 state (volatile long)1
pid (pid_t)3080638 state (volatile long)1
>>> for e in list_for_each_entry('struct wait_queue_entry', pipe.rd_wait.head.address_of_(), 'entry'):
...     task = Object(prog, 'struct task_struct', address=e.private.value_())
...     print("pid {} state {}".format(task.pid, task.state))
...

The wr_wait queue has both the writer and the reader waiting, which is
expected since the pipe is full

>>> pipe.head
(unsigned int)179612
>>> pipe.tail
(unsigned int)179596
>>> pipe.max_usage
(unsigned int)16

but the read side is waiting only on wr_wait; rd_wait no longer has our
entry.  This can happen in the following way

  WRITER                              READER
                                      splice
  pipe_write()                          pipe_lock()
    pipe_lock
    was_empty = true
    do the write
    break out of the loop
    pipe_unlock
                                        consume what was written
                                        pipe_wait()
                                          prepare_to_wait(rd_wait)
                                            set_task_state(INTERRUPTIBLE)
    wake_up(rd_wait)
      set_task_state(READER, RUNNING)
                                          prepare_to_wait(wr_wait)
                                            set_task_state(INTERRUPTIBLE)
                                          pipe_unlock()
                                          schedule()

The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes.  In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule.  Instead
we need to set our state once, and then add ourselves to our respective
waitqueues.  This way, if any of our waitqueues wake us up, we'll have
TASK_RUNNING set when we enter schedule() and will go right back to
check for whatever it is we're waiting on.

I tested this patch with the test that hung, but it only happened on
one vm last night; these tests run twice on 3 vms every night and this
is the first time I hit the problem, so it's a rare occurrence.

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/pipe.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 60dbee457143..8803a11cbc1b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -116,8 +116,9 @@ void pipe_wait(struct pipe_inode_info *pipe)
 	 * Pipes are system-local resources, so sleeping on them
 	 * is considered a noninteractive wait:
 	 */
-	prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
-	prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE);
+	add_wait_queue(&pipe->rd_wait, &rdwait);
+	add_wait_queue(&pipe->wr_wait, &wrwait);
 	pipe_unlock(pipe);
 	schedule();
 	finish_wait(&pipe->rd_wait, &rdwait);
--
2.26.2
* Re: [PATCH] pipe: fix hang when racing with a wakeup
From: Linus Torvalds @ 2020-10-01 17:28 UTC
To: Josef Bacik; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On Thu, Oct 1, 2020 at 8:58 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> The problem is we're doing the prepare_to_wait, which sets our state
> each time, however we can be woken up either with reads or writes.  In
> the case above we race with the WRITER waking us up, and re-set our
> state to INTERRUPTIBLE, and thus never break out of schedule.

Good catch of an interesting problem.

That said, the real problem here is that "pipe_wait()" is some nasty
garbage.  I really wanted to convert all users to do a proper
wait-queue usage, but left the old structure alone.

Any normal "prepare_to_wait()" user should always check for the
condition that it's waiting for _after_ the prepare-to-wait call, but
the pipe code was traditionally all run under the pipe mutex, so it
should have no races at all, because it's completely invalid to use
"pipe_wait()" with anything that doesn't hold the mutex (both on the
sleeping and waking side).

So pipe_wait() is kind of legacy and meant for all the silly and
complex UNIX domain connection things that nobody wants to touch.

The IO code was supposed to have been converted away from that pipe
mutex model, but it looks like I punted on splice, without thinking
about this issue.

So I think the *real* fix would be to convert the splice waiting code
to work even without holding the pipe mutex.  Because honestly, I think
your patch fixes the problem, but not completely.

In particular, the pipe wakeup can happen from somebody that doesn't
hold the pipe mutex at all (see pipe_read(), and notice how it's doing
the __pipe_unlock() before it does the "if it was full, wake things
up" logic), so this whole sequence is racy:

    check if pipe is full
    pipe_wait() if it is

because the wakeup can happen in between those things, and
"pipe_wait()" has no way to know what it's waiting for, and the wakeup
event has already happened by the time it's called.

Let me think about this, but I think the right solution is to just get
rid of the splice use of pipe_wait.

Do you have some half-way reliable way to trigger the issue for testing?

              Linus
* Re: [PATCH] pipe: fix hang when racing with a wakeup
From: Josef Bacik @ 2020-10-01 17:41 UTC
To: Linus Torvalds; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On 10/1/20 1:28 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 8:58 AM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> The problem is we're doing the prepare_to_wait, which sets our state
>> each time, however we can be woken up either with reads or writes.  In
>> the case above we race with the WRITER waking us up, and re-set our
>> state to INTERRUPTIBLE, and thus never break out of schedule.
>
> Good catch of an interesting problem.
>
> That said, the real problem here is that "pipe_wait()" is some nasty
> garbage.  I really wanted to convert all users to do a proper
> wait-queue usage, but left the old structure alone.
>
> Any normal "prepare_to_wait()" user should always check for the
> condition that it's waiting for _after_ the prepare-to-wait call, but
> the pipe code was traditionally all run under the pipe mutex, so it
> should have no races at all, because it's completely invalid to use
> "pipe_wait()" with anything that doesn't hold the mutex (both on the
> sleeping and waking side).
>
> So pipe_wait() is kind of legacy and meant for all the silly and
> complex UNIX domain connection things that nobody wants to touch.
>
> The IO code was supposed to have been converted away from that pipe
> mutex model, but it looks like I punted on splice, without thinking
> about this issue.
>
> So I think the *real* fix would be to convert the splice waiting code
> to work even without holding the pipe mutex.  Because honestly, I think
> your patch fixes the problem, but not completely.
>
> In particular, the pipe wakeup can happen from somebody that doesn't
> hold the pipe mutex at all (see pipe_read(), and notice how it's doing
> the __pipe_unlock() before it does the "if it was full, wake things
> up" logic), so this whole sequence is racy:
>
>     check if pipe is full
>     pipe_wait() if it is
>
> because the wakeup can happen in between those things, and
> "pipe_wait()" has no way to know what it's waiting for, and the wakeup
> event has already happened by the time it's called.

Yes, sorry, that's what I meant to convey in my changelog: the wakeup
has to occur outside the pipe_lock for this to happen.  Generally I
think this is the only real problem we have, because as you said we
only modify the tail/head under the pipe_lock, so it's not going to
change while we do our pipe_wait().  The problem happens when we get
removed from a waitqueue for a case that no longer applies to us.

My fix "fixes" the problem in that we'll get a spurious wakeup, see
that our condition isn't met, and go back to sleep, and then the next
writer will wake us up for real.  Obviously not ideal, but I figured
the simpler fix was better for stable, and then we could work out
something better.

> Let me think about this, but I think the right solution is to just get
> rid of the splice use of pipe_wait.

Yeah, we could just have the callers wait on the waitqueue they
actually care about; that would be a simple solution that would also be
cleaner.

> Do you have some half-way reliable way to trigger the issue for testing?

I do not, but I also didn't try.  We basically just have to break out
of the pipe_write() loop before waking anything up.  I'll rig up
something that just writes one page at a time while the other side
splices into /dev/null, run that in a loop for a few minutes, and see
if I can trigger it more reliably than my btrfs send thing.

Thanks,

Josef
* Re: [PATCH] pipe: fix hang when racing with a wakeup
From: Linus Torvalds @ 2020-10-01 18:38 UTC
To: Josef Bacik; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On Thu, Oct 1, 2020 at 10:41 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Obviously not ideal, but I figured the simpler fix was better for
> stable, and then we could work out something better.

I think the attached is the proper fix, and it's not really any more
complicated.

The patch is bigger, but it's pretty obvious: get rid of the
non-specific "pipe_wait()", and replace them with specific versions
that wait for a particular thing.

NOTE! Entirely untested.  It seems to build fine for me, and it _looks_
obvious, but I haven't actually rebooted to see if it works at all.  I
don't think I have any real splice-heavy test cases.

Mind trying this out on the load that showed problems?

              Linus

[-- Attachment #2: patch --]

 fs/pipe.c                 | 62 +++++++++++++++++++++++++++++++----------------
 fs/splice.c               |  8 +++---
 include/linux/pipe_fs_i.h |  5 ++--
 3 files changed, 48 insertions(+), 27 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 60dbee457143..7639e1117521 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -106,25 +106,6 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	}
 }
 
-/* Drop the inode semaphore and wait for a pipe event, atomically */
-void pipe_wait(struct pipe_inode_info *pipe)
-{
-	DEFINE_WAIT(rdwait);
-	DEFINE_WAIT(wrwait);
-
-	/*
-	 * Pipes are system-local resources, so sleeping on them
-	 * is considered a noninteractive wait:
-	 */
-	prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
-	prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
-	pipe_unlock(pipe);
-	schedule();
-	finish_wait(&pipe->rd_wait, &rdwait);
-	finish_wait(&pipe->wr_wait, &wrwait);
-	pipe_lock(pipe);
-}
-
 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
@@ -1035,12 +1016,52 @@ SYSCALL_DEFINE1(pipe, int __user *, fildes)
 	return do_pipe2(fildes, 0);
 }
 
+/*
+ * This is the stupid "wait for pipe to be readable or writable"
+ * model.
+ *
+ * See pipe_read/write() for the proper kind of exclusive wait,
+ * but that requires that we wake up any other readers/writers
+ * if we then do not end up reading everything (ie the whole
+ * "wake_next_reader/writer" logic in pipe_read/write()).
+ */
+void pipe_wait_readable(struct pipe_inode_info *pipe)
+{
+	pipe_unlock(pipe);
+	wait_event_interruptible(pipe->rd_wait, pipe_readable(pipe));
+	pipe_lock(pipe);
+}
+
+void pipe_wait_writable(struct pipe_inode_info *pipe)
+{
+	pipe_unlock(pipe);
+	wait_event_interruptible(pipe->wr_wait, pipe_writable(pipe));
+	pipe_lock(pipe);
+}
+
+/*
+ * This depends on both the wait (here) and the wakeup (wake_up_partner)
+ * holding the pipe lock, so "*cnt" is stable and we know a wakeup cannot
+ * race with the count check and waitqueue prep.
+ *
+ * Normally in order to avoid races, you'd do the prepare_to_wait() first,
+ * then check the condition you're waiting for, and only then sleep.  But
+ * because of the pipe lock, we can check the condition before being on
+ * the wait queue.
+ *
+ * We use the 'rd_wait' waitqueue for pipe partner waiting.
+ */
 static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 {
+	DEFINE_WAIT(rdwait);
 	int cur = *cnt;
 
 	while (cur == *cnt) {
-		pipe_wait(pipe);
+		prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
+		pipe_unlock(pipe);
+		schedule();
+		finish_wait(&pipe->rd_wait, &rdwait);
+		pipe_lock(pipe);
 		if (signal_pending(current))
 			break;
 	}
@@ -1050,7 +1071,6 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 static void wake_up_partner(struct pipe_inode_info *pipe)
 {
 	wake_up_interruptible_all(&pipe->rd_wait);
-	wake_up_interruptible_all(&pipe->wr_wait);
 }
 
 static int fifo_open(struct inode *inode, struct file *filp)
diff --git a/fs/splice.c b/fs/splice.c
index d7c8a7c4db07..c3d00dfc7344 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -563,7 +563,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_des
 			sd->need_wakeup = false;
 		}
 
-		pipe_wait(pipe);
+		pipe_wait_readable(pipe);
 	}
 
 	return 1;
@@ -1077,7 +1077,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 			return -EAGAIN;
 		if (signal_pending(current))
 			return -ERESTARTSYS;
-		pipe_wait(pipe);
+		pipe_wait_writable(pipe);
 	}
 }
 
@@ -1454,7 +1454,7 @@ static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 			ret = -EAGAIN;
 			break;
 		}
-		pipe_wait(pipe);
+		pipe_wait_readable(pipe);
 	}
 	pipe_unlock(pipe);
 
@@ -1493,7 +1493,7 @@ static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 			ret = -ERESTARTSYS;
 			break;
 		}
-		pipe_wait(pipe);
+		pipe_wait_writable(pipe);
 	}
 	pipe_unlock(pipe);
 
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 50afd0d0084c..5d2705f1d01c 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -240,8 +240,9 @@ extern unsigned int pipe_max_size;
 extern unsigned long pipe_user_pages_hard;
 extern unsigned long pipe_user_pages_soft;
 
-/* Drop the inode semaphore and wait for a pipe event, atomically */
-void pipe_wait(struct pipe_inode_info *pipe);
+/* Wait for a pipe to be readable/writable while dropping the pipe lock */
+void pipe_wait_readable(struct pipe_inode_info *);
+void pipe_wait_writable(struct pipe_inode_info *);
 
 struct pipe_inode_info *alloc_pipe_info(void);
 void free_pipe_info(struct pipe_inode_info *);
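The new helpers are race-free because wait_event_interruptible() rechecks its condition after the task is on the waitqueue, so a wakeup delivered at any point after pipe_unlock() is never lost. The following is a simplified pseudocode sketch of what the kernel macro roughly expands to (not the literal implementation; freezing and error paths are trimmed):

```c
/* Simplified sketch of wait_event_interruptible(wq, cond); pseudocode. */
long wait_event_interruptible_sketch(wait_queue_head_t *wq, bool (*cond)(void))
{
	DEFINE_WAIT(entry);

	for (;;) {
		/* 1. Get on the queue and set the task state *first*... */
		prepare_to_wait(wq, &entry, TASK_INTERRUPTIBLE);
		/* 2. ...then recheck: a wakeup that raced in before this
		 *    point already made cond true, so we never block on it. */
		if (cond())
			break;
		if (signal_pending(current))
			break;
		/* 3. A wakeup arriving after prepare_to_wait() sets
		 *    TASK_RUNNING, so schedule() returns at once and we loop. */
		schedule();
	}
	finish_wait(wq, &entry);
	return signal_pending(current) ? -ERESTARTSYS : 0;
}
```

This is exactly the "check the condition _after_ the prepare-to-wait call" discipline Linus described earlier in the thread, applied per-waitqueue instead of the old pipe_wait()'s blind sleep on both queues.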
* Re: [PATCH] pipe: fix hang when racing with a wakeup
From: Josef Bacik @ 2020-10-01 19:52 UTC
To: Linus Torvalds; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On 10/1/20 2:38 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 10:41 AM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> Obviously not ideal, but I figured the simpler fix was better for
>> stable, and then we could work out something better.
>
> I think the attached is the proper fix, and it's not really any more
> complicated.
>
> The patch is bigger, but it's pretty obvious: get rid of the
> non-specific "pipe_wait()", and replace them with specific versions
> that wait for a particular thing.
>
> NOTE! Entirely untested.  It seems to build fine for me, and it _looks_
> obvious, but I haven't actually rebooted to see if it works at all.  I
> don't think I have any real splice-heavy test cases.
>
> Mind trying this out on the load that showed problems?

I wrote a simple reproducer and reproduced the problem on an unpatched
kernel in about 20 minutes.  I'm loading this up now; I'll call it
fixed in, say, a couple of hours?

Thanks,

Josef
* Re: [PATCH] pipe: fix hang when racing with a wakeup
From: Josef Bacik @ 2020-10-01 21:08 UTC
To: Linus Torvalds; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On 10/1/20 2:38 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 10:41 AM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> Obviously not ideal, but I figured the simpler fix was better for
>> stable, and then we could work out something better.
>
> I think the attached is the proper fix, and it's not really any more
> complicated.
>
> The patch is bigger, but it's pretty obvious: get rid of the
> non-specific "pipe_wait()", and replace them with specific versions
> that wait for a particular thing.
>
> NOTE! Entirely untested.  It seems to build fine for me, and it _looks_
> obvious, but I haven't actually rebooted to see if it works at all.  I
> don't think I have any real splice-heavy test cases.
>
> Mind trying this out on the load that showed problems?

It's been running for an hour without hanging, so you can add

Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>

Thanks!

Josef