linux-fsdevel.vger.kernel.org archive mirror
* [PATCH] pipe: fix hang when racing with a wakeup
@ 2020-10-01 15:58 Josef Bacik
  2020-10-01 17:28 ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Josef Bacik @ 2020-10-01 15:58 UTC (permalink / raw)
  To: viro, akpm, linux-fsdevel, torvalds

I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null.  This works by creating a pipe: the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.  The box that was hung had
the write side stuck here

[<0>] pipe_write+0x38b/0x600
[<0>] new_sync_write+0x108/0x180
[<0>] __kernel_write+0xd4/0x160
[<0>] kernel_write+0x74/0x110
[<0>] btrfs_ioctl_send+0xb51/0x1867
[<0>] _btrfs_ioctl_send+0xbf/0x100
[<0>] btrfs_ioctl+0x1d4c/0x3090
[<0>] __x64_sys_ioctl+0x83/0xb0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

and the read side stuck here

[<0>] pipe_wait+0xa4/0x100
[<0>] splice_from_pipe_next.part.0+0x33/0xe0
[<0>] __splice_from_pipe+0x6a/0x200
[<0>] splice_from_pipe+0x50/0x70
[<0>] do_splice+0x35c/0x7e0
[<0>] __x64_sys_splice+0x92/0x110
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using drgn to debug, I found that the pipe's rd_wait was empty (its list
head points to itself), but the wr_wait was not

>>> pipe.rd_wait.head
(struct list_head){
        .next = (struct list_head *)0xffff9d2e5a3ad8d0,
        .prev = (struct list_head *)0xffff9d2e5a3ad8d0,
}
>>> pipe.wr_wait.head
(struct list_head){
        .next = (struct list_head *)0xffffa8934111bd88,
        .prev = (struct list_head *)0xffffa89341143ba8,
}

>>> for e in list_for_each_entry('struct wait_queue_entry',
				 pipe.wr_wait.head.address_of_(), 'entry'):
...   task = Object(prog, 'struct task_struct',
		    address=e.private.value_())
...   print("pid {} state {}".format(task.pid, task.state))
...
pid (pid_t)3080640 state (volatile long)1
pid (pid_t)3080638 state (volatile long)1
>>> for e in list_for_each_entry('struct wait_queue_entry',
				 pipe.rd_wait.head.address_of_(), 'entry'):
...   task = Object(prog, 'struct task_struct',
		    address=e.private.value_())
...   print("pid {} state {}".format(task.pid, task.state))
...

The wr_wait has both the writer and the reader waiting, which is
expected since the pipe is full

>>> pipe.head
(unsigned int)179612
>>> pipe.tail
(unsigned int)179596
>>> pipe.max_usage
(unsigned int)16

(head - tail is 16, which matches max_usage), and the read side is only
waiting on wr_wait; rd_wait doesn't have our entry any more.

This can happen in the following way

WRITER					READER
pipe_write()				splice
  pipe_lock				pipe_lock()
    was_empty = true
    do the write
    break out of the loop
  pipe_unlock
					  consume what was written
					pipe_wait()
					prepare_to_wait(rd_wait)
					  set_task_state(INTERRUPTIBLE)
  wake_up(rd_wait)
    set_task_state(READER, RUNNING)
					prepare_to_wait(wr_wait)
					  set_task_state(INTERRUPTIBLE)
					pipe_unlock()
					schedule()

The problem is we're doing the prepare_to_wait, which sets our state
each time; however, we can be woken up by either reads or writes.  In
the case above we race with the WRITER waking us up, re-set our state
to INTERRUPTIBLE, and thus never break out of schedule().

Instead we need to set our state once, and then add ourselves to our
respective waitqueues.  This way if any of our waitqueues wake us up,
we'll have TASK_RUNNING set when we enter schedule() and will go right
back to check for whatever it is we're waiting on.
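
For clarity, this is what pipe_wait() looks like with the change applied
(reconstructed from the hunk below plus the full function body quoted
later in the thread; only the two prepare_to_wait() calls change):

void pipe_wait(struct pipe_inode_info *pipe)
{
	DEFINE_WAIT(rdwait);
	DEFINE_WAIT(wrwait);

	/*
	 * Set the task state exactly once, then add ourselves to both
	 * waitqueues.  A wakeup from either queue leaves us in
	 * TASK_RUNNING, so schedule() returns instead of sleeping forever.
	 */
	set_current_state(TASK_INTERRUPTIBLE);
	add_wait_queue(&pipe->rd_wait, &rdwait);
	add_wait_queue(&pipe->wr_wait, &wrwait);
	pipe_unlock(pipe);
	schedule();
	finish_wait(&pipe->rd_wait, &rdwait);
	finish_wait(&pipe->wr_wait, &wrwait);
	pipe_lock(pipe);
}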

I tested this patch with the test that hung, but the hang only happened on
one VM last night.  These tests run twice on 3 VMs every night, and this is
the first time I hit the problem, so it's a rare occurrence.

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/pipe.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 60dbee457143..8803a11cbc1b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -116,8 +116,9 @@ void pipe_wait(struct pipe_inode_info *pipe)
 	 * Pipes are system-local resources, so sleeping on them
 	 * is considered a noninteractive wait:
 	 */
-	prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
-	prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE);
+	add_wait_queue(&pipe->rd_wait, &rdwait);
+	add_wait_queue(&pipe->wr_wait, &wrwait);
 	pipe_unlock(pipe);
 	schedule();
 	finish_wait(&pipe->rd_wait, &rdwait);
-- 
2.26.2



* Re: [PATCH] pipe: fix hang when racing with a wakeup
  2020-10-01 15:58 [PATCH] pipe: fix hang when racing with a wakeup Josef Bacik
@ 2020-10-01 17:28 ` Linus Torvalds
  2020-10-01 17:41   ` Josef Bacik
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2020-10-01 17:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On Thu, Oct 1, 2020 at 8:58 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> The problem is we're doing the prepare_to_wait, which sets our state
> each time; however, we can be woken up by either reads or writes.  In
> the case above we race with the WRITER waking us up, re-set our state
> to INTERRUPTIBLE, and thus never break out of schedule().

Good catch of an interesting problem.

That said, the real problem here is that "pipe_wait()" is some nasty
garbage. I really wanted to convert all users to do a proper
wait-queue usage, but left the old structure alone.

Any normal "prepare_to_wait()" user should always check for the
condition that it's waiting for _after_ the prepare-to-wait call, but
the pipe code was traditionally all run under the pipe mutex, so it
should have no races at all, because it's completely invalid to use
"pipe_wait()" with anything that doesn't hold the mutex (both on the
sleeping and waking side).
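
The usual shape of that idiom, with wq and condition as stand-ins rather
than actual pipe code, is roughly:

	DEFINE_WAIT(wait);

	for (;;) {
		/* Set our state before checking, so a wakeup racing with
		 * the check leaves us TASK_RUNNING and schedule() returns
		 * right away instead of sleeping. */
		prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
		if (condition)
			break;
		schedule();
	}
	finish_wait(&wq, &wait);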

So pipe_wait() is kind of legacy and meant for all the silly and
complex UNIX domain connection things that nobody wants to touch.

The IO code was supposed to have been converted away from that pipe
mutex model, but it looks like I punted on splice, without thinking
about this issue.

So I think the *real* fix would be to convert the splice waiting code
to work even without holding the pipe mutex. Because honestly, I think
your patch fixes the problem, but not completely.

In particular, the pipe wakeup can happen from somebody that doesn't
hold the pipe mutex at all (see pipe_read(), and notice how it's doing
the __pipe_unlock() before it does the "if it was full, wake things
up"), so this whole sequence is racy:

   check if pipe is full
   pipe_wait() if it is

because the wakeup can happen in between those things, and
"pipe_wait()" has no way to know what it's waiting for, and the wakeup
event has already happened by the time it's called.

Let me think about this, but I think the right solution is to just get
rid of the splice use of pipe_wait.
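
Concretely, for the "wait until writable" side the race-free form folds
the condition into the wait itself (pipe_writable() is the existing helper
in fs/pipe.c); this is what the patch attached further down the thread
ends up doing:

	pipe_unlock(pipe);
	wait_event_interruptible(pipe->wr_wait, pipe_writable(pipe));
	pipe_lock(pipe);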

Do you have some half-way reliable way to trigger the issue for testing?

                 Linus


* Re: [PATCH] pipe: fix hang when racing with a wakeup
  2020-10-01 17:28 ` Linus Torvalds
@ 2020-10-01 17:41   ` Josef Bacik
  2020-10-01 18:38     ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Josef Bacik @ 2020-10-01 17:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On 10/1/20 1:28 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 8:58 AM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> The problem is we're doing the prepare_to_wait, which sets our state
>> each time; however, we can be woken up by either reads or writes.  In
>> the case above we race with the WRITER waking us up, re-set our state
>> to INTERRUPTIBLE, and thus never break out of schedule().
> 
> Good catch of an interesting problem.
> 
> That said, the real problem here is that "pipe_wait()" is some nasty
> garbage. I really wanted to convert all users to do a proper
> wait-queue usage, but left the old structure alone.
> 
> Any normal "prepare_to_wait()" user should always check for the
> condition that it's waiting for _after_ the prepare-to-wait call, but
> the pipe code was traditionally all run under the pipe mutex, so it
> should have no races at all, because it's completely invalid to use
> "pipe_wait()" with anything that doesn't hold the mutex (both on the
> sleeping and waking side).
> 
> So pipe_wait() is kind of legacy and meant for all the silly and
> complex UNIX domain connection things that nobody wants to touch.
> 
> The IO code was supposed to have been converted away from that pipe
> mutex model, but it looks like I punted on splice, without thinking
> about this issue.
> 
> So I think the *real* fix would be to convert the splice waiting code
> to work even without holding the pipe mutex. Because honestly, I think
> your patch fixes the problem, but not completely.
> 
> In particular, the pipe wakeup can happen from somebody that doesn't
> hold the pipe mutex at all (see pipe_read(), and notice how it's doing
> the __pipe_unlock() before it does the "if it was full, wake things
> up"), so this whole sequence is racy:
> 
>     check if pipe is full
>     pipe_wait() if it is
> 
> because the wakeup can happen in between those things, and
> "pipe_wait()" has no way to know what it's waiting for, and the wakeup
> event has already happened by the time it's called.

Yes, sorry, that's what I meant to convey in my changelog: the wakeup has to
happen outside the pipe_lock for this race to occur.

Generally I think this is the only real problem we have, because as you said we
only modify the tail/head under the pipe_lock, so they're not going to change
while we do our pipe_wait().  The problem happens when we get removed from the
waitqueue for a case that no longer applies to us.  My fix "fixes" the problem
in that we'll get a spurious wakeup, see that our condition isn't met, go back
to sleep, and then the next writer will wake us up for real.

Obviously not ideal, but I figured the simpler fix was better for stable, and 
then we could work out something better.

> 
> Let me think about this, but I think the right solution is to just get
> rid of the splice use of pipe_wait.
> 

Yeah we could just have the callers wait on the waitqueue they actually care 
about, that would be a simple solution that would also be cleaner.

> Do you have some half-way reliable way to trigger the issue for testing?
>
I do not, but I also didn't try.  We basically just have to break out of the
pipe_write() loop before waking anything up.  I'll rig up something that writes
one page at a time while the other side splices into /dev/null, run that in a
loop for a few minutes, and see if I can trigger it more reliably than my btrfs
send thing (a rough sketch of that kind of reproducer follows below).  Thanks,

Josef
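
A rough sketch of the kind of reproducer described above, in C, could look
like the following.  This is purely illustrative (the actual program was not
posted in the thread); the single-page writes are just meant to maximize the
number of wakeup/wait iterations, which is where the race window lives.

/* reproducer sketch: a writer thread fills a pipe one page at a time,
 * the main thread splices the read side into /dev/null.  On an affected
 * kernel both sides eventually hang as in the stack traces above.
 * Build with: gcc -pthread reproducer.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int pipefd[2];

static void *writer(void *arg)
{
	char page[4096];

	(void)arg;
	memset(page, 'x', sizeof(page));
	for (;;) {
		if (write(pipefd[1], page, sizeof(page)) != sizeof(page)) {
			perror("write");
			exit(1);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int nullfd;

	if (pipe(pipefd)) {
		perror("pipe");
		return 1;
	}
	nullfd = open("/dev/null", O_WRONLY);
	if (nullfd < 0) {
		perror("open /dev/null");
		return 1;
	}
	if (pthread_create(&tid, NULL, writer, NULL)) {
		fprintf(stderr, "pthread_create failed\n");
		return 1;
	}
	/* Drain the pipe forever; the hang shows up as this splice and the
	 * writer's write() both sleeping indefinitely. */
	for (;;) {
		ssize_t n = splice(pipefd[0], NULL, nullfd, NULL,
				   65536, SPLICE_F_MOVE);
		if (n < 0) {
			perror("splice");
			return 1;
		}
	}
}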


* Re: [PATCH] pipe: fix hang when racing with a wakeup
  2020-10-01 17:41   ` Josef Bacik
@ 2020-10-01 18:38     ` Linus Torvalds
  2020-10-01 19:52       ` Josef Bacik
  2020-10-01 21:08       ` Josef Bacik
  0 siblings, 2 replies; 6+ messages in thread
From: Linus Torvalds @ 2020-10-01 18:38 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Al Viro, Andrew Morton, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 724 bytes --]

On Thu, Oct 1, 2020 at 10:41 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Obviously not ideal, but I figured the simpler fix was better for stable, and
> then we could work out something better.

I think the attached is the proper fix, and it's not really any more
complicated.

The patch is bigger, but it's pretty obvious: get rid of the
non-specific "pipe_wait()", and replace them with specific versions
that wait for a particular thing.

NOTE! Entirely untested. It seems to build fine for me, and it _looks_
obvious, but I haven't actually rebooted to see if it works at all. I
don't think I have any real splice-heavy test cases.

Mind trying this out on the load that showed problems?

                    Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 4712 bytes --]

 fs/pipe.c                 | 62 +++++++++++++++++++++++++++++++----------------
 fs/splice.c               |  8 +++---
 include/linux/pipe_fs_i.h |  5 ++--
 3 files changed, 48 insertions(+), 27 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 60dbee457143..7639e1117521 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -106,25 +106,6 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	}
 }
 
-/* Drop the inode semaphore and wait for a pipe event, atomically */
-void pipe_wait(struct pipe_inode_info *pipe)
-{
-	DEFINE_WAIT(rdwait);
-	DEFINE_WAIT(wrwait);
-
-	/*
-	 * Pipes are system-local resources, so sleeping on them
-	 * is considered a noninteractive wait:
-	 */
-	prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
-	prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
-	pipe_unlock(pipe);
-	schedule();
-	finish_wait(&pipe->rd_wait, &rdwait);
-	finish_wait(&pipe->wr_wait, &wrwait);
-	pipe_lock(pipe);
-}
-
 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
@@ -1035,12 +1016,52 @@ SYSCALL_DEFINE1(pipe, int __user *, fildes)
 	return do_pipe2(fildes, 0);
 }
 
+/*
+ * This is the stupid "wait for pipe to be readable or writable"
+ * model.
+ *
+ * See pipe_read/write() for the proper kind of exclusive wait,
+ * but that requires that we wake up any other readers/writers
+ * if we then do not end up reading everything (ie the whole
+ * "wake_next_reader/writer" logic in pipe_read/write()).
+ */
+void pipe_wait_readable(struct pipe_inode_info *pipe)
+{
+	pipe_unlock(pipe);
+	wait_event_interruptible(pipe->rd_wait, pipe_readable(pipe));
+	pipe_lock(pipe);
+}
+
+void pipe_wait_writable(struct pipe_inode_info *pipe)
+{
+	pipe_unlock(pipe);
+	wait_event_interruptible(pipe->wr_wait, pipe_writable(pipe));
+	pipe_lock(pipe);
+}
+
+/*
+ * This depends on both the (here) wait and the wakeup (wake_up_partner)
+ * holding the pipe lock, so "*cnt" is stable and we know a wakeup cannot
+ * race with the count check and waitqueue prep.
+ *
+ * Normally in order to avoid races, you'd do the prepare_to_wait() first,
+ * then check the condition you're waiting for, and only then sleep. But
+ * because of the pipe lock, we can check the condition before being on
+ * the wait queue.
+ *
+ * We use the 'rd_wait' waitqueue for pipe partner waiting.
+ */
 static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 {
+	DEFINE_WAIT(rdwait);
 	int cur = *cnt;
 
 	while (cur == *cnt) {
-		pipe_wait(pipe);
+		prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
+		pipe_unlock(pipe);
+		schedule();
+		finish_wait(&pipe->rd_wait, &rdwait);
+		pipe_lock(pipe);
 		if (signal_pending(current))
 			break;
 	}
@@ -1050,7 +1071,6 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 static void wake_up_partner(struct pipe_inode_info *pipe)
 {
 	wake_up_interruptible_all(&pipe->rd_wait);
-	wake_up_interruptible_all(&pipe->wr_wait);
 }
 
 static int fifo_open(struct inode *inode, struct file *filp)
diff --git a/fs/splice.c b/fs/splice.c
index d7c8a7c4db07..c3d00dfc7344 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -563,7 +563,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_des
 			sd->need_wakeup = false;
 		}
 
-		pipe_wait(pipe);
+		pipe_wait_readable(pipe);
 	}
 
 	return 1;
@@ -1077,7 +1077,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 			return -EAGAIN;
 		if (signal_pending(current))
 			return -ERESTARTSYS;
-		pipe_wait(pipe);
+		pipe_wait_writable(pipe);
 	}
 }
 
@@ -1454,7 +1454,7 @@ static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 			ret = -EAGAIN;
 			break;
 		}
-		pipe_wait(pipe);
+		pipe_wait_readable(pipe);
 	}
 
 	pipe_unlock(pipe);
@@ -1493,7 +1493,7 @@ static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 			ret = -ERESTARTSYS;
 			break;
 		}
-		pipe_wait(pipe);
+		pipe_wait_writable(pipe);
 	}
 
 	pipe_unlock(pipe);
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 50afd0d0084c..5d2705f1d01c 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -240,8 +240,9 @@ extern unsigned int pipe_max_size;
 extern unsigned long pipe_user_pages_hard;
 extern unsigned long pipe_user_pages_soft;
 
-/* Drop the inode semaphore and wait for a pipe event, atomically */
-void pipe_wait(struct pipe_inode_info *pipe);
+/* Wait for a pipe to be readable/writable while dropping the pipe lock */
+void pipe_wait_readable(struct pipe_inode_info *);
+void pipe_wait_writable(struct pipe_inode_info *);
 
 struct pipe_inode_info *alloc_pipe_info(void);
 void free_pipe_info(struct pipe_inode_info *);


* Re: [PATCH] pipe: fix hang when racing with a wakeup
  2020-10-01 18:38     ` Linus Torvalds
@ 2020-10-01 19:52       ` Josef Bacik
  2020-10-01 21:08       ` Josef Bacik
  1 sibling, 0 replies; 6+ messages in thread
From: Josef Bacik @ 2020-10-01 19:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On 10/1/20 2:38 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 10:41 AM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> Obviously not ideal, but I figured the simpler fix was better for stable, and
>> then we could work out something better.
> 
> I think the attached is the proper fix, and it's not really any more
> complicated.
> 
> The patch is bigger, but it's pretty obvious: get rid of the
> non-specific "pipe_wait()", and replace them with specific versions
> that wait for a particular thing.
> 
> NOTE! Entirely untested. It seems to build fine for me, and it _looks_
> obvious, but I haven't actually rebooted to see if it works at all. I
> don't think I have any real splice-heavy test cases.
> 
> Mind trying this out on the load that showed problems?
> 

I wrote a simple reproducer and reproduced the problem on an unpatched kernel
in like 20 minutes.  I'm loading this up; I'll call it fixed in, say, a couple
of hours?  Thanks,

Josef


* Re: [PATCH] pipe: fix hang when racing with a wakeup
  2020-10-01 18:38     ` Linus Torvalds
  2020-10-01 19:52       ` Josef Bacik
@ 2020-10-01 21:08       ` Josef Bacik
  1 sibling, 0 replies; 6+ messages in thread
From: Josef Bacik @ 2020-10-01 21:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Andrew Morton, linux-fsdevel

On 10/1/20 2:38 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 10:41 AM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> Obviously not ideal, but I figured the simpler fix was better for stable, and
>> then we could work out something better.
> 
> I think the attached is the proper fix, and it's not really any more
> complicated.
> 
> The patch is bigger, but it's pretty obvious: get rid of the
> non-specific "pipe_wait()", and replace them with specific versions
> that wait for a particular thing.
> 
> NOTE! Entirely untested. It seems to build fine for me, and it _looks_
> obvious, but I haven't actually rebooted to see if it works at all. I
> don't think I have any real splice-heavy test cases.
> 
> Mind trying this out on the load that showed problems?
> 

It's been running for an hour without hanging, so you can add

Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>

Thanks!

Josef
