Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: viro@ZenIV.linux.org.uk, akpm@linux-foundation.org,
	linux-fsdevel@vger.kernel.org, torvalds@linux-foundation.org
Subject: [PATCH] pipe: fix hang when racing with a wakeup
Date: Thu,  1 Oct 2020 11:58:31 -0400
Message-ID: <bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com> (raw)

I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null.  This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.  The box that was hung had
the write side stuck here

[<0>] pipe_write+0x38b/0x600
[<0>] new_sync_write+0x108/0x180
[<0>] __kernel_write+0xd4/0x160
[<0>] kernel_write+0x74/0x110
[<0>] btrfs_ioctl_send+0xb51/0x1867
[<0>] _btrfs_ioctl_send+0xbf/0x100
[<0>] btrfs_ioctl+0x1d4c/0x3090
[<0>] __x64_sys_ioctl+0x83/0xb0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

and the read side stuck here

[<0>] pipe_wait+0xa4/0x100
[<0>] splice_from_pipe_next.part.0+0x33/0xe0
[<0>] __splice_from_pipe+0x6a/0x200
[<0>] splice_from_pipe+0x50/0x70
[<0>] do_splice+0x35c/0x7e0
[<0>] __x64_sys_splice+0x92/0x110
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using drgn to debug, I found that the pipe rd_wait was empty, but the
wr_wait was not

>>> pipe.rd_wait.head
(struct list_head){
        .next = (struct list_head *)0xffff9d2e5a3ad8d0,
        .prev = (struct list_head *)0xffff9d2e5a3ad8d0,
}
>>> pipe.wr_wait.head
(struct list_head){
        .next = (struct list_head *)0xffffa8934111bd88,
        .prev = (struct list_head *)0xffffa89341143ba8,
}

>>> for e in list_for_each_entry('struct wait_queue_entry',
				 pipe.wr_wait.head.address_of_(), 'entry'):
...   task = Object(prog, 'struct task_struct',
		    address=e.private.value_())
...   print("pid {} state {}".format(task.pid, task.state))
...
pid (pid_t)3080640 state (volatile long)1
pid (pid_t)3080638 state (volatile long)1
>>> for e in list_for_each_entry('struct wait_queue_entry',
				 pipe.rd_wait.head.address_of_(), 'entry'):
...   task = Object(prog, 'struct task_struct',
		    address=e.private.value_())
...   print("pid {} state {}".format(task.pid, task.state))
...

The wr_wait has both the writer and the reader waiting, which is
expected, the pipe is full

>>> pipe.head
(unsigned int)179612
>>> pipe.tail
(unsigned int)179596
>>> pipe.max_usage
(unsigned int)16

and the read side is only waiting on wr_wait, rd_wait doesn't have our
entry any more.

This can happen in the following way

WRITER					READER
pipe_write()				splice
  pipe_lock				pipe_lock()
    was_empty = true
    do the write
    break out of the loop
  pipe_unlock
					  consume what was written
					pipe_wait()
					prepare_to_wait(rd_wait)
					  set_task_state(INTERRUPTIBLE)
  wake_up(rd_wait)
    set_task_state(READER, RUNNING)
					prepare_to_wait(wr_wait)
					  set_task_state(INTERRUPTIBLE)
					pip_unlock()
					schedule()

The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes.  In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule.

Instead we need to set our state once, and then add ourselves to our
respective waitqueues.  This way if any of our waitqueues wake us up,
we'll have TASK_RUNNING set when we enter schedule() and will go right
back to check for whatever it is we're waiting on.

I tested this patch with the test that hung, but this only happened on
one vm last night, and these tests run twice on 3 vm's every night and
this is the first time I hit the problem, so it's a rare occurrence.

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/pipe.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 60dbee457143..8803a11cbc1b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -116,8 +116,9 @@ void pipe_wait(struct pipe_inode_info *pipe)
 	 * Pipes are system-local resources, so sleeping on them
 	 * is considered a noninteractive wait:
 	 */
-	prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
-	prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE);
+	add_wait_queue(&pipe->rd_wait, &rdwait);
+	add_wait_queue(&pipe->wr_wait, &wrwait);
 	pipe_unlock(pipe);
 	schedule();
 	finish_wait(&pipe->rd_wait, &rdwait);
-- 
2.26.2


             reply index

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-01 15:58 Josef Bacik [this message]
2020-10-01 17:28 ` Linus Torvalds
2020-10-01 17:41   ` Josef Bacik
2020-10-01 18:38     ` Linus Torvalds
2020-10-01 19:52       ` Josef Bacik
2020-10-01 21:08       ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com \
    --to=josef@toxicpanda.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@ZenIV.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git