* Hang due to nfs letting tasks freeze with locked inodes
@ 2016-07-06 17:46 Seth Forshee
  2016-07-06 22:07 ` Jeff Layton
  0 siblings, 1 reply; 19+ messages in thread
From: Seth Forshee @ 2016-07-06 17:46 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker
  Cc: linux-fsdevel, linux-nfs, linux-kernel, Tycho Andersen

We're seeing a hang when freezing a container with an nfs bind mount while
running iozone. Two iozone processes were hung with this stack trace.

 [<ffffffff81821b15>] schedule+0x35/0x80
 [<ffffffff81821dbe>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff818239f9>] __mutex_lock_slowpath+0xb9/0x130
 [<ffffffff81823a8f>] mutex_lock+0x1f/0x30
 [<ffffffff8121d00b>] do_unlinkat+0x12b/0x2d0
 [<ffffffff8121dc16>] SyS_unlink+0x16/0x20
 [<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71

This seems to be due to another iozone thread frozen during unlink with
this stack trace:

 [<ffffffff810e9cfa>] __refrigerator+0x7a/0x140
 [<ffffffffc08e80b8>] nfs4_handle_exception+0x118/0x130 [nfsv4]
 [<ffffffffc08e9efd>] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
 [<ffffffffc088a329>] nfs_unlink+0x149/0x350 [nfs]
 [<ffffffff81219bd1>] vfs_unlink+0xf1/0x1a0
 [<ffffffff8121d159>] do_unlinkat+0x279/0x2d0
 [<ffffffff8121dc16>] SyS_unlink+0x16/0x20
 [<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71

Since nfs is allowing the thread to be frozen with the inode locked, it's
preventing other threads trying to lock the same inode from freezing. It
seems like a bad idea for nfs to be doing this.

Can nfs do something different here to prevent this? Maybe use a
non-freezable sleep and let the operation complete, or else abort the
operation and return ERESTARTSYS?

Thanks,
Seth


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-06 17:46 Hang due to nfs letting tasks freeze with locked inodes Seth Forshee
@ 2016-07-06 22:07 ` Jeff Layton
  2016-07-07  3:55   ` Seth Forshee
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Jeff Layton @ 2016-07-06 22:07 UTC (permalink / raw)
  To: Seth Forshee, Trond Myklebust, Anna Schumaker
  Cc: linux-fsdevel, linux-nfs, linux-kernel, Tycho Andersen

On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> We're seeing a hang when freezing a container with an nfs bind mount while
> running iozone. Two iozone processes were hung with this stack trace.
> 
>  [] schedule+0x35/0x80
>  [] schedule_preempt_disabled+0xe/0x10
>  [] __mutex_lock_slowpath+0xb9/0x130
>  [] mutex_lock+0x1f/0x30
>  [] do_unlinkat+0x12b/0x2d0
>  [] SyS_unlink+0x16/0x20
>  [] entry_SYSCALL_64_fastpath+0x16/0x71
> 
> This seems to be due to another iozone thread frozen during unlink with
> this stack trace:
> 
>  [] __refrigerator+0x7a/0x140
>  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
>  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
>  [] nfs_unlink+0x149/0x350 [nfs]
>  [] vfs_unlink+0xf1/0x1a0
>  [] do_unlinkat+0x279/0x2d0
>  [] SyS_unlink+0x16/0x20
>  [] entry_SYSCALL_64_fastpath+0x16/0x71
> 
> Since nfs is allowing the thread to be frozen with the inode locked it's
> preventing other threads trying to lock the same inode from freezing. It
> seems like a bad idea for nfs to be doing this.
> 

Yeah, known problem. Not a simple one to fix though.

> Can nfs do something different here to prevent this? Maybe use a
> non-freezable sleep and let the operation complete, or else abort the
> operation and return ERESTARTSYS?

The problem with letting the op complete is that often by the time you
get to the point of trying to freeze processes, the network interfaces
are already shut down. So the operation you're waiting on might never
complete. Stuff like suspend operations on your laptop fail, leading to
fun bug reports like: "Oh, my laptop burned to a crisp inside my bag
because the suspend never completed."

You could (in principle) return something like -ERESTARTSYS iff the
call has not yet been transmitted. If it has already been transmitted,
then you might end up sending the call a second time (but not as an RPC
retransmission of course). If that call was non-idempotent then you end
up with all of _those_ sorts of problems.

Also, -ERESTARTSYS is not quite right as it doesn't always cause the
call to be restarted. It depends on the syscall. I think this would
probably need some other sort of syscall-restart machinery plumbed in.
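
For illustration only, the existing syscall-restart machinery looks roughly
like the sketch below: callers such as poll() and nanosleep() record their
progress in current->restart_block and return -ERESTART_RESTARTBLOCK, so the
syscall is re-entered through the saved function rather than from scratch
(and IIRC it only triggers on the signal-delivery path, which is part of why
it doesn't apply directly here). nfs_unlink_restart() is a hypothetical name.

#include <linux/sched.h>
#include <linux/errno.h>

/* Hypothetical restart function: resume the interrupted NFS op here. */
static long nfs_unlink_restart(struct restart_block *restart)
{
	return 0;
}

static long some_interruptible_nfs_op(void)
{
	struct restart_block *restart = &current->restart_block;

	/* Record how to resume, then bail out through the restart machinery. */
	restart->fn = nfs_unlink_restart;
	return -ERESTART_RESTARTBLOCK;
}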

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-06 22:07 ` Jeff Layton
@ 2016-07-07  3:55   ` Seth Forshee
  2016-07-07 10:29     ` Jeff Layton
  2016-07-07 23:53   ` Dave Chinner
  2016-07-08 12:22   ` Michal Hocko
  2 siblings, 1 reply; 19+ messages in thread
From: Seth Forshee @ 2016-07-07  3:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Trond Myklebust, Anna Schumaker, linux-fsdevel, linux-nfs,
	linux-kernel, Tycho Andersen

On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > We're seeing a hang when freezing a container with an nfs bind mount while
> > running iozone. Two iozone processes were hung with this stack trace.
> > 
> >  [] schedule+0x35/0x80
> >  [] schedule_preempt_disabled+0xe/0x10
> >  [] __mutex_lock_slowpath+0xb9/0x130
> >  [] mutex_lock+0x1f/0x30
> >  [] do_unlinkat+0x12b/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > This seems to be due to another iozone thread frozen during unlink with
> > this stack trace:
> > 
> >  [] __refrigerator+0x7a/0x140
> >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> >  [] nfs_unlink+0x149/0x350 [nfs]
> >  [] vfs_unlink+0xf1/0x1a0
> >  [] do_unlinkat+0x279/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > Since nfs is allowing the thread to be frozen with the inode locked it's
> > preventing other threads trying to lock the same inode from freezing. It
> > seems like a bad idea for nfs to be doing this.
> > 
> 
> Yeah, known problem. Not a simple one to fix though.
> 
> > Can nfs do something different here to prevent this? Maybe use a
> > non-freezable sleep and let the operation complete, or else abort the
> > operation and return ERESTARTSYS?
> 
> The problem with letting the op complete is that often by the time you
> get to the point of trying to freeze processes, the network interfaces
> are already shut down. So the operation you're waiting on might never
> complete. Stuff like suspend operations on your laptop fail, leading to
> fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> because the suspend never completed."
> 
> You could (in principle) return something like -ERESTARTSYS iff the
> call has not yet been transmitted. If it has already been transmitted,
> then you might end up sending the call a second time (but not as an RPC
> retransmission of course). If that call was non-idempotent then you end
> up with all of _those_ sorts of problems.
> 
> Also, -ERESTARTSYS is not quite right as it doesn't always cause the
> call to be restarted. It depends on the syscall. I think this would
> probably need some other sort of syscall-restart machinery plumbed in.

I don't really know much at all about how NFS works, so I hope you don't
mind indulging me in some questions.

What happens then if you suspend waiting for an op to complete and then
resume an hour later? Will it actually succeed or end up returning some
sort of "timed out" error?

If it's going to be an error (or even likely to be one) could the op
just be aborted immediately with an error code? It just seems like there
must be something better than potentially deadlocking the kernel.

Thanks,
Seth


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-07  3:55   ` Seth Forshee
@ 2016-07-07 10:29     ` Jeff Layton
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff Layton @ 2016-07-07 10:29 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Trond Myklebust, Anna Schumaker, linux-fsdevel, linux-nfs,
	linux-kernel, Tycho Andersen

On Wed, 2016-07-06 at 22:55 -0500, Seth Forshee wrote:
> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> > 
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > 
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > > 
> > >  [] schedule+0x35/0x80
> > >  [] schedule_preempt_disabled+0xe/0x10
> > >  [] __mutex_lock_slowpath+0xb9/0x130
> > >  [] mutex_lock+0x1f/0x30
> > >  [] do_unlinkat+0x12b/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > > 
> > >  [] __refrigerator+0x7a/0x140
> > >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > >  [] nfs_unlink+0x149/0x350 [nfs]
> > >  [] vfs_unlink+0xf1/0x1a0
> > >  [] do_unlinkat+0x279/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > > 
> > Yeah, known problem. Not a simple one to fix though.
> > 
> > > 
> > > Can nfs do something different here to prevent this? Maybe use a
> > > non-freezable sleep and let the operation complete, or else abort the
> > > operation and return ERESTARTSYS?
> > The problem with letting the op complete is that often by the time you
> > get to the point of trying to freeze processes, the network interfaces
> > are already shut down. So the operation you're waiting on might never
> > complete. Stuff like suspend operations on your laptop fail, leading to
> > fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> > because the suspend never completed."
> > 
> > You could (in principle) return something like -ERESTARTSYS iff the
> > call has not yet been transmitted. If it has already been transmitted,
> > then you might end up sending the call a second time (but not as an RPC
> > retransmission of course). If that call was non-idempotent then you end
> > up with all of _those_ sorts of problems.
> > 
> > Also, -ERESTARTSYS is not quite right as it doesn't always cause the
> > call to be restarted. It depends on the syscall. I think this would
> > probably need some other sort of syscall-restart machinery plumbed in.
> I don't really know much at all about how NFS works, so I hope you don't
> mind indulging me in some questions.
> 
> What happens then if you suspend waiting for an op to complete and then
> resume an hour later? Will it actually succeed or end up returning some
> sort of "timed out" error?
> 

Well, the RPC would likely time out. The RPC engine would then likely
end up retransmitting it. What happens at that point depends on a lot
of different factors -- what sort of call it was and how the server
behaves, whether it's NFSv3 or v4, etc...

If it was an idempotent call or the server still has the reply in its
duplicate reply cache, then everything "just works". If it's non-
idempotent or relies on some now-expired state, then you might get an
error because the same call ended up getting retransmitted or the state
that it relies on is now gone.

> If it's going to be an error (or even likely to be one) could the op
> just be aborted immediately with an error code? It just seems like there
> must be something better than potentially deadlocking the kernel.
> 

Not without breaking "hard" retry semantics. We had discussed at one
point adding a 3rd alternative to hard vs. soft mount options
(squishy?) that would do more or less what you suggest: allow syscalls
to return an error when the task is being frozen. You'd only really
want to do that though if you've already transmitted the call, waited
for a while (several seconds) and didn't get a reply. If the call
hasn't been transmitted yet, then you'd want to restart the call
from scratch after unfreezing (à la something like ERESTARTSYS).
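
As a rough sketch (not real code; both inputs are hypothetical stand-ins
for state that actually lives in the RPC layer), the "squishy" decision
above would branch something like this:

#include <linux/errno.h>
#include <linux/freezer.h>
#include <linux/sched.h>

static int squishy_freeze_disposition(bool transmitted, bool reply_overdue)
{
	if (!freezing(current))
		return 0;		/* not being frozen: keep waiting */
	if (!transmitted)
		return -ERESTARTSYS;	/* nothing on the wire: restart after thaw */
	if (reply_overdue)
		return -EIO;		/* waited several seconds already: give up */
	return 0;			/* recently sent: wait a little longer */
}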

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-06 22:07 ` Jeff Layton
  2016-07-07  3:55   ` Seth Forshee
@ 2016-07-07 23:53   ` Dave Chinner
  2016-07-08 11:33     ` Jeff Layton
  2016-07-08 12:48     ` Seth Forshee
  2016-07-08 12:22   ` Michal Hocko
  2 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2016-07-07 23:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > We're seeing a hang when freezing a container with an nfs bind mount while
> > running iozone. Two iozone processes were hung with this stack trace.
> > 
> >  [] schedule+0x35/0x80
> >  [] schedule_preempt_disabled+0xe/0x10
> >  [] __mutex_lock_slowpath+0xb9/0x130
> >  [] mutex_lock+0x1f/0x30
> >  [] do_unlinkat+0x12b/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > This seems to be due to another iozone thread frozen during unlink with
> > this stack trace:
> > 
> >  [] __refrigerator+0x7a/0x140
> >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> >  [] nfs_unlink+0x149/0x350 [nfs]
> >  [] vfs_unlink+0xf1/0x1a0
> >  [] do_unlinkat+0x279/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > Since nfs is allowing the thread to be frozen with the inode locked it's
> > preventing other threads trying to lock the same inode from freezing. It
> > seems like a bad idea for nfs to be doing this.
> > 
> 
> Yeah, known problem. Not a simple one to fix though.

Actually, it is simple to fix.

<insert broken record about suspend should be using freeze_super(),
not sys_sync(), to suspend filesystem operations>

i.e. the VFS blocks new operations from starting, and then the
NFS client simply needs to implement ->freeze_fs to drain all its
active operations before returning. Problem solved.
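
As a minimal sketch (not a real patch), that would look something like the
following; nfs_drain_active_rpcs() and nfs_resume_rpcs() are hypothetical
helpers standing in for the draining logic that would actually need to be
written:

#include <linux/fs.h>
#include <linux/nfs_fs.h>

static int nfs_freeze_fs(struct super_block *sb)
{
	struct nfs_server *server = NFS_SB(sb);

	/* Hypothetical: stop issuing new RPCs, wait for in-flight ones. */
	return nfs_drain_active_rpcs(server);
}

static int nfs_unfreeze_fs(struct super_block *sb)
{
	/* Hypothetical: allow RPC traffic again. */
	nfs_resume_rpcs(NFS_SB(sb));
	return 0;
}

static const struct super_operations nfs_freeze_sops = {
	/* ...the existing nfs super_operations, plus: */
	.freeze_fs	= nfs_freeze_fs,
	.unfreeze_fs	= nfs_unfreeze_fs,
};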

> > Can nfs do something different here to prevent this? Maybe use a
> > non-freezable sleep and let the operation complete, or else abort the
> > operation and return ERESTARTSYS?
> 
> The problem with letting the op complete is that often by the time you
> get to the point of trying to freeze processes, the network interfaces
> are already shut down. So the operation you're waiting on might never
> complete. Stuff like suspend operations on your laptop fail, leading to
> fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> because the suspend never completed."

Yup, precisely the sort of problems we've had over the past 10 years
with XFS because we do lots of stuff asynchronously in the background
(just like NFS) and hence sys_sync() isn't sufficient to quiesce a
filesystem's operations.

But I'm used to being ignored on this topic (for almost 10 years,
now!). Indeed, it's been made clear in the past that I know
absolutely nothing about what is needed to be done to safely
suspend filesystem operations...  :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-07 23:53   ` Dave Chinner
@ 2016-07-08 11:33     ` Jeff Layton
  2016-07-08 12:48     ` Seth Forshee
  1 sibling, 0 replies; 19+ messages in thread
From: Jeff Layton @ 2016-07-08 11:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri, 2016-07-08 at 09:53 +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind
> > > mount while
> > > running iozone. Two iozone processes were hung with this stack
> > > trace.
> > > 
> > >  [] schedule+0x35/0x80
> > >  [] schedule_preempt_disabled+0xe/0x10
> > >  [] __mutex_lock_slowpath+0xb9/0x130
> > >  [] mutex_lock+0x1f/0x30
> > >  [] do_unlinkat+0x12b/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > This seems to be due to another iozone thread frozen during
> > > unlink with
> > > this stack trace:
> > > 
> > >  [] __refrigerator+0x7a/0x140
> > >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > >  [] nfs_unlink+0x149/0x350 [nfs]
> > >  [] vfs_unlink+0xf1/0x1a0
> > >  [] do_unlinkat+0x279/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > Since nfs is allowing the thread to be frozen with the inode
> > > locked it's
> > > preventing other threads trying to lock the same inode from
> > > freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > > 
> > 
> > Yeah, known problem. Not a simple one to fix though.
> 
> Actually, it is simple to fix.
> 
> <insert broken record about suspend should be using freeze_super(),
> not sys_sync(), to suspend filesystem operations>
> 
> i.e. the VFS blocks new operations from starting, and then then the
> NFS client simply needs to implement ->freeze_fs to drain all it's
> active operations before returning. Problem solved.
> 

Not a bad idea. In the case of NFS, though, I'm not sure we'd actually
do anything different than what we're doing now. Part of the problem
is that by the time...

FWIW, we already have CONFIG_SUSPEND_SKIP_SYNC. It might be worth
experimenting with a CONFIG_SUSPEND_FREEZE_FS that does what you
suggest?
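
On the suspend side, such an experiment might boil down to something like
the (very rough, hypothetical) sketch below: freeze_super() blocks new
writers and calls the filesystem's ->freeze_fs if it implements one, and
thaw_super() undoes it on resume. Locking (you can't simply call
freeze_super() from inside an iterate_supers() callback, which already
holds s_umount) and error handling are glossed over entirely.

#include <linux/fs.h>

static int suspend_quiesce_sb(struct super_block *sb)
{
	if (!sb->s_root || (sb->s_flags & MS_RDONLY))
		return 0;
	return freeze_super(sb);	/* -EBUSY if already frozen */
}

static void resume_unquiesce_sb(struct super_block *sb)
{
	thaw_super(sb);
}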

> > > Can nfs do something different here to prevent this? Maybe use a
> > > non-freezable sleep and let the operation complete, or else abort
> > > the
> > > operation and return ERESTARTSYS?
> > 
> > The problem with letting the op complete is that often by the time
> > you
> > get to the point of trying to freeze processes, the network
> > interfaces
> > are already shut down. So the operation you're waiting on might
> > never
> > complete. Stuff like suspend operations on your laptop fail,
> > leading to
> > fun bug reports like: "Oh, my laptop burned to crisp inside my bag
> > because the suspend never completed."
> 
> Yup, precisely the sort of problems we've had over the past 10 years
> with XFS because we do lots of stuff aynchronously in the background
> (just like NFS) and hence sys_sync() isn't sufficient to quiesce a
> filesystem's operations.
> 

Yeah, adding a freeze_fs operation for NFS (and using that during
suspend) sounds reasonable at first blush. I can probably trawl the
archives to better understand, but what are the arguments against doing
that? Is it just that freeze_fs is relatively new and the
suspend/resume subsystems haven't caught up?

> But I'm used to being ignored on this topic (for almost 10 years,
> now!). Indeed, it's been made clear in the past that I know
> absolutely nothing about what is needed to be done to safely
> suspend filesystem operations...  :/
> 
> Cheers,
> 
> Dave.
-- 

Jeff Layton <jlayton@redhat.com>


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-06 22:07 ` Jeff Layton
  2016-07-07  3:55   ` Seth Forshee
  2016-07-07 23:53   ` Dave Chinner
@ 2016-07-08 12:22   ` Michal Hocko
  2016-07-08 12:47     ` Seth Forshee
  2016-07-08 12:51     ` Jeff Layton
  2 siblings, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2016-07-08 12:22 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Wed 06-07-16 18:07:18, Jeff Layton wrote:
> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > We're seeing a hang when freezing a container with an nfs bind mount while
> > running iozone. Two iozone processes were hung with this stack trace.
> > 
> >  [] schedule+0x35/0x80
> >  [] schedule_preempt_disabled+0xe/0x10
> >  [] __mutex_lock_slowpath+0xb9/0x130
> >  [] mutex_lock+0x1f/0x30
> >  [] do_unlinkat+0x12b/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > This seems to be due to another iozone thread frozen during unlink with
> > this stack trace:
> > 
> >  [] __refrigerator+0x7a/0x140
> >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> >  [] nfs_unlink+0x149/0x350 [nfs]
> >  [] vfs_unlink+0xf1/0x1a0
> >  [] do_unlinkat+0x279/0x2d0
> >  [] SyS_unlink+0x16/0x20
> >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > 
> > Since nfs is allowing the thread to be frozen with the inode locked it's
> > preventing other threads trying to lock the same inode from freezing. It
> > seems like a bad idea for nfs to be doing this.
> > 
> 
> Yeah, known problem. Not a simple one to fix though.

Apart from the alternative Dave mentioned in another email, what is the
point of using a freezable wait from this path in the first place?

nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
that does wait_on_bit_action with TASK_KILLABLE, so we are waiting in two
different modes from the same path AFAICS. There do not seem to be other
callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
something is not quite right here to me. If nfs4_delay did a regular
wait then the freezing would fail as well, but at least it would be clear
who the culprit is rather than having an indirect dependency.
-- 
Michal Hocko
SUSE Labs


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 12:22   ` Michal Hocko
@ 2016-07-08 12:47     ` Seth Forshee
  2016-07-08 12:51     ` Jeff Layton
  1 sibling, 0 replies; 19+ messages in thread
From: Seth Forshee @ 2016-07-08 12:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri, Jul 08, 2016 at 02:22:24PM +0200, Michal Hocko wrote:
> On Wed 06-07-16 18:07:18, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > > 
> > >  [] schedule+0x35/0x80
> > >  [] schedule_preempt_disabled+0xe/0x10
> > >  [] __mutex_lock_slowpath+0xb9/0x130
> > >  [] mutex_lock+0x1f/0x30
> > >  [] do_unlinkat+0x12b/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > > 
> > >  [] __refrigerator+0x7a/0x140
> > >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > >  [] nfs_unlink+0x149/0x350 [nfs]
> > >  [] vfs_unlink+0xf1/0x1a0
> > >  [] do_unlinkat+0x279/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > > 
> > 
> > Yeah, known problem. Not a simple one to fix though.
> 
> Apart from alternative Dave was mentioning in other email, what is the
> point to use freezable wait from this path in the first place?
> 
> nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
> that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
> different modes from the same path AFAICS. There do not seem to be other
> callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
> something is not quite right here to me. If the nfs4_delay did regular
> wait then the freezing would fail as well but at least it would be clear
> who is the culrprit rather than having an indirect dependency.

It turns out there are more paths than this one doing a freezable wait,
and they're all also killable. This leads me to a slightly different
question from yours: why can nfs give up waiting in the case of a signal
but not when the task is frozen?

I know the changes below aren't "correct," but I've been experimenting
with them anyway to see what would happen. So far things seem to be
fine, and the deadlock is gone. That should give you an idea of all the
places I found using a freezable wait.

Seth


diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f714b98..62dbe59 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -77,8 +77,8 @@ nfs_fattr_to_ino_t(struct nfs_fattr *fattr)
  */
 int nfs_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
-	if (signal_pending_state(mode, current))
+	schedule();
+	if (signal_pending_state(mode, current) || freezing(current))
 		return -ERESTARTSYS;
 	return 0;
 }
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index cb28cce..2315183 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -35,9 +35,9 @@ nfs3_rpc_wrapper(struct rpc_clnt *clnt, struct rpc_message *msg, int flags)
 		res = rpc_call_sync(clnt, msg, flags);
 		if (res != -EJUKEBOX)
 			break;
-		freezable_schedule_timeout_killable_unsafe(NFS_JUKEBOX_RETRY_TIME);
+		schedule_timeout_killable(NFS_JUKEBOX_RETRY_TIME);
 		res = -ERESTARTSYS;
-	} while (!fatal_signal_pending(current));
+	} while (!fatal_signal_pending(current) && !freezing(current));
 	return res;
 }
 
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 98a4415..0dad2fb 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -334,9 +334,8 @@ static int nfs4_delay(struct rpc_clnt *clnt, long *timeout)
 
 	might_sleep();
 
-	freezable_schedule_timeout_killable_unsafe(
-		nfs4_update_delay(timeout));
-	if (fatal_signal_pending(current))
+	schedule_timeout_killable(nfs4_update_delay(timeout));
+	if (fatal_signal_pending(current) || freezing(current))
 		res = -ERESTARTSYS;
 	return res;
 }
@@ -5447,7 +5446,7 @@ int nfs4_proc_delegreturn(struct inode *inode, struct rpc_cred *cred, const nfs4
 static unsigned long
 nfs4_set_lock_task_retry(unsigned long timeout)
 {
-	freezable_schedule_timeout_killable_unsafe(timeout);
+	schedule_timeout_killable(timeout);
 	timeout <<= 1;
 	if (timeout > NFS4_LOCK_MAXTIMEOUT)
 		return NFS4_LOCK_MAXTIMEOUT;
@@ -6148,7 +6147,7 @@ nfs4_proc_lock(struct file *filp, int cmd, struct file_lock *request)
 			break;
 		timeout = nfs4_set_lock_task_retry(timeout);
 		status = -ERESTARTSYS;
-		if (signalled())
+		if (signalled() || freezing(current))
 			break;
 	} while(status < 0);
 	return status;
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 73ad57a..0218dc2 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -252,8 +252,8 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-	freezable_schedule_unsafe();
-	if (signal_pending_state(mode, current))
+	schedule();
+	if (signal_pending_state(mode, current) || freezing(current))
 		return -ERESTARTSYS;
 	return 0;
 }


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-07 23:53   ` Dave Chinner
  2016-07-08 11:33     ` Jeff Layton
@ 2016-07-08 12:48     ` Seth Forshee
  2016-07-08 12:55       ` Trond Myklebust
  1 sibling, 1 reply; 19+ messages in thread
From: Seth Forshee @ 2016-07-08 12:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri, Jul 08, 2016 at 09:53:30AM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > > 
> > >  [] schedule+0x35/0x80
> > >  [] schedule_preempt_disabled+0xe/0x10
> > >  [] __mutex_lock_slowpath+0xb9/0x130
> > >  [] mutex_lock+0x1f/0x30
> > >  [] do_unlinkat+0x12b/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > > 
> > >  [] __refrigerator+0x7a/0x140
> > >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > >  [] nfs_unlink+0x149/0x350 [nfs]
> > >  [] vfs_unlink+0xf1/0x1a0
> > >  [] do_unlinkat+0x279/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > > 
> > 
> > Yeah, known problem. Not a simple one to fix though.
> 
> Actually, it is simple to fix.
> 
> <insert broken record about suspend should be using freeze_super(),
> not sys_sync(), to suspend filesystem operations>
> 
> i.e. the VFS blocks new operations from starting, and then then the
> NFS client simply needs to implement ->freeze_fs to drain all it's
> active operations before returning. Problem solved.

No, this won't solve my problem. We're not doing a full suspend, rather
using a freezer cgroup to freeze a subset of processes. We don't want
to fully freeze the filesystem.

Thanks,
Seth


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 12:22   ` Michal Hocko
  2016-07-08 12:47     ` Seth Forshee
@ 2016-07-08 12:51     ` Jeff Layton
  2016-07-08 14:23       ` Michal Hocko
  1 sibling, 1 reply; 19+ messages in thread
From: Jeff Layton @ 2016-07-08 12:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> On Wed 06-07-16 18:07:18, Jeff Layton wrote:
> > 
> > On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> > > 
> > > We're seeing a hang when freezing a container with an nfs bind mount while
> > > running iozone. Two iozone processes were hung with this stack trace.
> > > 
> > >  [] schedule+0x35/0x80
> > >  [] schedule_preempt_disabled+0xe/0x10
> > >  [] __mutex_lock_slowpath+0xb9/0x130
> > >  [] mutex_lock+0x1f/0x30
> > >  [] do_unlinkat+0x12b/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > This seems to be due to another iozone thread frozen during unlink with
> > > this stack trace:
> > > 
> > >  [] __refrigerator+0x7a/0x140
> > >  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> > >  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> > >  [] nfs_unlink+0x149/0x350 [nfs]
> > >  [] vfs_unlink+0xf1/0x1a0
> > >  [] do_unlinkat+0x279/0x2d0
> > >  [] SyS_unlink+0x16/0x20
> > >  [] entry_SYSCALL_64_fastpath+0x16/0x71
> > > 
> > > Since nfs is allowing the thread to be frozen with the inode locked it's
> > > preventing other threads trying to lock the same inode from freezing. It
> > > seems like a bad idea for nfs to be doing this.
> > > 
> > Yeah, known problem. Not a simple one to fix though.
> Apart from alternative Dave was mentioning in other email, what is the
> point to use freezable wait from this path in the first place?
> 
> nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
> that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
> different modes from the same path AFAICS. There do not seem to be other
> callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
> something is not quite right here to me. If the nfs4_delay did regular
> wait then the freezing would fail as well but at least it would be clear
> who is the culrprit rather than having an indirect dependency.

The codepaths involved there are a lot more complex than that
unfortunately.

nfs4_delay is the function that we use to handle the case where the
server returns NFS4ERR_DELAY. Basically telling us that it's too busy
right now or has some transient error and the client should retry after
a small, sliding delay.

That codepath could probably be made more freezer-safe. The typical
case however, is that we've sent a call and just haven't gotten a
reply. That's the trickier one to handle.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 12:48     ` Seth Forshee
@ 2016-07-08 12:55       ` Trond Myklebust
  2016-07-08 13:05         ` Trond Myklebust
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2016-07-08 12:55 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Chinner Dave, Jeff Layton, Schumaker Anna, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen


> On Jul 8, 2016, at 08:48, Seth Forshee <seth.forshee@canonical.com> wrote:
> 
> On Fri, Jul 08, 2016 at 09:53:30AM +1000, Dave Chinner wrote:
>> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
>>> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
>>>> We're seeing a hang when freezing a container with an nfs bind mount while
>>>> running iozone. Two iozone processes were hung with this stack trace.
>>>> 
>>>>  [] schedule+0x35/0x80
>>>>  [] schedule_preempt_disabled+0xe/0x10
>>>>  [] __mutex_lock_slowpath+0xb9/0x130
>>>>  [] mutex_lock+0x1f/0x30
>>>>  [] do_unlinkat+0x12b/0x2d0
>>>>  [] SyS_unlink+0x16/0x20
>>>>  [] entry_SYSCALL_64_fastpath+0x16/0x71
>>>> 
>>>> This seems to be due to another iozone thread frozen during unlink with
>>>> this stack trace:
>>>> 
>>>>  [] __refrigerator+0x7a/0x140
>>>>  [] nfs4_handle_exception+0x118/0x130 [nfsv4]
>>>>  [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
>>>>  [] nfs_unlink+0x149/0x350 [nfs]
>>>>  [] vfs_unlink+0xf1/0x1a0
>>>>  [] do_unlinkat+0x279/0x2d0
>>>>  [] SyS_unlink+0x16/0x20
>>>>  [] entry_SYSCALL_64_fastpath+0x16/0x71
>>>> 
>>>> Since nfs is allowing the thread to be frozen with the inode locked it's
>>>> preventing other threads trying to lock the same inode from freezing. It
>>>> seems like a bad idea for nfs to be doing this.
>>>> 
>>> 
>>> Yeah, known problem. Not a simple one to fix though.
>> 
>> Actually, it is simple to fix.
>> 
>> <insert broken record about suspend should be using freeze_super(),
>> not sys_sync(), to suspend filesystem operations>
>> 
>> i.e. the VFS blocks new operations from starting, and then then the
>> NFS client simply needs to implement ->freeze_fs to drain all it's
>> active operations before returning. Problem solved.
> 
> No, this won't solve my problem. We're not doing a full suspend, rather
> using a freezer cgroup to freeze a subset of processes. We don't want to
> want to fully freeze the filesystem.

…and therein lies the rub. The whole cgroup freezer stuff assumes that you can safely deactivate a bunch of processes that may or may not hold state in the filesystem. That’s definitely not OK when you hold locks, etc., that can affect processes that lie outside the cgroup (and/or outside the NFS client itself).

Cheers
  Trond


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 12:55       ` Trond Myklebust
@ 2016-07-08 13:05         ` Trond Myklebust
  2016-07-11  1:20           ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2016-07-08 13:05 UTC (permalink / raw)
  To: Seth Forshee
  Cc: Chinner Dave, Jeff Layton, Schumaker Anna, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen


> On Jul 8, 2016, at 08:55, Trond Myklebust <trondmy@primarydata.com> wrote:
> 
> 
>> On Jul 8, 2016, at 08:48, Seth Forshee <seth.forshee@canonical.com> wrote:
>> 
>> On Fri, Jul 08, 2016 at 09:53:30AM +1000, Dave Chinner wrote:
>>> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
>>>> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
>>>>> We're seeing a hang when freezing a container with an nfs bind mount while
>>>>> running iozone. Two iozone processes were hung with this stack trace.
>>>>> 
>>>>> [] schedule+0x35/0x80
>>>>> [] schedule_preempt_disabled+0xe/0x10
>>>>> [] __mutex_lock_slowpath+0xb9/0x130
>>>>> [] mutex_lock+0x1f/0x30
>>>>> [] do_unlinkat+0x12b/0x2d0
>>>>> [] SyS_unlink+0x16/0x20
>>>>> [] entry_SYSCALL_64_fastpath+0x16/0x71
>>>>> 
>>>>> This seems to be due to another iozone thread frozen during unlink with
>>>>> this stack trace:
>>>>> 
>>>>> [] __refrigerator+0x7a/0x140
>>>>> [] nfs4_handle_exception+0x118/0x130 [nfsv4]
>>>>> [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
>>>>> [] nfs_unlink+0x149/0x350 [nfs]
>>>>> [] vfs_unlink+0xf1/0x1a0
>>>>> [] do_unlinkat+0x279/0x2d0
>>>>> [] SyS_unlink+0x16/0x20
>>>>> [] entry_SYSCALL_64_fastpath+0x16/0x71
>>>>> 
>>>>> Since nfs is allowing the thread to be frozen with the inode locked it's
>>>>> preventing other threads trying to lock the same inode from freezing. It
>>>>> seems like a bad idea for nfs to be doing this.
>>>>> 
>>>> 
>>>> Yeah, known problem. Not a simple one to fix though.
>>> 
>>> Actually, it is simple to fix.
>>> 
>>> <insert broken record about suspend should be using freeze_super(),
>>> not sys_sync(), to suspend filesystem operations>
>>> 
>>> i.e. the VFS blocks new operations from starting, and then then the
>>> NFS client simply needs to implement ->freeze_fs to drain all it's
>>> active operations before returning. Problem solved.
>> 
>> No, this won't solve my problem. We're not doing a full suspend, rather
>> using a freezer cgroup to freeze a subset of processes. We don't want to
>> want to fully freeze the filesystem.
> 
> …and therein lies the rub. The whole cgroup freezer stuff assumes that you can safely deactivate a bunch of processes that may or may not hold state in the filesystem. That’s definitely not OK when you hold locks etc that can affect processes that lies outside the cgroup (and/or outside the NFS client itself).
> 

In case it wasn’t clear, I’m not just talking about VFS mutexes here. I’m also talking about all the other stuff, a lot of which the kernel has no control over, including POSIX file locking, share locks, leases/delegations, etc.

Trond


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 12:51     ` Jeff Layton
@ 2016-07-08 14:23       ` Michal Hocko
  2016-07-08 14:27         ` Jeff Layton
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-07-08 14:23 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
[...]
> > Apart from alternative Dave was mentioning in other email, what is the
> > point to use freezable wait from this path in the first place?
> > 
> > nfs4_handle_exception does nfs4_wait_clnt_recover from the same path and
> > that does wait_on_bit_action with TASK_KILLABLE so we are waiting in two
> > different modes from the same path AFAICS. There do not seem to be other
> > callers of nfs4_delay outside of nfs4_handle_exception. Sounds like
> > something is not quite right here to me. If the nfs4_delay did regular
> > wait then the freezing would fail as well but at least it would be clear
> > who is the culrprit rather than having an indirect dependency.
> 
> The codepaths involved there are a lot more complex than that
> unfortunately.
> 
> nfs4_delay is the function that we use to handle the case where the
> server returns NFS4ERR_DELAY. Basically telling us that it's too busy
> right now or has some transient error and the client should retry after
> a small, sliding delay.
> 
> That codepath could probably be made more freezer-safe. The typical
> case however, is that we've sent a call and just haven't gotten a
> reply. That's the trickier one to handle.

Why using a regular non-freezable wait would be a problem?
-- 
Michal Hocko
SUSE Labs


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 14:23       ` Michal Hocko
@ 2016-07-08 14:27         ` Jeff Layton
  2016-07-11  7:23           ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Layton @ 2016-07-08 14:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > 
> > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> [...]
> > 
> > > 
> > > Apart from alternative Dave was mentioning in other email, what
> > > is the
> > > point to use freezable wait from this path in the first place?
> > > 
> > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > path and
> > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > in two
> > > different modes from the same path AFAICS. There do not seem to
> > > be other
> > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > like
> > > something is not quite right here to me. If the nfs4_delay did
> > > regular
> > > wait then the freezing would fail as well but at least it would
> > > be clear
> > > who is the culrprit rather than having an indirect dependency.
> > The codepaths involved there are a lot more complex than that
> > unfortunately.
> > 
> > nfs4_delay is the function that we use to handle the case where the
> > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > busy
> > right now or has some transient error and the client should retry
> > after
> > a small, sliding delay.
> > 
> > That codepath could probably be made more freezer-safe. The typical
> > case however, is that we've sent a call and just haven't gotten a
> > reply. That's the trickier one to handle.
> Why using a regular non-freezable wait would be a problem?

It has been a while since I looked at that code, but IIRC, that could
block the freezer for up to 15s, which is a significant portion of the
20s that you get before the freezer gives up.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 13:05         ` Trond Myklebust
@ 2016-07-11  1:20           ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2016-07-11  1:20 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Seth Forshee, Jeff Layton, Schumaker Anna, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri, Jul 08, 2016 at 01:05:40PM +0000, Trond Myklebust wrote:
> > On Jul 8, 2016, at 08:55, Trond Myklebust
> > <trondmy@primarydata.com> wrote:
> >> On Jul 8, 2016, at 08:48, Seth Forshee
> >> <seth.forshee@canonical.com> wrote: On Fri, Jul 08, 2016 at
> >> 09:53:30AM +1000, Dave Chinner wrote:
> >>> On Wed, Jul 06, 2016 at 06:07:18PM -0400, Jeff Layton wrote:
> >>>> On Wed, 2016-07-06 at 12:46 -0500, Seth Forshee wrote:
> >>>>> We're seeing a hang when freezing a container with an nfs
> >>>>> bind mount while running iozone. Two iozone processes were
> >>>>> hung with this stack trace.
> >>>>> 
> >>>>> [] schedule+0x35/0x80
> >>>>> [] schedule_preempt_disabled+0xe/0x10
> >>>>> [] __mutex_lock_slowpath+0xb9/0x130
> >>>>> [] mutex_lock+0x1f/0x30
> >>>>> [] do_unlinkat+0x12b/0x2d0
> >>>>> [] SyS_unlink+0x16/0x20
> >>>>> [] entry_SYSCALL_64_fastpath+0x16/0x71
> >>>>> 
> >>>>> This seems to be due to another iozone thread frozen during
> >>>>> unlink with this stack trace:
> >>>>> 
> >>>>> [] __refrigerator+0x7a/0x140
> >>>>> [] nfs4_handle_exception+0x118/0x130 [nfsv4]
> >>>>> [] nfs4_proc_remove+0x7d/0xf0 [nfsv4]
> >>>>> [] nfs_unlink+0x149/0x350 [nfs]
> >>>>> [] vfs_unlink+0xf1/0x1a0
> >>>>> [] do_unlinkat+0x279/0x2d0
> >>>>> [] SyS_unlink+0x16/0x20
> >>>>> [] entry_SYSCALL_64_fastpath+0x16/0x71
> >>>>> 
> >>>>> Since nfs is allowing the thread to be frozen with the inode
> >>>>> locked it's preventing other threads trying to lock the same
> >>>>> inode from freezing. It seems like a bad idea for nfs to be
> >>>>> doing this.
> >>>>> 
> >>>> 
> >>>> Yeah, known problem. Not a simple one to fix though.
> >>> 
> >>> Actually, it is simple to fix.
> >>> 
> >>> <insert broken record about suspend should be using
> >>> freeze_super(), not sys_sync(), to suspend filesystem
> >>> operations>
> >>> 
> >>> i.e. the VFS blocks new operations from starting, and then
> >>> then the NFS client simply needs to implement ->freeze_fs to
> >>> drain all it's active operations before returning. Problem
> >>> solved.
> >> 
> >> No, this won't solve my problem. We're not doing a full
> >> suspend, rather using a freezer cgroup to freeze a subset of
> >> processes. We don't want to want to fully freeze the
> >> filesystem.
> > 
> > …and therein lies the rub. The whole cgroup freezer stuff
> > assumes that you can safely deactivate a bunch of processes that
> > may or may not hold state in the filesystem. That’s
> > definitely not OK when you hold locks etc that can affect
> > processes that lies outside the cgroup (and/or outside the NFS
> > client itself).

Not just locks, but even just reference counts are bad. E.g., just
being suspended with an active write reference to the superblock
will cause the next filesystem freeze to hang waiting for that
reference to drain. In essence, that's a filesystem-wide DoS vector
for anyone using snapshots...

> In case it wasn’t clear, I’m not just talking about VFS
> mutexes here. I’m also talking about all the other stuff, a
> lot of which the kernel has no control over, including POSIX file
> locking, share locks, leases/delegations, etc.

Yeah, freezer-based process-granularity suspend just seems like a bad
idea to me...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-08 14:27         ` Jeff Layton
@ 2016-07-11  7:23           ` Michal Hocko
  2016-07-11 11:03             ` Jeff Layton
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2016-07-11  7:23 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > > 
> > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > [...]
> > > 
> > > > 
> > > > Apart from alternative Dave was mentioning in other email, what
> > > > is the
> > > > point to use freezable wait from this path in the first place?
> > > > 
> > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > path and
> > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > in two
> > > > different modes from the same path AFAICS. There do not seem to
> > > > be other
> > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > like
> > > > something is not quite right here to me. If the nfs4_delay did
> > > > regular
> > > > wait then the freezing would fail as well but at least it would
> > > > be clear
> > > > who is the culrprit rather than having an indirect dependency.
> > > The codepaths involved there are a lot more complex than that
> > > unfortunately.
> > > 
> > > nfs4_delay is the function that we use to handle the case where the
> > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > busy
> > > right now or has some transient error and the client should retry
> > > after
> > > a small, sliding delay.
> > > 
> > > That codepath could probably be made more freezer-safe. The typical
> > > case however, is that we've sent a call and just haven't gotten a
> > > reply. That's the trickier one to handle.
> > Why using a regular non-freezable wait would be a problem?
> 
> It has been a while since I looked at that code, but IIRC, that could
> block the freezer for up to 15s, which is a significant portion of the
> 20s that you get before the freezer gives up.

But how does that differ from the situation when the freezer has to give
up on the timeout because another task fails due to a lock dependency?

As Trond and Dave have written in other emails, it is really dangerous
to freeze a task while it is holding locks and other resources.
-- 
Michal Hocko
SUSE Labs


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-11  7:23           ` Michal Hocko
@ 2016-07-11 11:03             ` Jeff Layton
  2016-07-11 11:43               ` Michal Hocko
  2016-07-11 12:50               ` Seth Forshee
  0 siblings, 2 replies; 19+ messages in thread
From: Jeff Layton @ 2016-07-11 11:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Mon, 2016-07-11 at 09:23 +0200, Michal Hocko wrote:
> On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> > On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > > > 
> > > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > > [...]
> > > > 
> > > > > 
> > > > > Apart from alternative Dave was mentioning in other email, what
> > > > > is the
> > > > > point to use freezable wait from this path in the first place?
> > > > > 
> > > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > > path and
> > > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > > in two
> > > > > different modes from the same path AFAICS. There do not seem to
> > > > > be other
> > > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > > like
> > > > > something is not quite right here to me. If the nfs4_delay did
> > > > > regular
> > > > > wait then the freezing would fail as well but at least it would
> > > > > be clear
> > > > > who is the culrprit rather than having an indirect dependency.
> > > > The codepaths involved there are a lot more complex than that
> > > > unfortunately.
> > > > 
> > > > nfs4_delay is the function that we use to handle the case where the
> > > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > > busy
> > > > right now or has some transient error and the client should retry
> > > > after
> > > > a small, sliding delay.
> > > > 
> > > > That codepath could probably be made more freezer-safe. The typical
> > > > case however, is that we've sent a call and just haven't gotten a
> > > > reply. That's the trickier one to handle.
> > > Why using a regular non-freezable wait would be a problem?
> > 
> > It has been a while since I looked at that code, but IIRC, that could
> > block the freezer for up to 15s, which is a significant portion of the
> > 20s that you get before the freezer gives up.
> 
> But how does that differ from the situation when the freezer has to give
> up on the timeout because another task fails due to lock dependency.
> 
> As Trond and Dave have written in other emails. It is really danngerous
> to freeze a task while it is holding locks and other resources.

It's not really dangerous if you're freezing every task on the host.
Sure, you're freezing with locks held, but everything else is freezing
too, so nothing will be contending for those locks.

I'm not at all opposed to changing how all of that works. My only
stipulation is that we not break the ability to reliably suspend a host
that is actively using an NFS mount. If you can come up with a way to
do that that also works for freezing cgroups, then I'm all for it.

-- 

Jeff Layton <jlayton@redhat.com>


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-11 11:03             ` Jeff Layton
@ 2016-07-11 11:43               ` Michal Hocko
  2016-07-11 12:50               ` Seth Forshee
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2016-07-11 11:43 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Seth Forshee, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Mon 11-07-16 07:03:31, Jeff Layton wrote:
> On Mon, 2016-07-11 at 09:23 +0200, Michal Hocko wrote:
> > On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> > > On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > > > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > > > > 
> > > > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > > > [...]
> > > > > 
> > > > > > 
> > > > > > Apart from alternative Dave was mentioning in other email, what
> > > > > > is the
> > > > > > point to use freezable wait from this path in the first place?
> > > > > > 
> > > > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > > > path and
> > > > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > > > in two
> > > > > > different modes from the same path AFAICS. There do not seem to
> > > > > > be other
> > > > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > > > like
> > > > > > something is not quite right here to me. If the nfs4_delay did
> > > > > > regular
> > > > > > wait then the freezing would fail as well but at least it would
> > > > > > be clear
> > > > > > who is the culrprit rather than having an indirect dependency.
> > > > > The codepaths involved there are a lot more complex than that
> > > > > unfortunately.
> > > > > 
> > > > > nfs4_delay is the function that we use to handle the case where the
> > > > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > > > busy
> > > > > right now or has some transient error and the client should retry
> > > > > after
> > > > > a small, sliding delay.
> > > > > 
> > > > > That codepath could probably be made more freezer-safe. The typical
> > > > > case however, is that we've sent a call and just haven't gotten a
> > > > > reply. That's the trickier one to handle.
> > > > Why using a regular non-freezable wait would be a problem?
> > > 
> > > It has been a while since I looked at that code, but IIRC, that could
> > > block the freezer for up to 15s, which is a significant portion of the
> > > 20s that you get before the freezer gives up.
> > 
> > But how does that differ from the situation when the freezer has to give
> > up on the timeout because another task fails due to lock dependency.
> > 
> > As Trond and Dave have written in other emails. It is really danngerous
> > to freeze a task while it is holding locks and other resources.
> 
> It's not really dangerous if you're freezing every task on the host.
> Sure, you're freezing with locks held, but everything else is freezing
> too, so nothing will be contending for those locks.

But the very same path is also used for the cgroup freezer, so you can
end up freezing a task while it holds locks that might block tasks from
unrelated cgroups, right?
 
> I'm not at all opposed to changing how all of that works. My only
> stipulation is that we not break the ability to reliably suspend a host
> that is actively using an NFS mount. If you can come up with a way to
> do that that also works for freezing cgroups, then I'm all for it.

My knowledge of NFS is too limited to help you out here, but I guess it
would be a good start to stop using the unsafe freezer APIs, or to use
them only when you are sure you cannot block any resources.

-- 
Michal Hocko
SUSE Labs


* Re: Hang due to nfs letting tasks freeze with locked inodes
  2016-07-11 11:03             ` Jeff Layton
  2016-07-11 11:43               ` Michal Hocko
@ 2016-07-11 12:50               ` Seth Forshee
  1 sibling, 0 replies; 19+ messages in thread
From: Seth Forshee @ 2016-07-11 12:50 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Michal Hocko, Trond Myklebust, Anna Schumaker, linux-fsdevel,
	linux-nfs, linux-kernel, Tycho Andersen

On Mon, Jul 11, 2016 at 07:03:31AM -0400, Jeff Layton wrote:
> On Mon, 2016-07-11 at 09:23 +0200, Michal Hocko wrote:
> > On Fri 08-07-16 10:27:38, Jeff Layton wrote:
> > > On Fri, 2016-07-08 at 16:23 +0200, Michal Hocko wrote:
> > > > On Fri 08-07-16 08:51:54, Jeff Layton wrote:
> > > > > 
> > > > > On Fri, 2016-07-08 at 14:22 +0200, Michal Hocko wrote:
> > > > [...]
> > > > > 
> > > > > > 
> > > > > > Apart from alternative Dave was mentioning in other email, what
> > > > > > is the
> > > > > > point to use freezable wait from this path in the first place?
> > > > > > 
> > > > > > nfs4_handle_exception does nfs4_wait_clnt_recover from the same
> > > > > > path and
> > > > > > that does wait_on_bit_action with TASK_KILLABLE so we are waiting
> > > > > > in two
> > > > > > different modes from the same path AFAICS. There do not seem to
> > > > > > be other
> > > > > > callers of nfs4_delay outside of nfs4_handle_exception. Sounds
> > > > > > like
> > > > > > something is not quite right here to me. If the nfs4_delay did
> > > > > > regular
> > > > > > wait then the freezing would fail as well but at least it would
> > > > > > be clear
> > > > > > who is the culrprit rather than having an indirect dependency.
> > > > > The codepaths involved there are a lot more complex than that
> > > > > unfortunately.
> > > > > 
> > > > > nfs4_delay is the function that we use to handle the case where the
> > > > > server returns NFS4ERR_DELAY. Basically telling us that it's too
> > > > > busy
> > > > > right now or has some transient error and the client should retry
> > > > > after
> > > > > a small, sliding delay.
> > > > > 
> > > > > That codepath could probably be made more freezer-safe. The typical
> > > > > case however, is that we've sent a call and just haven't gotten a
> > > > > reply. That's the trickier one to handle.
> > > > Why using a regular non-freezable wait would be a problem?
> > > 
> > > It has been a while since I looked at that code, but IIRC, that could
> > > block the freezer for up to 15s, which is a significant portion of the
> > > 20s that you get before the freezer gives up.
> > 
> > But how does that differ from the situation when the freezer has to give
> > up on the timeout because another task fails due to lock dependency.
> > 
> > As Trond and Dave have written in other emails. It is really danngerous
> > to freeze a task while it is holding locks and other resources.
> 
> It's not really dangerous if you're freezing every task on the host.
> Sure, you're freezing with locks held, but everything else is freezing
> too, so nothing will be contending for those locks.

Unless you have tasks either already waiting on those locks or that will
attempt to lock them before calling try_to_freeze. That happens to be
the case in this cgroup freezer hang too: all the tasks stuck waiting on
the i_mutex are being frozen.

> I'm not at all opposed to changing how all of that works. My only
> stipulation is that we not break the ability to reliably suspend a host
> that is actively using an NFS mount. If you can come up with a way to
> do that that also works for freezing cgroups, then I'm all for it.

The deadlock that we've seen should apply equally to suspend and to the
cgroup freezer. The only difference is that suspend will eventually time
out and abort the suspend whereas the cgroup freezer does not.

Seth

