All of lore.kernel.org
 help / color / mirror / Atom feed
* NFS Force Unmounting
@ 2017-10-25 17:11 Joshua Watt
  2017-10-30 20:20 ` J. Bruce Fields
  0 siblings, 1 reply; 36+ messages in thread
From: Joshua Watt @ 2017-10-25 17:11 UTC (permalink / raw)
  To: linux-nfs

Hello,

I'm working on a networking embedded system where NFS servers can come
and go from the network, and I've discovered that the Kernel NFS server
make it difficult to cleanup applications in a timely manner when the
server disappears (and yes, I am mounting with "soft" and relatively
short timeouts). I currently have a user space mechanism that can
quickly detect when the server disappears, and does a umount() with the
MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
to files on the defunct remote server, and I have traced through the
code to see that MNT_FORCE does indeed cancel any current RPC tasks
with -EIO. However, this isn't sufficient for my use case because if a
user space application isn't currently waiting on an RCP task that gets
canceled, it will have to timeout again before it detects the
disconnect. For example, if a simple client is copying a file from the
NFS server, and happens to not be waiting on the RPC task in the read()
call when umount() occurs, it will be none the wiser and loop around to
call read() again, which must then try the whole NFS timeout + recovery
before the failure is detected. If a client is more complex and has a
lot of open file descriptor, it will typical have to wait for each one
to timeout, leading to very long delays.

The (naive?) solution seems to be to add some flag in either the NFS
client or the RPC client that gets set in nfs_umount_begin(). This
would cause all subsequent operations to fail with an error code
instead of having to be queued as an RPC task and the and then timing
out. In our example client, the application would then get the -EIO
immediately on the next (and all subsequent) read() calls.

There does seem to be some precedence for doing this (especially with
network file systems), as both cifs (CifsExiting) and ceph
(CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
looking at the code. I haven't verified runtime behavior).

Are there any pitfalls I'm oversimplifying?

Thanks,
Joshua Watt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-25 17:11 NFS Force Unmounting Joshua Watt
@ 2017-10-30 20:20 ` J. Bruce Fields
  2017-10-30 21:04   ` Joshua Watt
  2017-10-30 21:09   ` NeilBrown
  0 siblings, 2 replies; 36+ messages in thread
From: J. Bruce Fields @ 2017-10-30 20:20 UTC (permalink / raw)
  To: Joshua Watt; +Cc: linux-nfs

On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
> I'm working on a networking embedded system where NFS servers can come
> and go from the network, and I've discovered that the Kernel NFS server

For "Kernel NFS server", I think you mean "Kernel NFS client".

> make it difficult to cleanup applications in a timely manner when the
> server disappears (and yes, I am mounting with "soft" and relatively
> short timeouts). I currently have a user space mechanism that can
> quickly detect when the server disappears, and does a umount() with the
> MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
> to files on the defunct remote server, and I have traced through the
> code to see that MNT_FORCE does indeed cancel any current RPC tasks
> with -EIO. However, this isn't sufficient for my use case because if a
> user space application isn't currently waiting on an RCP task that gets
> canceled, it will have to timeout again before it detects the
> disconnect. For example, if a simple client is copying a file from the
> NFS server, and happens to not be waiting on the RPC task in the read()
> call when umount() occurs, it will be none the wiser and loop around to
> call read() again, which must then try the whole NFS timeout + recovery
> before the failure is detected. If a client is more complex and has a
> lot of open file descriptor, it will typical have to wait for each one
> to timeout, leading to very long delays.
> 
> The (naive?) solution seems to be to add some flag in either the NFS
> client or the RPC client that gets set in nfs_umount_begin(). This
> would cause all subsequent operations to fail with an error code
> instead of having to be queued as an RPC task and the and then timing
> out. In our example client, the application would then get the -EIO
> immediately on the next (and all subsequent) read() calls.
> 
> There does seem to be some precedence for doing this (especially with
> network file systems), as both cifs (CifsExiting) and ceph
> (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
> looking at the code. I haven't verified runtime behavior).
> 
> Are there any pitfalls I'm oversimplifying?

I don't know.

In the hard case I don't think you'd want to do something like
this--applications expect mounts to be stay pinned while they're using
them, not to get -EIO.  In the soft case maybe an exception like this
makes sense.

--b.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-30 20:20 ` J. Bruce Fields
@ 2017-10-30 21:04   ` Joshua Watt
  2017-10-30 21:09   ` NeilBrown
  1 sibling, 0 replies; 36+ messages in thread
From: Joshua Watt @ 2017-10-30 21:04 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

On Mon, 2017-10-30 at 16:20 -0400, J. Bruce Fields wrote:
> On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
> > I'm working on a networking embedded system where NFS servers can
> > come
> > and go from the network, and I've discovered that the Kernel NFS
> > server
> 
> For "Kernel NFS server", I think you mean "Kernel NFS client".

Yes, sorry. I was digging through the code and saw "struct nfs_server",
which is really "The local client object that represents the remote
server", and it inadvertently crept into my e-mail.

> 
> > make it difficult to cleanup applications in a timely manner when
> > the
> > server disappears (and yes, I am mounting with "soft" and
> > relatively
> > short timeouts). I currently have a user space mechanism that can
> > quickly detect when the server disappears, and does a umount() with
> > the
> > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new
> > accesses
> > to files on the defunct remote server, and I have traced through
> > the
> > code to see that MNT_FORCE does indeed cancel any current RPC tasks
> > with -EIO. However, this isn't sufficient for my use case because
> > if a
> > user space application isn't currently waiting on an RCP task that
> > gets
> > canceled, it will have to timeout again before it detects the
> > disconnect. For example, if a simple client is copying a file from
> > the
> > NFS server, and happens to not be waiting on the RPC task in the
> > read()
> > call when umount() occurs, it will be none the wiser and loop
> > around to
> > call read() again, which must then try the whole NFS timeout +
> > recovery
> > before the failure is detected. If a client is more complex and has
> > a
> > lot of open file descriptor, it will typical have to wait for each
> > one
> > to timeout, leading to very long delays.
> > 
> > The (naive?) solution seems to be to add some flag in either the
> > NFS
> > client or the RPC client that gets set in nfs_umount_begin(). This
> > would cause all subsequent operations to fail with an error code
> > instead of having to be queued as an RPC task and the and then
> > timing
> > out. In our example client, the application would then get the -EIO
> > immediately on the next (and all subsequent) read() calls.
> > 
> > There does seem to be some precedence for doing this (especially
> > with
> > network file systems), as both cifs (CifsExiting) and ceph
> > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least
> > from
> > looking at the code. I haven't verified runtime behavior).
> > 
> > Are there any pitfalls I'm oversimplifying?
> 
> I don't know.
> 
> In the hard case I don't think you'd want to do something like
> this--applications expect mounts to be stay pinned while they're
> using
> them, not to get -EIO.  In the soft case maybe an exception like this
> makes sense.

Yes, I agree... maybe it should only do anything in the "soft" case (at
least, that works for my use case), however as a bit of a counter
argument, you *can* get -EIO even when mounted with "hard" because
nfs_umount_begin() still calls rpc_killall_tasks(), which will abort
any *pending* operations with -EIO, just not any future ones.

I also found that it causes a little havoc if you mount with the
"sharedcache" mount option because putting one mount into the force
unmounting state also causes all the other mounts to start returning
-EIO, as they are all sharing a superblock. Perhaps this is the correct
behavior anyway, but it certainly seems non-intuitive to a user that
force unmounting one directory would have such an effect on others so I
suspect not. I'm not sure of a decent way around this one (other than
mounting with "nosharedcache"). Is it even reasonable to detect if you
are force unmounting a superblock mounted in multiple locations and
then somehow split it up so that one mounted location can diverge in
behavior from the others (or some other mechanism I haven't thought
of)? Again however, the same counter argument as above still applies
here: if you force unmount one of the mounts that shares a superblock,
you can get -EIO on the others if they happen to have a pending
operation at the time.


> 
> --b.

Thanks,
Joshua Watt


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-30 20:20 ` J. Bruce Fields
  2017-10-30 21:04   ` Joshua Watt
@ 2017-10-30 21:09   ` NeilBrown
  2017-10-31 14:41     ` Jeff Layton
  2017-11-01 17:24     ` Jeff Layton
  1 sibling, 2 replies; 36+ messages in thread
From: NeilBrown @ 2017-10-30 21:09 UTC (permalink / raw)
  To: J. Bruce Fields, Joshua Watt; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 3629 bytes --]

On Mon, Oct 30 2017, J. Bruce Fields wrote:

> On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>> I'm working on a networking embedded system where NFS servers can come
>> and go from the network, and I've discovered that the Kernel NFS server
>
> For "Kernel NFS server", I think you mean "Kernel NFS client".
>
>> make it difficult to cleanup applications in a timely manner when the
>> server disappears (and yes, I am mounting with "soft" and relatively
>> short timeouts). I currently have a user space mechanism that can
>> quickly detect when the server disappears, and does a umount() with the
>> MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>> to files on the defunct remote server, and I have traced through the
>> code to see that MNT_FORCE does indeed cancel any current RPC tasks
>> with -EIO. However, this isn't sufficient for my use case because if a
>> user space application isn't currently waiting on an RCP task that gets
>> canceled, it will have to timeout again before it detects the
>> disconnect. For example, if a simple client is copying a file from the
>> NFS server, and happens to not be waiting on the RPC task in the read()
>> call when umount() occurs, it will be none the wiser and loop around to
>> call read() again, which must then try the whole NFS timeout + recovery
>> before the failure is detected. If a client is more complex and has a
>> lot of open file descriptor, it will typical have to wait for each one
>> to timeout, leading to very long delays.
>> 
>> The (naive?) solution seems to be to add some flag in either the NFS
>> client or the RPC client that gets set in nfs_umount_begin(). This
>> would cause all subsequent operations to fail with an error code
>> instead of having to be queued as an RPC task and the and then timing
>> out. In our example client, the application would then get the -EIO
>> immediately on the next (and all subsequent) read() calls.
>> 
>> There does seem to be some precedence for doing this (especially with
>> network file systems), as both cifs (CifsExiting) and ceph
>> (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>> looking at the code. I haven't verified runtime behavior).
>> 
>> Are there any pitfalls I'm oversimplifying?
>
> I don't know.
>
> In the hard case I don't think you'd want to do something like
> this--applications expect mounts to be stay pinned while they're using
> them, not to get -EIO.  In the soft case maybe an exception like this
> makes sense.

Applications also expect to get responses to read() requests, and expect
fsync() to complete, but if the servers has melted down, that isn't
going to happen.  Sometimes unexpected errors are better than unexpected
infinite delays.

I think we need a reliable way to unmount an NFS filesystem mounted from
a non-responsive server.  Maybe that just means fixing all the places
where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
would allow processes accessing the filesystem to be killed.  I don't
know if that would meet Joshua's needs.

Last time this came up, Trond didn't want to make MNT_FORCE too strong as
it only makes sense to be forceful on the final unmount, and we cannot
know if this is the "final" unmount (no other bind-mounts around) until
much later than ->umount_prepare.  Maybe umount is the wrong interface.
Maybe we should expose "struct nfs_client" (or maybe "struct
nfs_server") objects via sysfs so they can be marked "dead" (or similar)
meaning that all IO should fail.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-30 21:09   ` NeilBrown
@ 2017-10-31 14:41     ` Jeff Layton
  2017-10-31 14:55       ` Chuck Lever
  2017-11-01  0:53       ` NeilBrown
  2017-11-01 17:24     ` Jeff Layton
  1 sibling, 2 replies; 36+ messages in thread
From: Jeff Layton @ 2017-10-31 14:41 UTC (permalink / raw)
  To: NeilBrown, J. Bruce Fields, Joshua Watt; +Cc: linux-nfs

On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
> On Mon, Oct 30 2017, J. Bruce Fields wrote:
> 
> > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
> > > I'm working on a networking embedded system where NFS servers can come
> > > and go from the network, and I've discovered that the Kernel NFS server
> > 
> > For "Kernel NFS server", I think you mean "Kernel NFS client".
> > 
> > > make it difficult to cleanup applications in a timely manner when the
> > > server disappears (and yes, I am mounting with "soft" and relatively
> > > short timeouts). I currently have a user space mechanism that can
> > > quickly detect when the server disappears, and does a umount() with the
> > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
> > > to files on the defunct remote server, and I have traced through the
> > > code to see that MNT_FORCE does indeed cancel any current RPC tasks
> > > with -EIO. However, this isn't sufficient for my use case because if a
> > > user space application isn't currently waiting on an RCP task that gets
> > > canceled, it will have to timeout again before it detects the
> > > disconnect. For example, if a simple client is copying a file from the
> > > NFS server, and happens to not be waiting on the RPC task in the read()
> > > call when umount() occurs, it will be none the wiser and loop around to
> > > call read() again, which must then try the whole NFS timeout + recovery
> > > before the failure is detected. If a client is more complex and has a
> > > lot of open file descriptor, it will typical have to wait for each one
> > > to timeout, leading to very long delays.
> > > 
> > > The (naive?) solution seems to be to add some flag in either the NFS
> > > client or the RPC client that gets set in nfs_umount_begin(). This
> > > would cause all subsequent operations to fail with an error code
> > > instead of having to be queued as an RPC task and the and then timing
> > > out. In our example client, the application would then get the -EIO
> > > immediately on the next (and all subsequent) read() calls.
> > > 
> > > There does seem to be some precedence for doing this (especially with
> > > network file systems), as both cifs (CifsExiting) and ceph
> > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
> > > looking at the code. I haven't verified runtime behavior).
> > > 
> > > Are there any pitfalls I'm oversimplifying?
> > 
> > I don't know.
> > 
> > In the hard case I don't think you'd want to do something like
> > this--applications expect mounts to be stay pinned while they're using
> > them, not to get -EIO.  In the soft case maybe an exception like this
> > makes sense.
> 
> Applications also expect to get responses to read() requests, and expect
> fsync() to complete, but if the servers has melted down, that isn't
> going to happen.  Sometimes unexpected errors are better than unexpected
> infinite delays.
> 
> I think we need a reliable way to unmount an NFS filesystem mounted from
> a non-responsive server.  Maybe that just means fixing all the places
> where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
> would allow processes accessing the filesystem to be killed.  I don't
> know if that would meet Joshua's needs.
> 
> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
> it only makes sense to be forceful on the final unmount, and we cannot
> know if this is the "final" unmount (no other bind-mounts around) until
> much later than ->umount_prepare.  Maybe umount is the wrong interface.
> Maybe we should expose "struct nfs_client" (or maybe "struct
> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
> meaning that all IO should fail.
> 

I like this idea.

Note that we already have some per-rpc_xprt / per-rpc_clnt info in
debugfs sunrpc dir. We could make some writable files in there, to allow
you to kill off individual RPCs or maybe mark a whole clnt and/or xprt
dead in some fashion.

I don't really have a good feel for what this interface should look like
yet. debugfs is attractive here, as it's supposedly not part of the
kernel ABI guarantee. That allows us to do some experimentation in this
area, without making too big an initial commitment.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-31 14:41     ` Jeff Layton
@ 2017-10-31 14:55       ` Chuck Lever
  2017-10-31 17:04         ` Joshua Watt
  2017-11-01  0:53       ` NeilBrown
  1 sibling, 1 reply; 36+ messages in thread
From: Chuck Lever @ 2017-10-31 14:55 UTC (permalink / raw)
  To: Jeff Layton; +Cc: NeilBrown, Bruce Fields, Joshua Watt, Linux NFS Mailing List


> On Oct 31, 2017, at 10:41 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>> On Mon, Oct 30 2017, J. Bruce Fields wrote:
>> 
>>> On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>>>> I'm working on a networking embedded system where NFS servers can come
>>>> and go from the network, and I've discovered that the Kernel NFS server
>>> 
>>> For "Kernel NFS server", I think you mean "Kernel NFS client".
>>> 
>>>> make it difficult to cleanup applications in a timely manner when the
>>>> server disappears (and yes, I am mounting with "soft" and relatively
>>>> short timeouts). I currently have a user space mechanism that can
>>>> quickly detect when the server disappears, and does a umount() with the
>>>> MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>>>> to files on the defunct remote server, and I have traced through the
>>>> code to see that MNT_FORCE does indeed cancel any current RPC tasks
>>>> with -EIO. However, this isn't sufficient for my use case because if a
>>>> user space application isn't currently waiting on an RCP task that gets
>>>> canceled, it will have to timeout again before it detects the
>>>> disconnect. For example, if a simple client is copying a file from the
>>>> NFS server, and happens to not be waiting on the RPC task in the read()
>>>> call when umount() occurs, it will be none the wiser and loop around to
>>>> call read() again, which must then try the whole NFS timeout + recovery
>>>> before the failure is detected. If a client is more complex and has a
>>>> lot of open file descriptor, it will typical have to wait for each one
>>>> to timeout, leading to very long delays.
>>>> 
>>>> The (naive?) solution seems to be to add some flag in either the NFS
>>>> client or the RPC client that gets set in nfs_umount_begin(). This
>>>> would cause all subsequent operations to fail with an error code
>>>> instead of having to be queued as an RPC task and the and then timing
>>>> out. In our example client, the application would then get the -EIO
>>>> immediately on the next (and all subsequent) read() calls.
>>>> 
>>>> There does seem to be some precedence for doing this (especially with
>>>> network file systems), as both cifs (CifsExiting) and ceph
>>>> (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>>>> looking at the code. I haven't verified runtime behavior).
>>>> 
>>>> Are there any pitfalls I'm oversimplifying?
>>> 
>>> I don't know.
>>> 
>>> In the hard case I don't think you'd want to do something like
>>> this--applications expect mounts to be stay pinned while they're using
>>> them, not to get -EIO.  In the soft case maybe an exception like this
>>> makes sense.
>> 
>> Applications also expect to get responses to read() requests, and expect
>> fsync() to complete, but if the servers has melted down, that isn't
>> going to happen.  Sometimes unexpected errors are better than unexpected
>> infinite delays.
>> 
>> I think we need a reliable way to unmount an NFS filesystem mounted from
>> a non-responsive server.  Maybe that just means fixing all the places
>> where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
>> would allow processes accessing the filesystem to be killed.  I don't
>> know if that would meet Joshua's needs.
>> 
>> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
>> it only makes sense to be forceful on the final unmount, and we cannot
>> know if this is the "final" unmount (no other bind-mounts around) until
>> much later than ->umount_prepare.  Maybe umount is the wrong interface.
>> Maybe we should expose "struct nfs_client" (or maybe "struct
>> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
>> meaning that all IO should fail.
>> 
> 
> I like this idea.
> 
> Note that we already have some per-rpc_xprt / per-rpc_clnt info in
> debugfs sunrpc dir. We could make some writable files in there, to allow
> you to kill off individual RPCs or maybe mark a whole clnt and/or xprt
> dead in some fashion.

Listing individual RPCs seems like overkill. It would be straightforward
to identify these transports by the IP addresses of the remotes, and just
mark the specific transports for a dead server. Maybe the --force option
of umount could do this.

The RPC client can then walk through the dead transport's list of RPC
tasks and terminate all of them.

That makes it easy to find these things. But it doesn't take care of
killing processes that are not interruptible, though.


> I don't really have a good feel for what this interface should look like
> yet. debugfs is attractive here, as it's supposedly not part of the
> kernel ABI guarantee. That allows us to do some experimentation in this
> area, without making too big an initial commitment.
> -- 
> Jeff Layton <jlayton@kernel.org>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-31 14:55       ` Chuck Lever
@ 2017-10-31 17:04         ` Joshua Watt
  2017-10-31 19:46           ` Chuck Lever
  0 siblings, 1 reply; 36+ messages in thread
From: Joshua Watt @ 2017-10-31 17:04 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: NeilBrown, Bruce Fields, Linux NFS Mailing List

On Tue, 2017-10-31 at 10:55 -0400, Chuck Lever wrote:
> > On Oct 31, 2017, at 10:41 AM, Jeff Layton <jlayton@kernel.org>
> > wrote:
> > 
> > On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
> > > On Mon, Oct 30 2017, J. Bruce Fields wrote:
> > > 
> > > > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
> > > > > I'm working on a networking embedded system where NFS servers
> > > > > can come
> > > > > and go from the network, and I've discovered that the Kernel
> > > > > NFS server
> > > > 
> > > > For "Kernel NFS server", I think you mean "Kernel NFS client".
> > > > 
> > > > > make it difficult to cleanup applications in a timely manner
> > > > > when the
> > > > > server disappears (and yes, I am mounting with "soft" and
> > > > > relatively
> > > > > short timeouts). I currently have a user space mechanism that
> > > > > can
> > > > > quickly detect when the server disappears, and does a
> > > > > umount() with the
> > > > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new
> > > > > accesses
> > > > > to files on the defunct remote server, and I have traced
> > > > > through the
> > > > > code to see that MNT_FORCE does indeed cancel any current RPC
> > > > > tasks
> > > > > with -EIO. However, this isn't sufficient for my use case
> > > > > because if a
> > > > > user space application isn't currently waiting on an RCP task
> > > > > that gets
> > > > > canceled, it will have to timeout again before it detects the
> > > > > disconnect. For example, if a simple client is copying a file
> > > > > from the
> > > > > NFS server, and happens to not be waiting on the RPC task in
> > > > > the read()
> > > > > call when umount() occurs, it will be none the wiser and loop
> > > > > around to
> > > > > call read() again, which must then try the whole NFS timeout
> > > > > + recovery
> > > > > before the failure is detected. If a client is more complex
> > > > > and has a
> > > > > lot of open file descriptor, it will typical have to wait for
> > > > > each one
> > > > > to timeout, leading to very long delays.
> > > > > 
> > > > > The (naive?) solution seems to be to add some flag in either
> > > > > the NFS
> > > > > client or the RPC client that gets set in nfs_umount_begin().
> > > > > This
> > > > > would cause all subsequent operations to fail with an error
> > > > > code
> > > > > instead of having to be queued as an RPC task and the and
> > > > > then timing
> > > > > out. In our example client, the application would then get
> > > > > the -EIO
> > > > > immediately on the next (and all subsequent) read() calls.
> > > > > 
> > > > > There does seem to be some precedence for doing this
> > > > > (especially with
> > > > > network file systems), as both cifs (CifsExiting) and ceph
> > > > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at
> > > > > least from
> > > > > looking at the code. I haven't verified runtime behavior).
> > > > > 
> > > > > Are there any pitfalls I'm oversimplifying?
> > > > 
> > > > I don't know.
> > > > 
> > > > In the hard case I don't think you'd want to do something like
> > > > this--applications expect mounts to be stay pinned while
> > > > they're using
> > > > them, not to get -EIO.  In the soft case maybe an exception
> > > > like this
> > > > makes sense.
> > > 
> > > Applications also expect to get responses to read() requests, and
> > > expect
> > > fsync() to complete, but if the servers has melted down, that
> > > isn't
> > > going to happen.  Sometimes unexpected errors are better than
> > > unexpected
> > > infinite delays.
> > > 
> > > I think we need a reliable way to unmount an NFS filesystem
> > > mounted from
> > > a non-responsive server.  Maybe that just means fixing all the
> > > places
> > > where use we use TASK_UNINTERRUTIBLE when waiting for the
> > > server.  That
> > > would allow processes accessing the filesystem to be killed.  I
> > > don't
> > > know if that would meet Joshua's needs.

No, it doesn't. I unfortunately cannot just kill off my processes that
are accessing NFS when it dies :)

> > > 
> > > Last time this came up, Trond didn't want to make MNT_FORCE too
> > > strong as
> > > it only makes sense to be forceful on the final unmount, and we
> > > cannot
> > > know if this is the "final" unmount (no other bind-mounts around)
> > > until
> > > much later than ->umount_prepare.  Maybe umount is the wrong
> > > interface.
> > > Maybe we should expose "struct nfs_client" (or maybe "struct
> > > nfs_server") objects via sysfs so they can be marked "dead" (or
> > > similar)
> > > meaning that all IO should fail.
> > > 
> > 
> > I like this idea.
> > 
> > Note that we already have some per-rpc_xprt / per-rpc_clnt info in
> > debugfs sunrpc dir. We could make some writable files in there, to
> > allow
> > you to kill off individual RPCs or maybe mark a whole clnt and/or
> > xprt
> > dead in some fashion.
> 
> Listing individual RPCs seems like overkill. It would be
> straightforward
> to identify these transports by the IP addresses of the remotes, and
> just
> mark the specific transports for a dead server. Maybe the --force
> option
> of umount could do this.

If we simply want --force to do it, the patch is pretty simple (I will
push it up as an RFC), and there is no need for any debugfs support,
although that leads back what --force should be doing. IMHO, --force is
a little strange in its current implementation because while it aborts
all the current pending RPC calls, you could *just* miss an RPC call
that comes in later, which could then block for long periods of time
(and prevent unmounting). Perhaps some light on the actual intended use
of --force would be useful? It seems to me like its more of an
indication that the user wants to abort any pending operations than an
indication that they *really* want to unmount the file system (but
again, I've noted this seems to vary between the different filesystems
than handle --force).

If we are going to debugfs support, it seems like the simplest thing is
for --force to continue to do whatever it does today unchanged, and the
new debugfs API is the sole mechanism for killing off a server.

Were you trying to say that there should be some sequence of --force
and debugfs operations to kill off a dead server? If so, I'm not quite
sure I follow (sorry).

> 
> The RPC client can then walk through the dead transport's list of RPC
> tasks and terminate all of them.
> 
> That makes it easy to find these things. But it doesn't take care of
> killing processes that are not interruptible, though.

IIUC, if a process is waiting on an RPC, and that RPC gets rpc_exit(-
EIO) called on it, the call will return with -EIO regardless of the
interruptable status of the process? Am I misreading the code? Maybe
better put is the question: Is it acceptable for a non-interruptable
task to get -EIO from a syscall? I *think* the answer is yes (but I
haven't been doing this very long).

> 
> 
> > I don't really have a good feel for what this interface should look
> > like
> > yet. debugfs is attractive here, as it's supposedly not part of the
> > kernel ABI guarantee. That allows us to do some experimentation in
> > this
> > area, without making too big an initial commitment.
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-
> > nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Chuck Lever
> 
> 
> 

Thanks,
Joshua Watt


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-31 17:04         ` Joshua Watt
@ 2017-10-31 19:46           ` Chuck Lever
  0 siblings, 0 replies; 36+ messages in thread
From: Chuck Lever @ 2017-10-31 19:46 UTC (permalink / raw)
  To: Joshua Watt; +Cc: Jeff Layton, NeilBrown, Bruce Fields, Linux NFS Mailing List


> On Oct 31, 2017, at 1:04 PM, Joshua Watt <jpewhacker@gmail.com> wrote:
> 
> On Tue, 2017-10-31 at 10:55 -0400, Chuck Lever wrote:
>>> On Oct 31, 2017, at 10:41 AM, Jeff Layton <jlayton@kernel.org>
>>> wrote:
>>> 
>>> On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>>>> On Mon, Oct 30 2017, J. Bruce Fields wrote:
>>>> 
>>>>> On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>>>>>> I'm working on a networking embedded system where NFS servers
>>>>>> can come
>>>>>> and go from the network, and I've discovered that the Kernel
>>>>>> NFS server
>>>>> 
>>>>> For "Kernel NFS server", I think you mean "Kernel NFS client".
>>>>> 
>>>>>> make it difficult to cleanup applications in a timely manner
>>>>>> when the
>>>>>> server disappears (and yes, I am mounting with "soft" and
>>>>>> relatively
>>>>>> short timeouts). I currently have a user space mechanism that
>>>>>> can
>>>>>> quickly detect when the server disappears, and does a
>>>>>> umount() with the
>>>>>> MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new
>>>>>> accesses
>>>>>> to files on the defunct remote server, and I have traced
>>>>>> through the
>>>>>> code to see that MNT_FORCE does indeed cancel any current RPC
>>>>>> tasks
>>>>>> with -EIO. However, this isn't sufficient for my use case
>>>>>> because if a
>>>>>> user space application isn't currently waiting on an RCP task
>>>>>> that gets
>>>>>> canceled, it will have to timeout again before it detects the
>>>>>> disconnect. For example, if a simple client is copying a file
>>>>>> from the
>>>>>> NFS server, and happens to not be waiting on the RPC task in
>>>>>> the read()
>>>>>> call when umount() occurs, it will be none the wiser and loop
>>>>>> around to
>>>>>> call read() again, which must then try the whole NFS timeout
>>>>>> + recovery
>>>>>> before the failure is detected. If a client is more complex
>>>>>> and has a
>>>>>> lot of open file descriptor, it will typical have to wait for
>>>>>> each one
>>>>>> to timeout, leading to very long delays.
>>>>>> 
>>>>>> The (naive?) solution seems to be to add some flag in either
>>>>>> the NFS
>>>>>> client or the RPC client that gets set in nfs_umount_begin().
>>>>>> This
>>>>>> would cause all subsequent operations to fail with an error
>>>>>> code
>>>>>> instead of having to be queued as an RPC task and the and
>>>>>> then timing
>>>>>> out. In our example client, the application would then get
>>>>>> the -EIO
>>>>>> immediately on the next (and all subsequent) read() calls.
>>>>>> 
>>>>>> There does seem to be some precedence for doing this
>>>>>> (especially with
>>>>>> network file systems), as both cifs (CifsExiting) and ceph
>>>>>> (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at
>>>>>> least from
>>>>>> looking at the code. I haven't verified runtime behavior).
>>>>>> 
>>>>>> Are there any pitfalls I'm oversimplifying?
>>>>> 
>>>>> I don't know.
>>>>> 
>>>>> In the hard case I don't think you'd want to do something like
>>>>> this--applications expect mounts to be stay pinned while
>>>>> they're using
>>>>> them, not to get -EIO.  In the soft case maybe an exception
>>>>> like this
>>>>> makes sense.
>>>> 
>>>> Applications also expect to get responses to read() requests, and
>>>> expect
>>>> fsync() to complete, but if the servers has melted down, that
>>>> isn't
>>>> going to happen.  Sometimes unexpected errors are better than
>>>> unexpected
>>>> infinite delays.
>>>> 
>>>> I think we need a reliable way to unmount an NFS filesystem
>>>> mounted from
>>>> a non-responsive server.  Maybe that just means fixing all the
>>>> places
>>>> where use we use TASK_UNINTERRUTIBLE when waiting for the
>>>> server.  That
>>>> would allow processes accessing the filesystem to be killed.  I
>>>> don't
>>>> know if that would meet Joshua's needs.
> 
> No, it doesn't. I unfortunately cannot just kill off my processes that
> are accessing NFS when it dies :)
> 
>>>> 
>>>> Last time this came up, Trond didn't want to make MNT_FORCE too
>>>> strong as
>>>> it only makes sense to be forceful on the final unmount, and we
>>>> cannot
>>>> know if this is the "final" unmount (no other bind-mounts around)
>>>> until
>>>> much later than ->umount_prepare.  Maybe umount is the wrong
>>>> interface.
>>>> Maybe we should expose "struct nfs_client" (or maybe "struct
>>>> nfs_server") objects via sysfs so they can be marked "dead" (or
>>>> similar)
>>>> meaning that all IO should fail.
>>>> 
>>> 
>>> I like this idea.
>>> 
>>> Note that we already have some per-rpc_xprt / per-rpc_clnt info in
>>> debugfs sunrpc dir. We could make some writable files in there, to
>>> allow
>>> you to kill off individual RPCs or maybe mark a whole clnt and/or
>>> xprt
>>> dead in some fashion.
>> 
>> Listing individual RPCs seems like overkill. It would be
>> straightforward
>> to identify these transports by the IP addresses of the remotes, and
>> just
>> mark the specific transports for a dead server. Maybe the --force
>> option
>> of umount could do this.
> 
> If we simply want --force to do it, the patch is pretty simple (I will
> push it up as an RFC), and there is no need for any debugfs support,
> although that leads back what --force should be doing. IMHO, --force is
> a little strange in its current implementation because while it aborts
> all the current pending RPC calls, you could *just* miss an RPC call
> that comes in later, which could then block for long periods of time
> (and prevent unmounting). Perhaps some light on the actual intended use
> of --force would be useful? It seems to me like its more of an
> indication that the user wants to abort any pending operations than an
> indication that they *really* want to unmount the file system (but
> again, I've noted this seems to vary between the different filesystems
> than handle --force).
> 
> If we are going to debugfs support, it seems like the simplest thing is
> for --force to continue to do whatever it does today unchanged, and the
> new debugfs API is the sole mechanism for killing off a server.
> 
> Were you trying to say that there should be some sequence of --force
> and debugfs operations to kill off a dead server? If so, I'm not quite
> sure I follow (sorry).

I was following Jeff's suggestion of building a debugfs API
to find and mark RPCs we'd like to terminate. Then "umount
--force" could use this API to get rid of stuck RPCs so that
umount can complete cleanly.

We have a fine needle to thread here. We want to allow umount
to work, but do not want to endanger RPCs that are waiting
legitimately for long-running work to complete. Perhaps we
should keep these administrative steps separated.


>> The RPC client can then walk through the dead transport's list of RPC
>> tasks and terminate all of them.
>> 
>> That makes it easy to find these things. But it doesn't take care of
>> killing processes that are not interruptible, though.
> 
> IIUC, if a process is waiting on an RPC, and that RPC gets rpc_exit(-
> EIO) called on it, the call will return with -EIO regardless of the
> interruptable status of the process? Am I misreading the code? Maybe
> better put is the question: Is it acceptable for a non-interruptable
> task to get -EIO from a syscall? I *think* the answer is yes (but I
> haven't been doing this very long).

I don't think rpc_exit_task(task, -EIO) will be sufficient.
READ and WRITE are done by a background process. So -EIO
awakens the waiting user space process but the background
process is still waiting for an RPC reply, and will continue
to pin the transport and the mount.

In addition there could be some NFS-level waiters that might
not be visible to the RPC layer.

I wondered if there might be a way of feeding a faux reply
into the transport for each of the waiting dead RPCs (and any
newly submitted ones). I bet often these will be waiting for
a transport connection event; set a "kill me" bit in each
sleeping doomed RPC, pretend to reconnect, and xprt_transmit
could recognize the bit during retransmit and cause the RPCs
to terminate.

It might also be helpful to decorate the RPC procs with some
indication of whether an RPC is safe to kill. Any RPC that
might alter cached data/metadata is not, but others would be
safe. Maybe this is already the case, but I thought generally
SYNC RPCs are killable and ASYNC ones are not.

</random crazy ideas>


>>> I don't really have a good feel for what this interface should look
>>> like
>>> yet. debugfs is attractive here, as it's supposedly not part of the
>>> kernel ABI guarantee. That allows us to do some experimentation in
>>> this
>>> area, without making too big an initial commitment.
>>> -- 
>>> Jeff Layton <jlayton@kernel.org>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-
>>> nfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
> 
> Thanks,
> Joshua Watt

--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-31 14:41     ` Jeff Layton
  2017-10-31 14:55       ` Chuck Lever
@ 2017-11-01  0:53       ` NeilBrown
  2017-11-01  2:22         ` Chuck Lever
  1 sibling, 1 reply; 36+ messages in thread
From: NeilBrown @ 2017-11-01  0:53 UTC (permalink / raw)
  To: Jeff Layton, J. Bruce Fields, Joshua Watt; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 6917 bytes --]

On Tue, Oct 31 2017, Jeff Layton wrote:

> On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>> On Mon, Oct 30 2017, J. Bruce Fields wrote:
>> 
>> > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>> > > I'm working on a networking embedded system where NFS servers can come
>> > > and go from the network, and I've discovered that the Kernel NFS server
>> > 
>> > For "Kernel NFS server", I think you mean "Kernel NFS client".
>> > 
>> > > make it difficult to cleanup applications in a timely manner when the
>> > > server disappears (and yes, I am mounting with "soft" and relatively
>> > > short timeouts). I currently have a user space mechanism that can
>> > > quickly detect when the server disappears, and does a umount() with the
>> > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>> > > to files on the defunct remote server, and I have traced through the
>> > > code to see that MNT_FORCE does indeed cancel any current RPC tasks
>> > > with -EIO. However, this isn't sufficient for my use case because if a
>> > > user space application isn't currently waiting on an RCP task that gets
>> > > canceled, it will have to timeout again before it detects the
>> > > disconnect. For example, if a simple client is copying a file from the
>> > > NFS server, and happens to not be waiting on the RPC task in the read()
>> > > call when umount() occurs, it will be none the wiser and loop around to
>> > > call read() again, which must then try the whole NFS timeout + recovery
>> > > before the failure is detected. If a client is more complex and has a
>> > > lot of open file descriptor, it will typical have to wait for each one
>> > > to timeout, leading to very long delays.
>> > > 
>> > > The (naive?) solution seems to be to add some flag in either the NFS
>> > > client or the RPC client that gets set in nfs_umount_begin(). This
>> > > would cause all subsequent operations to fail with an error code
>> > > instead of having to be queued as an RPC task and the and then timing
>> > > out. In our example client, the application would then get the -EIO
>> > > immediately on the next (and all subsequent) read() calls.
>> > > 
>> > > There does seem to be some precedence for doing this (especially with
>> > > network file systems), as both cifs (CifsExiting) and ceph
>> > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>> > > looking at the code. I haven't verified runtime behavior).
>> > > 
>> > > Are there any pitfalls I'm oversimplifying?
>> > 
>> > I don't know.
>> > 
>> > In the hard case I don't think you'd want to do something like
>> > this--applications expect mounts to be stay pinned while they're using
>> > them, not to get -EIO.  In the soft case maybe an exception like this
>> > makes sense.
>> 
>> Applications also expect to get responses to read() requests, and expect
>> fsync() to complete, but if the servers has melted down, that isn't
>> going to happen.  Sometimes unexpected errors are better than unexpected
>> infinite delays.
>> 
>> I think we need a reliable way to unmount an NFS filesystem mounted from
>> a non-responsive server.  Maybe that just means fixing all the places
>> where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
>> would allow processes accessing the filesystem to be killed.  I don't
>> know if that would meet Joshua's needs.
>> 
>> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
>> it only makes sense to be forceful on the final unmount, and we cannot
>> know if this is the "final" unmount (no other bind-mounts around) until
>> much later than ->umount_prepare.  Maybe umount is the wrong interface.
>> Maybe we should expose "struct nfs_client" (or maybe "struct
>> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
>> meaning that all IO should fail.
>> 
>
> I like this idea.
>
> Note that we already have some per-rpc_xprt / per-rpc_clnt info in
> debugfs sunrpc dir. We could make some writable files in there, to allow
> you to kill off individual RPCs or maybe mark a whole clnt and/or xprt
> dead in some fashion.
>
> I don't really have a good feel for what this interface should look like
> yet. debugfs is attractive here, as it's supposedly not part of the
> kernel ABI guarantee. That allows us to do some experimentation in this
> area, without making too big an initial commitment.

debugfs might be attractive to kernel developers: "all care but not
responsibility", but not so much to application developers (though I do
realize that your approch was "something to experiment with" so maybe
that doesn't matter).

My particular focus is to make systemd shutdown completely reliable.  It
should not block indefinitely on any condition, including inaccessible
servers and broken networks.

In stark contrast to Chuck's suggestion that


   Any RPC that might alter cached data/metadata is not, but others
   would be safe.

("safe" here meaning "safe to kill the RPC"), I think that everything
can and should be killed.  Maybe the first step is to purge any dirty
pages from the cache.
- the server is up, we write the data
- if we are happy to wait, we wait
- otherwise (the case I'm interested in), we just destroy anything
  that gets in the way of unmounting the filesystem.

I'd also like to make the interface completely generic.  I'd rather
systemd didn't need to know any specific details about nfs (it already
does to some extend - it knows it is a "remote" filesystem) but
I'd rather not require more.

Maybe I could just sweep the problem under the carpet and use lazy
unmounts.  That hides some of the problem, but doesn't stop sync(2) from
blocking indefinitely.  And once you have done the lazy unmount, there
is no longer any opportunity to use MNT_FORCE.

Another way to think about this is to consider the bdi rather than the
mount point.  If the NFS server is never coming back, then the "backing
device" is broken.  If /sys/class/bdi/* contained suitable information
to identify the right backing device, and had some way to "terminate
with extreme prejudice", then and admin process (like systemd or
anything else) could choose to terminate a bdi that was not working
properly.

We would need quite a bit of integration so that this "terminate"
command would take effect, cause various syscalls to return EIO, purge
dirty memory, avoid stalling sync().  But it hopefully it would be
a well defined interface and a good starting point.

If the bdi provided more information and more control, it would be a lot
safer to use lazy unmounts, as we could then work with the filesystem
even after it had been unmounted.

Maybe I'll trying playing with bdis in my spare time (if I ever find out
what "spare time" is).

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-01  0:53       ` NeilBrown
@ 2017-11-01  2:22         ` Chuck Lever
  2017-11-01 14:38           ` Joshua Watt
  2017-11-02  0:15           ` NeilBrown
  0 siblings, 2 replies; 36+ messages in thread
From: Chuck Lever @ 2017-11-01  2:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jeff Layton, Bruce Fields, Joshua Watt, Linux NFS Mailing List


> On Oct 31, 2017, at 8:53 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Tue, Oct 31 2017, Jeff Layton wrote:
> 
>> On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>>> On Mon, Oct 30 2017, J. Bruce Fields wrote:
>>> 
>>>> On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>>>>> I'm working on a networking embedded system where NFS servers can come
>>>>> and go from the network, and I've discovered that the Kernel NFS server
>>>> 
>>>> For "Kernel NFS server", I think you mean "Kernel NFS client".
>>>> 
>>>>> make it difficult to cleanup applications in a timely manner when the
>>>>> server disappears (and yes, I am mounting with "soft" and relatively
>>>>> short timeouts). I currently have a user space mechanism that can
>>>>> quickly detect when the server disappears, and does a umount() with the
>>>>> MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>>>>> to files on the defunct remote server, and I have traced through the
>>>>> code to see that MNT_FORCE does indeed cancel any current RPC tasks
>>>>> with -EIO. However, this isn't sufficient for my use case because if a
>>>>> user space application isn't currently waiting on an RCP task that gets
>>>>> canceled, it will have to timeout again before it detects the
>>>>> disconnect. For example, if a simple client is copying a file from the
>>>>> NFS server, and happens to not be waiting on the RPC task in the read()
>>>>> call when umount() occurs, it will be none the wiser and loop around to
>>>>> call read() again, which must then try the whole NFS timeout + recovery
>>>>> before the failure is detected. If a client is more complex and has a
>>>>> lot of open file descriptor, it will typical have to wait for each one
>>>>> to timeout, leading to very long delays.
>>>>> 
>>>>> The (naive?) solution seems to be to add some flag in either the NFS
>>>>> client or the RPC client that gets set in nfs_umount_begin(). This
>>>>> would cause all subsequent operations to fail with an error code
>>>>> instead of having to be queued as an RPC task and the and then timing
>>>>> out. In our example client, the application would then get the -EIO
>>>>> immediately on the next (and all subsequent) read() calls.
>>>>> 
>>>>> There does seem to be some precedence for doing this (especially with
>>>>> network file systems), as both cifs (CifsExiting) and ceph
>>>>> (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>>>>> looking at the code. I haven't verified runtime behavior).
>>>>> 
>>>>> Are there any pitfalls I'm oversimplifying?
>>>> 
>>>> I don't know.
>>>> 
>>>> In the hard case I don't think you'd want to do something like
>>>> this--applications expect mounts to be stay pinned while they're using
>>>> them, not to get -EIO.  In the soft case maybe an exception like this
>>>> makes sense.
>>> 
>>> Applications also expect to get responses to read() requests, and expect
>>> fsync() to complete, but if the servers has melted down, that isn't
>>> going to happen.  Sometimes unexpected errors are better than unexpected
>>> infinite delays.
>>> 
>>> I think we need a reliable way to unmount an NFS filesystem mounted from
>>> a non-responsive server.  Maybe that just means fixing all the places
>>> where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
>>> would allow processes accessing the filesystem to be killed.  I don't
>>> know if that would meet Joshua's needs.
>>> 
>>> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
>>> it only makes sense to be forceful on the final unmount, and we cannot
>>> know if this is the "final" unmount (no other bind-mounts around) until
>>> much later than ->umount_prepare.  Maybe umount is the wrong interface.
>>> Maybe we should expose "struct nfs_client" (or maybe "struct
>>> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
>>> meaning that all IO should fail.
>>> 
>> 
>> I like this idea.
>> 
>> Note that we already have some per-rpc_xprt / per-rpc_clnt info in
>> debugfs sunrpc dir. We could make some writable files in there, to allow
>> you to kill off individual RPCs or maybe mark a whole clnt and/or xprt
>> dead in some fashion.
>> 
>> I don't really have a good feel for what this interface should look like
>> yet. debugfs is attractive here, as it's supposedly not part of the
>> kernel ABI guarantee. That allows us to do some experimentation in this
>> area, without making too big an initial commitment.
> 
> debugfs might be attractive to kernel developers: "all care but not
> responsibility", but not so much to application developers (though I do
> realize that your approch was "something to experiment with" so maybe
> that doesn't matter).

I read Jeff's suggestion as "start in debugfs and move to /sys
with the other long-term administrative interfaces". <shrug>


> My particular focus is to make systemd shutdown completely reliable.

In particular: umount.nfs has to be reliable, and/or stuck NFS
mounts have to avoid impacting other aspects of system
operation (like syncing other filesystems).


> It should not block indefinitely on any condition, including inaccessible
> servers and broken networks.

There are occasions where a "hard" semantic is appropriate,
and killing everything unconditionally can result in unexpected
and unwanted consequences. I would strongly object to any
approach that includes automatically discarding data without
user/administrator choice in the matter. One size does not fit
all here.

At least let's stop and think about consequences. I'm sure I
don't understand all of them yet.


> In stark contrast to Chuck's suggestion that
> 
> 
>   Any RPC that might alter cached data/metadata is not, but others
>   would be safe.
> 
> ("safe" here meaning "safe to kill the RPC"), I think that everything
> can and should be killed.  Maybe the first step is to purge any dirty
> pages from the cache.
> - the server is up, we write the data
> - if we are happy to wait, we wait
> - otherwise (the case I'm interested in), we just destroy anything
>  that gets in the way of unmounting the filesystem.

(The technical issue still to be resolved is how to kill
processes that are sleeping uninterruptibly, but for the
moment let's assume we can do that, and think about the
policy questions).

How do you decide the server is "up" without possibly hanging
or human intervention? Certainly there is a point where we
want to scorch the earth, but I think the user and administrator
get to decide where that is. systemd is not the place for that
decision.

Actually, I think that dividing the RPC world into "safe to
kill" and "need to ask first" is roughly the same as the
steps you outline above: at some point we both eventually get
to kill_all. The question is how to get there allowing as much
automation as possible and without compromising data integrity.

There are RPCs that are always safe to kill, eg NFS READ and
GETATTR. If those are the only waiters, then they can always be
terminated on umount. That is already a good start at making
NFSv3 umount more reliable, since often the only thing it is
waiting for is a single GETATTR.

Question marks arise when we are dealing with WRITE, SETATTR,
and so on, when we know the server is still there somewhere,
and even when we're not sure whether the server is still
available. It is here where we assess our satisfaction for
waiting out a network partition or server restart.

There's no way to automate that safely, unless you state in
advance that "your data is not protected during an umount."
There are some environments where that is OK, and some where
it is absolute disaster. Again, the user and administrator get
to make that choice, IMO. Maybe a new mount option?

Data integrity is critical up until some point where it isn't.
A container is no longer needed, all the data it cared about
can be discarded. I think that line needs to be drawn while
the system is still operational. It's not akin to umount at
all, since it involves possible destruction of data. It's more
like TRIM.


> I'd also like to make the interface completely generic.  I'd rather
> systemd didn't need to know any specific details about nfs (it already
> does to some extend - it knows it is a "remote" filesystem) but
> I'd rather not require more.

> Maybe I could just sweep the problem under the carpet and use lazy
> unmounts.  That hides some of the problem, but doesn't stop sync(2) from
> blocking indefinitely.  And once you have done the lazy unmount, there
> is no longer any opportunity to use MNT_FORCE.

IMO a partial answer could be data caching in local files. If
the client can't flush, then it can preserve the files until
after the umount and reboot (using, say, fscache). Multi-client
sharing is still hazardous, but that isn't a very frequent use
case.


> Another way to think about this is to consider the bdi rather than the
> mount point.  If the NFS server is never coming back, then the "backing
> device" is broken.  If /sys/class/bdi/* contained suitable information
> to identify the right backing device, and had some way to "terminate
> with extreme prejudice", then and admin process (like systemd or
> anything else) could choose to terminate a bdi that was not working
> properly.
> 
> We would need quite a bit of integration so that this "terminate"
> command would take effect, cause various syscalls to return EIO, purge
> dirty memory, avoid stalling sync().  But it hopefully it would be
> a well defined interface and a good starting point.

Unhooking the NFS filesystem's "sync" method might be helpful,
though, to let the system preserve other filesystems before
shutdown. The current situation can indeed result in local
data corruption, and should be considered a bug IMO.


> If the bdi provided more information and more control, it would be a lot
> safer to use lazy unmounts, as we could then work with the filesystem
> even after it had been unmounted.
> 
> Maybe I'll trying playing with bdis in my spare time (if I ever find out
> what "spare time" is).
> 
> Thanks,
> NeilBrown

--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-01  2:22         ` Chuck Lever
@ 2017-11-01 14:38           ` Joshua Watt
  2017-11-02  0:15           ` NeilBrown
  1 sibling, 0 replies; 36+ messages in thread
From: Joshua Watt @ 2017-11-01 14:38 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown; +Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List

On Tue, 2017-10-31 at 22:22 -0400, Chuck Lever wrote:
> > On Oct 31, 2017, at 8:53 PM, NeilBrown <neilb@suse.com> wrote:
> > 
> > On Tue, Oct 31 2017, Jeff Layton wrote:
> > 
> > > On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
> > > > On Mon, Oct 30 2017, J. Bruce Fields wrote:
> > > > 
> > > > > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
> > > > > > I'm working on a networking embedded system where NFS
> > > > > > servers can come
> > > > > > and go from the network, and I've discovered that the
> > > > > > Kernel NFS server
> > > > > 
> > > > > For "Kernel NFS server", I think you mean "Kernel NFS
> > > > > client".
> > > > > 
> > > > > > make it difficult to cleanup applications in a timely
> > > > > > manner when the
> > > > > > server disappears (and yes, I am mounting with "soft" and
> > > > > > relatively
> > > > > > short timeouts). I currently have a user space mechanism
> > > > > > that can
> > > > > > quickly detect when the server disappears, and does a
> > > > > > umount() with the
> > > > > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents
> > > > > > new accesses
> > > > > > to files on the defunct remote server, and I have traced
> > > > > > through the
> > > > > > code to see that MNT_FORCE does indeed cancel any current
> > > > > > RPC tasks
> > > > > > with -EIO. However, this isn't sufficient for my use case
> > > > > > because if a
> > > > > > user space application isn't currently waiting on an RCP
> > > > > > task that gets
> > > > > > canceled, it will have to timeout again before it detects
> > > > > > the
> > > > > > disconnect. For example, if a simple client is copying a
> > > > > > file from the
> > > > > > NFS server, and happens to not be waiting on the RPC task
> > > > > > in the read()
> > > > > > call when umount() occurs, it will be none the wiser and
> > > > > > loop around to
> > > > > > call read() again, which must then try the whole NFS
> > > > > > timeout + recovery
> > > > > > before the failure is detected. If a client is more complex
> > > > > > and has a
> > > > > > lot of open file descriptor, it will typical have to wait
> > > > > > for each one
> > > > > > to timeout, leading to very long delays.
> > > > > > 
> > > > > > The (naive?) solution seems to be to add some flag in
> > > > > > either the NFS
> > > > > > client or the RPC client that gets set in
> > > > > > nfs_umount_begin(). This
> > > > > > would cause all subsequent operations to fail with an error
> > > > > > code
> > > > > > instead of having to be queued as an RPC task and the and
> > > > > > then timing
> > > > > > out. In our example client, the application would then get
> > > > > > the -EIO
> > > > > > immediately on the next (and all subsequent) read() calls.
> > > > > > 
> > > > > > There does seem to be some precedence for doing this
> > > > > > (especially with
> > > > > > network file systems), as both cifs (CifsExiting) and ceph
> > > > > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at
> > > > > > least from
> > > > > > looking at the code. I haven't verified runtime behavior).
> > > > > > 
> > > > > > Are there any pitfalls I'm oversimplifying?
> > > > > 
> > > > > I don't know.
> > > > > 
> > > > > In the hard case I don't think you'd want to do something
> > > > > like
> > > > > this--applications expect mounts to be stay pinned while
> > > > > they're using
> > > > > them, not to get -EIO.  In the soft case maybe an exception
> > > > > like this
> > > > > makes sense.
> > > > 
> > > > Applications also expect to get responses to read() requests,
> > > > and expect
> > > > fsync() to complete, but if the servers has melted down, that
> > > > isn't
> > > > going to happen.  Sometimes unexpected errors are better than
> > > > unexpected
> > > > infinite delays.
> > > > 
> > > > I think we need a reliable way to unmount an NFS filesystem
> > > > mounted from
> > > > a non-responsive server.  Maybe that just means fixing all the
> > > > places
> > > > where use we use TASK_UNINTERRUTIBLE when waiting for the
> > > > server.  That
> > > > would allow processes accessing the filesystem to be killed.  I
> > > > don't
> > > > know if that would meet Joshua's needs.
> > > > 
> > > > Last time this came up, Trond didn't want to make MNT_FORCE too
> > > > strong as
> > > > it only makes sense to be forceful on the final unmount, and we
> > > > cannot
> > > > know if this is the "final" unmount (no other bind-mounts
> > > > around) until
> > > > much later than ->umount_prepare.  Maybe umount is the wrong
> > > > interface.
> > > > Maybe we should expose "struct nfs_client" (or maybe "struct
> > > > nfs_server") objects via sysfs so they can be marked "dead" (or
> > > > similar)
> > > > meaning that all IO should fail.
> > > > 
> > > 
> > > I like this idea.
> > > 
> > > Note that we already have some per-rpc_xprt / per-rpc_clnt info
> > > in
> > > debugfs sunrpc dir. We could make some writable files in there,
> > > to allow
> > > you to kill off individual RPCs or maybe mark a whole clnt and/or
> > > xprt
> > > dead in some fashion.
> > > 
> > > I don't really have a good feel for what this interface should
> > > look like
> > > yet. debugfs is attractive here, as it's supposedly not part of
> > > the
> > > kernel ABI guarantee. That allows us to do some experimentation
> > > in this
> > > area, without making too big an initial commitment.
> > 
> > debugfs might be attractive to kernel developers: "all care but not
> > responsibility", but not so much to application developers (though
> > I do
> > realize that your approch was "something to experiment with" so
> > maybe
> > that doesn't matter).
> 
> I read Jeff's suggestion as "start in debugfs and move to /sys
> with the other long-term administrative interfaces". <shrug>
> 
> 
> > My particular focus is to make systemd shutdown completely
> > reliable.
> 
> In particular: umount.nfs has to be reliable, and/or stuck NFS
> mounts have to avoid impacting other aspects of system
> operation (like syncing other filesystems).
> 
> 
> > It should not block indefinitely on any condition, including
> > inaccessible
> > servers and broken networks.
> 
> There are occasions where a "hard" semantic is appropriate,
> and killing everything unconditionally can result in unexpected
> and unwanted consequences. I would strongly object to any
> approach that includes automatically discarding data without
> user/administrator choice in the matter. One size does not fit
> all here.
> 
> At least let's stop and think about consequences. I'm sure I
> don't understand all of them yet.
> 
> 
> > In stark contrast to Chuck's suggestion that
> > 
> > 
> >   Any RPC that might alter cached data/metadata is not, but others
> >   would be safe.
> > 
> > ("safe" here meaning "safe to kill the RPC"), I think that
> > everything
> > can and should be killed.  Maybe the first step is to purge any
> > dirty
> > pages from the cache.
> > - the server is up, we write the data
> > - if we are happy to wait, we wait
> > - otherwise (the case I'm interested in), we just destroy anything
> >  that gets in the way of unmounting the filesystem.
> 
> (The technical issue still to be resolved is how to kill
> processes that are sleeping uninterruptibly, but for the
> moment let's assume we can do that, and think about the
> policy questions).
> 
> How do you decide the server is "up" without possibly hanging
> or human intervention? Certainly there is a point where we
> want to scorch the earth, but I think the user and administrator
> get to decide where that is. systemd is not the place for that
> decision.
> 
> Actually, I think that dividing the RPC world into "safe to
> kill" and "need to ask first" is roughly the same as the
> steps you outline above: at some point we both eventually get
> to kill_all. The question is how to get there allowing as much
> automation as possible and without compromising data integrity.
> 
> There are RPCs that are always safe to kill, eg NFS READ and
> GETATTR. If those are the only waiters, then they can always be
> terminated on umount. That is already a good start at making
> NFSv3 umount more reliable, since often the only thing it is
> waiting for is a single GETATTR.
> 
> Question marks arise when we are dealing with WRITE, SETATTR,
> and so on, when we know the server is still there somewhere,
> and even when we're not sure whether the server is still
> available. It is here where we assess our satisfaction for
> waiting out a network partition or server restart.
> 
> There's no way to automate that safely, unless you state in
> advance that "your data is not protected during an umount."
> There are some environments where that is OK, and some where
> it is absolute disaster. Again, the user and administrator get
> to make that choice, IMO. Maybe a new mount option?
> 
> Data integrity is critical up until some point where it isn't.
> A container is no longer needed, all the data it cared about
> can be discarded. I think that line needs to be drawn while
> the system is still operational. It's not akin to umount at
> all, since it involves possible destruction of data. It's more
> like TRIM.

For my specific use case, I have some independent userspace monitoring,
so I *know* the server is gone so we can "burn the forest down" so to
speak, and our applications with just have to deal with it (they
historically have done pretty well). I think my strategy will be to
start with a debugfs entry that sets a flag that, when enabled, causes
umount --force to do the hard abort actions. Without the flag being
set, the umount --force behavior will be unchanged. We can promote this
flag to a mount option or a sysfs entry in the future.

I am personally slightly leaning toward a mount option. I think it
could either be a new mount type (e.g. "hard", "soft", "killable"), or
just an option that combines with either (e.g. "killable" or
"nokillable"), although "killable" and "hard" may not really make not
really make a lot of sense together.

I think a mount option might be better than a sysfs entry because it
will play with the sharestate mount option (that is, the superblock
won't be shared between killable and non-killable mounts). It *does*
mean you have to know before you mount what behavior you want, unless
remounting can be tricky, not to invoke any RPC procedures, and clone
the superblock if the options change... I haven't looked at how remount
works on NFS. It is also slightly easier for my specific use case to
simply flag the mount when it is mounted, instead of having to cross
reference it with a sysfs entry (but I realize my use case isn't
necessarily the best one to model on).

> 
> 
> > I'd also like to make the interface completely generic.  I'd rather
> > systemd didn't need to know any specific details about nfs (it
> > already
> > does to some extend - it knows it is a "remote" filesystem) but
> > I'd rather not require more.
> > Maybe I could just sweep the problem under the carpet and use lazy
> > unmounts.  That hides some of the problem, but doesn't stop sync(2)
> > from
> > blocking indefinitely.  And once you have done the lazy unmount,
> > there
> > is no longer any opportunity to use MNT_FORCE.
> 
> IMO a partial answer could be data caching in local files. If
> the client can't flush, then it can preserve the files until
> after the umount and reboot (using, say, fscache). Multi-client
> sharing is still hazardous, but that isn't a very frequent use
> case.
> 
> 
> > Another way to think about this is to consider the bdi rather than
> > the
> > mount point.  If the NFS server is never coming back, then the
> > "backing
> > device" is broken.  If /sys/class/bdi/* contained suitable
> > information
> > to identify the right backing device, and had some way to
> > "terminate
> > with extreme prejudice", then and admin process (like systemd or
> > anything else) could choose to terminate a bdi that was not working
> > properly.
> > 
> > We would need quite a bit of integration so that this "terminate"
> > command would take effect, cause various syscalls to return EIO,
> > purge
> > dirty memory, avoid stalling sync().  But it hopefully it would be
> > a well defined interface and a good starting point.
> 
> Unhooking the NFS filesystem's "sync" method might be helpful,
> though, to let the system preserve other filesystems before
> shutdown. The current situation can indeed result in local
> data corruption, and should be considered a bug IMO.
> 
> 
> > If the bdi provided more information and more control, it would be
> > a lot
> > safer to use lazy unmounts, as we could then work with the
> > filesystem
> > even after it had been unmounted.
> > 
> > Maybe I'll trying playing with bdis in my spare time (if I ever
> > find out
> > what "spare time" is).
> > 
> > Thanks,
> > NeilBrown
> 
> --
> Chuck Lever
> 
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-10-30 21:09   ` NeilBrown
  2017-10-31 14:41     ` Jeff Layton
@ 2017-11-01 17:24     ` Jeff Layton
  2017-11-01 23:13       ` NeilBrown
  1 sibling, 1 reply; 36+ messages in thread
From: Jeff Layton @ 2017-11-01 17:24 UTC (permalink / raw)
  To: NeilBrown, J. Bruce Fields, Joshua Watt; +Cc: linux-nfs, Al Viro

On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
> On Mon, Oct 30 2017, J. Bruce Fields wrote:
> 
> > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
> > > I'm working on a networking embedded system where NFS servers can come
> > > and go from the network, and I've discovered that the Kernel NFS server
> > 
> > For "Kernel NFS server", I think you mean "Kernel NFS client".
> > 
> > > make it difficult to cleanup applications in a timely manner when the
> > > server disappears (and yes, I am mounting with "soft" and relatively
> > > short timeouts). I currently have a user space mechanism that can
> > > quickly detect when the server disappears, and does a umount() with the
> > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
> > > to files on the defunct remote server, and I have traced through the
> > > code to see that MNT_FORCE does indeed cancel any current RPC tasks
> > > with -EIO. However, this isn't sufficient for my use case because if a
> > > user space application isn't currently waiting on an RCP task that gets
> > > canceled, it will have to timeout again before it detects the
> > > disconnect. For example, if a simple client is copying a file from the
> > > NFS server, and happens to not be waiting on the RPC task in the read()
> > > call when umount() occurs, it will be none the wiser and loop around to
> > > call read() again, which must then try the whole NFS timeout + recovery
> > > before the failure is detected. If a client is more complex and has a
> > > lot of open file descriptor, it will typical have to wait for each one
> > > to timeout, leading to very long delays.
> > > 
> > > The (naive?) solution seems to be to add some flag in either the NFS
> > > client or the RPC client that gets set in nfs_umount_begin(). This
> > > would cause all subsequent operations to fail with an error code
> > > instead of having to be queued as an RPC task and the and then timing
> > > out. In our example client, the application would then get the -EIO
> > > immediately on the next (and all subsequent) read() calls.
> > > 
> > > There does seem to be some precedence for doing this (especially with
> > > network file systems), as both cifs (CifsExiting) and ceph
> > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
> > > looking at the code. I haven't verified runtime behavior).
> > > 
> > > Are there any pitfalls I'm oversimplifying?
> > 
> > I don't know.
> > 
> > In the hard case I don't think you'd want to do something like
> > this--applications expect mounts to be stay pinned while they're using
> > them, not to get -EIO.  In the soft case maybe an exception like this
> > makes sense.
> 
> Applications also expect to get responses to read() requests, and expect
> fsync() to complete, but if the servers has melted down, that isn't
> going to happen.  Sometimes unexpected errors are better than unexpected
> infinite delays.
> 
> I think we need a reliable way to unmount an NFS filesystem mounted from
> a non-responsive server.  Maybe that just means fixing all the places
> where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
> would allow processes accessing the filesystem to be killed.  I don't
> know if that would meet Joshua's needs.
> 

I don't quite grok why rpc_kill on all of the RPCs doesn't do the right
thing here. Are we ending up stuck because dirty pages remain after
that has gone through?

> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
> it only makes sense to be forceful on the final unmount, and we cannot
> know if this is the "final" unmount (no other bind-mounts around) until
> much later than ->umount_prepare. 

We can't know for sure that one won't race in while we're tearing things
down, but do we really care so much? If the mount is stuck enough to
require MNT_FORCE then it's likely that you'll end up stuck before you
can do anything on that new bind mount anyway.

Just to dream here for a minute...

We could do a check for bind-mountedness during umount_begin. If it
looks like there is one, we do a MNT_DETACH instead. If not, we flag the
sb in such a way to block (or deny) any new bind mounts until we've had
a chance to tear down the RPCs.

I do realize that the mnt table locking is pretty hairy (we'll probably
need Al Viro's help and support there), but it seems like that's where
we should be aiming.

>  Maybe umount is the wrong interface.
> Maybe we should expose "struct nfs_client" (or maybe "struct
> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
> meaning that all IO should fail.
> 

Now that I've thought about it more, I rather like using umount with
MNT_FORCE for this, really. It seems like that's what its intended use
was, and the fact that it doesn't quite work that way has always been a
point of confusion for users. It'd be nice if that magically started
working more like they expect.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-01 17:24     ` Jeff Layton
@ 2017-11-01 23:13       ` NeilBrown
  2017-11-02 12:09         ` Jeff Layton
  0 siblings, 1 reply; 36+ messages in thread
From: NeilBrown @ 2017-11-01 23:13 UTC (permalink / raw)
  To: Jeff Layton, J. Bruce Fields, Joshua Watt; +Cc: linux-nfs, Al Viro

[-- Attachment #1: Type: text/plain, Size: 7574 bytes --]

On Wed, Nov 01 2017, Jeff Layton wrote:

> On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>> On Mon, Oct 30 2017, J. Bruce Fields wrote:
>> 
>> > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>> > > I'm working on a networking embedded system where NFS servers can come
>> > > and go from the network, and I've discovered that the Kernel NFS server
>> > 
>> > For "Kernel NFS server", I think you mean "Kernel NFS client".
>> > 
>> > > make it difficult to cleanup applications in a timely manner when the
>> > > server disappears (and yes, I am mounting with "soft" and relatively
>> > > short timeouts). I currently have a user space mechanism that can
>> > > quickly detect when the server disappears, and does a umount() with the
>> > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>> > > to files on the defunct remote server, and I have traced through the
>> > > code to see that MNT_FORCE does indeed cancel any current RPC tasks
>> > > with -EIO. However, this isn't sufficient for my use case because if a
>> > > user space application isn't currently waiting on an RCP task that gets
>> > > canceled, it will have to timeout again before it detects the
>> > > disconnect. For example, if a simple client is copying a file from the
>> > > NFS server, and happens to not be waiting on the RPC task in the read()
>> > > call when umount() occurs, it will be none the wiser and loop around to
>> > > call read() again, which must then try the whole NFS timeout + recovery
>> > > before the failure is detected. If a client is more complex and has a
>> > > lot of open file descriptor, it will typical have to wait for each one
>> > > to timeout, leading to very long delays.
>> > > 
>> > > The (naive?) solution seems to be to add some flag in either the NFS
>> > > client or the RPC client that gets set in nfs_umount_begin(). This
>> > > would cause all subsequent operations to fail with an error code
>> > > instead of having to be queued as an RPC task and the and then timing
>> > > out. In our example client, the application would then get the -EIO
>> > > immediately on the next (and all subsequent) read() calls.
>> > > 
>> > > There does seem to be some precedence for doing this (especially with
>> > > network file systems), as both cifs (CifsExiting) and ceph
>> > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>> > > looking at the code. I haven't verified runtime behavior).
>> > > 
>> > > Are there any pitfalls I'm oversimplifying?
>> > 
>> > I don't know.
>> > 
>> > In the hard case I don't think you'd want to do something like
>> > this--applications expect mounts to be stay pinned while they're using
>> > them, not to get -EIO.  In the soft case maybe an exception like this
>> > makes sense.
>> 
>> Applications also expect to get responses to read() requests, and expect
>> fsync() to complete, but if the servers has melted down, that isn't
>> going to happen.  Sometimes unexpected errors are better than unexpected
>> infinite delays.
>> 
>> I think we need a reliable way to unmount an NFS filesystem mounted from
>> a non-responsive server.  Maybe that just means fixing all the places
>> where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
>> would allow processes accessing the filesystem to be killed.  I don't
>> know if that would meet Joshua's needs.
>> 
>
> I don't quite grok why rpc_kill on all of the RPCs doesn't do the right
> thing here. Are we ending up stuck because dirty pages remain after
> that has gone through?

Simply because the caller might submit a new RPC that could then block.
I've (long ago) had experiences where I had to run "umount -f" several
times before the processes would die and the filesystem could be
unmounted. 

>
>> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
>> it only makes sense to be forceful on the final unmount, and we cannot
>> know if this is the "final" unmount (no other bind-mounts around) until
>> much later than ->umount_prepare. 
>
> We can't know for sure that one won't race in while we're tearing things
> down, but do we really care so much? If the mount is stuck enough to
> require MNT_FORCE then it's likely that you'll end up stuck before you
> can do anything on that new bind mount anyway.

I might be happy to wait for 5 seconds, you might be happy to wait for 5
minutes.
So it is fair for me to use MNT_FORCE on a filesystem that both of us
have mounted in different places?

>
> Just to dream here for a minute...
>
> We could do a check for bind-mountedness during umount_begin. If it
> looks like there is one, we do a MNT_DETACH instead. If not, we flag the
> sb in such a way to block (or deny) any new bind mounts until we've had
> a chance to tear down the RPCs.

MNT_DETACH mustn't be used when it isn't requested.
Without MNT_DETACH, umount checks for any open file descriptions
(including executables and cwd etc).  If it finds any, it fails.
With MNT_DETACH that check is skipped.  So they have very different
semantics.

The point of MNT_FORCE (as I understand it), is to release processes
that are blocking in uninterruptible waits so they can respond to
signals (that have already been sent) and can close all fds and die, so
that there will be no more open-file-description on that mount
(different binds mounts have different sets of ofds) so that the
unmount can complete.

If we used TASK_KILLABLE everywhere so that any process blocked in NFS
could be killed, then we could move the handling of MNT_FORCE out of
umount_begin and into nfs_kill_super.... maybe.  Then
MNT_FORCE|MNT_DETACH might be able to make sense.  Maybe the MNT_FORCE
from the last unmount wins?  If it is set, then any dirty pages are
discarded.  If not set, we keep trying to write dirty pages. (though
with current code, dirty pages will stop nfs_kill_super() from even
being called).

For Joshua's use case, he doesn't want to signal those processes but he
presumably trusts them to close file descriptors when they get EIO, and
maybe they never chdir into an NFS filesystem or exec a binary stored there.
So he really does want a persistent "kill all future rpcs".
I don't really think this is an "unmount" function at all.
Maybe it is more like "mount -o remount,soft,timeo=1,retrans=0"
Except that you cannot change any of those with a remount (and when you
try, mount doesn't tell you it failed, unless you use "-v").
I wonder if it is safe to allow them to change if nosharecache is
given. Or maybe even if it isn't but the nfs_client isn't shared.

>
> I do realize that the mnt table locking is pretty hairy (we'll probably
> need Al Viro's help and support there), but it seems like that's where
> we should be aiming.
>
>>  Maybe umount is the wrong interface.
>> Maybe we should expose "struct nfs_client" (or maybe "struct
>> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
>> meaning that all IO should fail.
>> 
>
> Now that I've thought about it more, I rather like using umount with
> MNT_FORCE for this, really. It seems like that's what its intended use
> was, and the fact that it doesn't quite work that way has always been a
> point of confusion for users. It'd be nice if that magically started
> working more like they expect.

So what, exactly, would you suggest be the semantics of MNT_FORCE?
How does it interact with MNT_DETACH?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-01  2:22         ` Chuck Lever
  2017-11-01 14:38           ` Joshua Watt
@ 2017-11-02  0:15           ` NeilBrown
  2017-11-02 19:46             ` Chuck Lever
  1 sibling, 1 reply; 36+ messages in thread
From: NeilBrown @ 2017-11-02  0:15 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Bruce Fields, Joshua Watt, Linux NFS Mailing List

[-- Attachment #1: Type: text/plain, Size: 12991 bytes --]

On Tue, Oct 31 2017, Chuck Lever wrote:

>> On Oct 31, 2017, at 8:53 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Tue, Oct 31 2017, Jeff Layton wrote:
>> 
>>> On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>>>> On Mon, Oct 30 2017, J. Bruce Fields wrote:
>>>> 
>>>>> On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>>>>>> I'm working on a networking embedded system where NFS servers can come
>>>>>> and go from the network, and I've discovered that the Kernel NFS server
>>>>> 
>>>>> For "Kernel NFS server", I think you mean "Kernel NFS client".
>>>>> 
>>>>>> make it difficult to cleanup applications in a timely manner when the
>>>>>> server disappears (and yes, I am mounting with "soft" and relatively
>>>>>> short timeouts). I currently have a user space mechanism that can
>>>>>> quickly detect when the server disappears, and does a umount() with the
>>>>>> MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>>>>>> to files on the defunct remote server, and I have traced through the
>>>>>> code to see that MNT_FORCE does indeed cancel any current RPC tasks
>>>>>> with -EIO. However, this isn't sufficient for my use case because if a
>>>>>> user space application isn't currently waiting on an RCP task that gets
>>>>>> canceled, it will have to timeout again before it detects the
>>>>>> disconnect. For example, if a simple client is copying a file from the
>>>>>> NFS server, and happens to not be waiting on the RPC task in the read()
>>>>>> call when umount() occurs, it will be none the wiser and loop around to
>>>>>> call read() again, which must then try the whole NFS timeout + recovery
>>>>>> before the failure is detected. If a client is more complex and has a
>>>>>> lot of open file descriptor, it will typical have to wait for each one
>>>>>> to timeout, leading to very long delays.
>>>>>> 
>>>>>> The (naive?) solution seems to be to add some flag in either the NFS
>>>>>> client or the RPC client that gets set in nfs_umount_begin(). This
>>>>>> would cause all subsequent operations to fail with an error code
>>>>>> instead of having to be queued as an RPC task and the and then timing
>>>>>> out. In our example client, the application would then get the -EIO
>>>>>> immediately on the next (and all subsequent) read() calls.
>>>>>> 
>>>>>> There does seem to be some precedence for doing this (especially with
>>>>>> network file systems), as both cifs (CifsExiting) and ceph
>>>>>> (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>>>>>> looking at the code. I haven't verified runtime behavior).
>>>>>> 
>>>>>> Are there any pitfalls I'm oversimplifying?
>>>>> 
>>>>> I don't know.
>>>>> 
>>>>> In the hard case I don't think you'd want to do something like
>>>>> this--applications expect mounts to be stay pinned while they're using
>>>>> them, not to get -EIO.  In the soft case maybe an exception like this
>>>>> makes sense.
>>>> 
>>>> Applications also expect to get responses to read() requests, and expect
>>>> fsync() to complete, but if the servers has melted down, that isn't
>>>> going to happen.  Sometimes unexpected errors are better than unexpected
>>>> infinite delays.
>>>> 
>>>> I think we need a reliable way to unmount an NFS filesystem mounted from
>>>> a non-responsive server.  Maybe that just means fixing all the places
>>>> where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
>>>> would allow processes accessing the filesystem to be killed.  I don't
>>>> know if that would meet Joshua's needs.
>>>> 
>>>> Last time this came up, Trond didn't want to make MNT_FORCE too strong as
>>>> it only makes sense to be forceful on the final unmount, and we cannot
>>>> know if this is the "final" unmount (no other bind-mounts around) until
>>>> much later than ->umount_prepare.  Maybe umount is the wrong interface.
>>>> Maybe we should expose "struct nfs_client" (or maybe "struct
>>>> nfs_server") objects via sysfs so they can be marked "dead" (or similar)
>>>> meaning that all IO should fail.
>>>> 
>>> 
>>> I like this idea.
>>> 
>>> Note that we already have some per-rpc_xprt / per-rpc_clnt info in
>>> debugfs sunrpc dir. We could make some writable files in there, to allow
>>> you to kill off individual RPCs or maybe mark a whole clnt and/or xprt
>>> dead in some fashion.
>>> 
>>> I don't really have a good feel for what this interface should look like
>>> yet. debugfs is attractive here, as it's supposedly not part of the
>>> kernel ABI guarantee. That allows us to do some experimentation in this
>>> area, without making too big an initial commitment.
>> 
>> debugfs might be attractive to kernel developers: "all care but not
>> responsibility", but not so much to application developers (though I do
>> realize that your approch was "something to experiment with" so maybe
>> that doesn't matter).
>
> I read Jeff's suggestion as "start in debugfs and move to /sys
> with the other long-term administrative interfaces". <shrug>
>
>
>> My particular focus is to make systemd shutdown completely reliable.
>
> In particular: umount.nfs has to be reliable, and/or stuck NFS
> mounts have to avoid impacting other aspects of system
> operation (like syncing other filesystems).
>
>
>> It should not block indefinitely on any condition, including inaccessible
>> servers and broken networks.
>
> There are occasions where a "hard" semantic is appropriate,
> and killing everything unconditionally can result in unexpected
> and unwanted consequences. I would strongly object to any
> approach that includes automatically discarding data without
> user/administrator choice in the matter. One size does not fit
> all here.

Unix has always had a shutdown sequence that sends SIGTERM to all
processes, then waits a while, then sends SIGKILL.  If those processes
had data in memory which had not yet been written to storage, then that
process automatically discards data.
I don't see an important difference between that, and purging
the cache of filesystems which cannot write out to storage.
Just like with SIGTERM, we first trigger a sync() and give it a
reasonable chance to make progress (ideally monitoring the progress
somehow so we know when it completes, or when it stalls).  But
if sync isn't making progress, and if the  sysadmin has requested a
system shutdown, then I think it is entirely appropriate to use a bigger
hammer.

>
> At least let's stop and think about consequences. I'm sure I
> don't understand all of them yet.
>
>
>> In stark contrast to Chuck's suggestion that
>> 
>> 
>>   Any RPC that might alter cached data/metadata is not, but others
>>   would be safe.
>> 
>> ("safe" here meaning "safe to kill the RPC"), I think that everything
>> can and should be killed.  Maybe the first step is to purge any dirty
>> pages from the cache.
>> - the server is up, we write the data
>> - if we are happy to wait, we wait
>> - otherwise (the case I'm interested in), we just destroy anything
>>  that gets in the way of unmounting the filesystem.
>
> (The technical issue still to be resolved is how to kill
> processes that are sleeping uninterruptibly, but for the
> moment let's assume we can do that, and think about the
> policy questions).
>
> How do you decide the server is "up" without possibly hanging
> or human intervention? Certainly there is a point where we
> want to scorch the earth, but I think the user and administrator
> get to decide where that is. systemd is not the place for that
> decision.

When I tell the system to shutdown, I want it to do that.  No ifs or
buts or maybes.  I don't expect it to be instant (though I expect
"poweroff -f -n" to be jolly close), but do expect it to be prompt.

>
> Actually, I think that dividing the RPC world into "safe to
> kill" and "need to ask first" is roughly the same as the
> steps you outline above: at some point we both eventually get
> to kill_all. The question is how to get there allowing as much
> automation as possible and without compromising data integrity.

Once upon a time there was an NFS implementation which had a "spongy"
mount option, that was meant to be a compromise between "soft"
and "hard".

http://uw714doc.sco.com/en/FS_manager/nfsD.nfsopt.html

I don't know why it didn't catch on, but I suspect because the question
of whether or not it is "safe" to abort an NFS operation doesn't
actually depend on the operation itself, but depends on the use that the
result will be put to, and on the care put into error handling in the
application code.

So I don't think we can meaningfully draw that line (tempting though it
is).  Either the NFS server is working, and we wait for it, or we give
up waiting and cancel everything equally.

>
> There are RPCs that are always safe to kill, eg NFS READ and
> GETATTR. If those are the only waiters, then they can always be
> terminated on umount. That is already a good start at making
> NFSv3 umount more reliable, since often the only thing it is
> waiting for is a single GETATTR.
>
> Question marks arise when we are dealing with WRITE, SETATTR,
> and so on, when we know the server is still there somewhere,
> and even when we're not sure whether the server is still
> available. It is here where we assess our satisfaction for
> waiting out a network partition or server restart.
>
> There's no way to automate that safely, unless you state in
> advance that "your data is not protected during an umount."
> There are some environments where that is OK, and some where
> it is absolute disaster. Again, the user and administrator get
> to make that choice, IMO. Maybe a new mount option?
>
> Data integrity is critical up until some point where it isn't.
> A container is no longer needed, all the data it cared about
> can be discarded. I think that line needs to be drawn while
> the system is still operational. It's not akin to umount at
> all, since it involves possible destruction of data. It's more
> like TRIM.
>
>
>> I'd also like to make the interface completely generic.  I'd rather
>> systemd didn't need to know any specific details about nfs (it already
>> does to some extend - it knows it is a "remote" filesystem) but
>> I'd rather not require more.
>
>> Maybe I could just sweep the problem under the carpet and use lazy
>> unmounts.  That hides some of the problem, but doesn't stop sync(2) from
>> blocking indefinitely.  And once you have done the lazy unmount, there
>> is no longer any opportunity to use MNT_FORCE.
>
> IMO a partial answer could be data caching in local files. If
> the client can't flush, then it can preserve the files until
> after the umount and reboot (using, say, fscache). Multi-client
> sharing is still hazardous, but that isn't a very frequent use
> case.

What data is it, exactly, that we are worried about here?
Data that an application has written, but that it hasn't called fsync()
on?  Do it isn't really all that important.  It might just be scratch
data.  It might be event logs.  It certainly isn't committed database
data, or an incoming email message, or data saved by an editor, or
really anything else important.
It is data that would be lost if you kicked the powerplug out by
mistake.
It is data that we would rather save if we could, but data that is not
worth bending over backwards to keep a copy of in a non-standard
location just in case someone really cares.

That's how I see it anyway.

Thanks,
NeilBrown


>
>
>> Another way to think about this is to consider the bdi rather than the
>> mount point.  If the NFS server is never coming back, then the "backing
>> device" is broken.  If /sys/class/bdi/* contained suitable information
>> to identify the right backing device, and had some way to "terminate
>> with extreme prejudice", then and admin process (like systemd or
>> anything else) could choose to terminate a bdi that was not working
>> properly.
>> 
>> We would need quite a bit of integration so that this "terminate"
>> command would take effect, cause various syscalls to return EIO, purge
>> dirty memory, avoid stalling sync().  But it hopefully it would be
>> a well defined interface and a good starting point.
>
> Unhooking the NFS filesystem's "sync" method might be helpful,
> though, to let the system preserve other filesystems before
> shutdown. The current situation can indeed result in local
> data corruption, and should be considered a bug IMO.
>
>
>> If the bdi provided more information and more control, it would be a lot
>> safer to use lazy unmounts, as we could then work with the filesystem
>> even after it had been unmounted.
>> 
>> Maybe I'll trying playing with bdis in my spare time (if I ever find out
>> what "spare time" is).
>> 
>> Thanks,
>> NeilBrown
>
> --
> Chuck Lever

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-01 23:13       ` NeilBrown
@ 2017-11-02 12:09         ` Jeff Layton
  2017-11-02 14:54           ` Joshua Watt
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff Layton @ 2017-11-02 12:09 UTC (permalink / raw)
  To: NeilBrown, J. Bruce Fields, Joshua Watt; +Cc: linux-nfs, Al Viro, David Howells

On Thu, 2017-11-02 at 10:13 +1100, NeilBrown wrote:
> On Wed, Nov 01 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
> > > On Mon, Oct 30 2017, J. Bruce Fields wrote:
> > > 
> > > > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
> > > > > I'm working on a networking embedded system where NFS servers can come
> > > > > and go from the network, and I've discovered that the Kernel NFS server
> > > > 
> > > > For "Kernel NFS server", I think you mean "Kernel NFS client".
> > > > 
> > > > > make it difficult to cleanup applications in a timely manner when the
> > > > > server disappears (and yes, I am mounting with "soft" and relatively
> > > > > short timeouts). I currently have a user space mechanism that can
> > > > > quickly detect when the server disappears, and does a umount() with the
> > > > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
> > > > > to files on the defunct remote server, and I have traced through the
> > > > > code to see that MNT_FORCE does indeed cancel any current RPC tasks
> > > > > with -EIO. However, this isn't sufficient for my use case because if a
> > > > > user space application isn't currently waiting on an RCP task that gets
> > > > > canceled, it will have to timeout again before it detects the
> > > > > disconnect. For example, if a simple client is copying a file from the
> > > > > NFS server, and happens to not be waiting on the RPC task in the read()
> > > > > call when umount() occurs, it will be none the wiser and loop around to
> > > > > call read() again, which must then try the whole NFS timeout + recovery
> > > > > before the failure is detected. If a client is more complex and has a
> > > > > lot of open file descriptor, it will typical have to wait for each one
> > > > > to timeout, leading to very long delays.
> > > > > 
> > > > > The (naive?) solution seems to be to add some flag in either the NFS
> > > > > client or the RPC client that gets set in nfs_umount_begin(). This
> > > > > would cause all subsequent operations to fail with an error code
> > > > > instead of having to be queued as an RPC task and the and then timing
> > > > > out. In our example client, the application would then get the -EIO
> > > > > immediately on the next (and all subsequent) read() calls.
> > > > > 
> > > > > There does seem to be some precedence for doing this (especially with
> > > > > network file systems), as both cifs (CifsExiting) and ceph
> > > > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
> > > > > looking at the code. I haven't verified runtime behavior).
> > > > > 
> > > > > Are there any pitfalls I'm oversimplifying?
> > > > 
> > > > I don't know.
> > > > 
> > > > In the hard case I don't think you'd want to do something like
> > > > this--applications expect mounts to be stay pinned while they're using
> > > > them, not to get -EIO.  In the soft case maybe an exception like this
> > > > makes sense.
> > > 
> > > Applications also expect to get responses to read() requests, and expect
> > > fsync() to complete, but if the servers has melted down, that isn't
> > > going to happen.  Sometimes unexpected errors are better than unexpected
> > > infinite delays.
> > > 
> > > I think we need a reliable way to unmount an NFS filesystem mounted from
> > > a non-responsive server.  Maybe that just means fixing all the places
> > > where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
> > > would allow processes accessing the filesystem to be killed.  I don't
> > > know if that would meet Joshua's needs.
> > > 
> > 
> > I don't quite grok why rpc_kill on all of the RPCs doesn't do the right
> > thing here. Are we ending up stuck because dirty pages remain after
> > that has gone through?
> 
> Simply because the caller might submit a new RPC that could then block.
> I've (long ago) had experiences where I had to run "umount -f" several
> times before the processes would die and the filesystem could be
> unmounted. 
> 

Ok, makes sense, thanks, and I've seen that too.

> > 
> > > Last time this came up, Trond didn't want to make MNT_FORCE too strong as
> > > it only makes sense to be forceful on the final unmount, and we cannot
> > > know if this is the "final" unmount (no other bind-mounts around) until
> > > much later than ->umount_prepare. 
> > 
> > We can't know for sure that one won't race in while we're tearing things
> > down, but do we really care so much? If the mount is stuck enough to
> > require MNT_FORCE then it's likely that you'll end up stuck before you
> > can do anything on that new bind mount anyway.
> 
> I might be happy to wait for 5 seconds, you might be happy to wait for 5
> minutes.
> So it is fair for me to use MNT_FORCE on a filesystem that both of us
> have mounted in different places?
> 

Arguably, yes.

MNT_FORCE is an administrative decision. It's only used if root (or the
equivalent) requests it. root had better know what he's doing if he does
request it.

That said, we don't make that part easy and it's very poorly documented.
Seeing errors on a seemingly unrelated bind mount is probably rather
surprising for most folks.

> > 
> > Just to dream here for a minute...
> > 
> > We could do a check for bind-mountedness during umount_begin. If it
> > looks like there is one, we do a MNT_DETACH instead. If not, we flag the
> > sb in such a way to block (or deny) any new bind mounts until we've had
> > a chance to tear down the RPCs.
> 
> MNT_DETACH mustn't be used when it isn't requested.
> Without MNT_DETACH, umount checks for any open file descriptions
> (including executables and cwd etc).  If it finds any, it fails.
> With MNT_DETACH that check is skipped.  So they have very different
> semantics.
> 
> The point of MNT_FORCE (as I understand it), is to release processes
> that are blocking in uninterruptible waits so they can respond to
> signals (that have already been sent) and can close all fds and die, so
> that there will be no more open-file-description on that mount
> (different binds mounts have different sets of ofds) so that the
> unmount can complete.
>

The manpage is a bit more vague in this regard (and the Only for NFS
mounts comment is clearly wrong -- we should fix that):

       MNT_FORCE (since Linux 2.1.116)
              Force  unmount  even  if  busy.  This can cause data loss.
              (Only for NFS mounts.)

So it doesn't quite say anything about releasing processes, though I'm
pretty sure that is how it has worked historically.

> If we used TASK_KILLABLE everywhere so that any process blocked in NFS
> could be killed, then we could move the handling of MNT_FORCE out of
> umount_begin and into nfs_kill_super.... maybe.  Then
> MNT_FORCE|MNT_DETACH might be able to make sense.  Maybe the MNT_FORCE
> from the last unmount wins?  If it is set, then any dirty pages are
> discarded.  If not set, we keep trying to write dirty pages. (though
> with current code, dirty pages will stop nfs_kill_super() from even
> being called).
> 
> For Joshua's use case, he doesn't want to signal those processes but he
> presumably trusts them to close file descriptors when they get EIO, and
> maybe they never chdir into an NFS filesystem or exec a binary stored there.
> So he really does want a persistent "kill all future rpcs".
> I don't really think this is an "unmount" function at all.
> Maybe it is more like "mount -o remount,soft,timeo=1,retrans=0"
> Except that you cannot change any of those with a remount (and when you
> try, mount doesn't tell you it failed, unless you use "-v").
> I wonder if it is safe to allow them to change if nosharecache is
> given. Or maybe even if it isn't but the nfs_client isn't shared.
> 
> > 
> > I do realize that the mnt table locking is pretty hairy (we'll probably
> > need Al Viro's help and support there), but it seems like that's where
> > we should be aiming.
> > 
> > >  Maybe umount is the wrong interface.
> > > Maybe we should expose "struct nfs_client" (or maybe "struct
> > > nfs_server") objects via sysfs so they can be marked "dead" (or similar)
> > > meaning that all IO should fail.
> > > 
> > 
> > Now that I've thought about it more, I rather like using umount with
> > MNT_FORCE for this, really. It seems like that's what its intended use
> > was, and the fact that it doesn't quite work that way has always been a
> > point of confusion for users. It'd be nice if that magically started
> > working more like they expect.
> 
> So what, exactly, would you suggest be the semantics of MNT_FORCE?
> How does it interact with MNT_DETACH?
> 

My (admittedly half-baked) thinking was to "reinterpret" MNT_FORCE to
act like MNT_DETACH in the case where there are other bind mounts
present.

It was all smoke and mirrors to get the thing detached from the tree,
and hopefully to clean stuff up once it's detached. Now that you've
pointed out the difficulties here though, I think it's not really
sufficient.

You could still have processes with open fd's, for instance and that
doesn't really do anything about them. It's a pity that the revoke()
syscall work never came to fruition, that could be useful here.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-02 12:09         ` Jeff Layton
@ 2017-11-02 14:54           ` Joshua Watt
  2017-11-08  3:30             ` NeilBrown
  2017-11-08 14:59             ` [RFC 0/4] " Joshua Watt
  0 siblings, 2 replies; 36+ messages in thread
From: Joshua Watt @ 2017-11-02 14:54 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, J. Bruce Fields, Linux NFS Mailing List, Al Viro,
	David Howells

On Thu, Nov 2, 2017 at 7:09 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 2017-11-02 at 10:13 +1100, NeilBrown wrote:
>> On Wed, Nov 01 2017, Jeff Layton wrote:
>>
>> > On Tue, 2017-10-31 at 08:09 +1100, NeilBrown wrote:
>> > > On Mon, Oct 30 2017, J. Bruce Fields wrote:
>> > >
>> > > > On Wed, Oct 25, 2017 at 12:11:46PM -0500, Joshua Watt wrote:
>> > > > > I'm working on a networking embedded system where NFS servers can come
>> > > > > and go from the network, and I've discovered that the Kernel NFS server
>> > > >
>> > > > For "Kernel NFS server", I think you mean "Kernel NFS client".
>> > > >
>> > > > > make it difficult to cleanup applications in a timely manner when the
>> > > > > server disappears (and yes, I am mounting with "soft" and relatively
>> > > > > short timeouts). I currently have a user space mechanism that can
>> > > > > quickly detect when the server disappears, and does a umount() with the
>> > > > > MNT_FORCE and MNT_DETACH flags. Using MNT_DETACH prevents new accesses
>> > > > > to files on the defunct remote server, and I have traced through the
>> > > > > code to see that MNT_FORCE does indeed cancel any current RPC tasks
>> > > > > with -EIO. However, this isn't sufficient for my use case because if a
>> > > > > user space application isn't currently waiting on an RCP task that gets
>> > > > > canceled, it will have to timeout again before it detects the
>> > > > > disconnect. For example, if a simple client is copying a file from the
>> > > > > NFS server, and happens to not be waiting on the RPC task in the read()
>> > > > > call when umount() occurs, it will be none the wiser and loop around to
>> > > > > call read() again, which must then try the whole NFS timeout + recovery
>> > > > > before the failure is detected. If a client is more complex and has a
>> > > > > lot of open file descriptor, it will typical have to wait for each one
>> > > > > to timeout, leading to very long delays.
>> > > > >
>> > > > > The (naive?) solution seems to be to add some flag in either the NFS
>> > > > > client or the RPC client that gets set in nfs_umount_begin(). This
>> > > > > would cause all subsequent operations to fail with an error code
>> > > > > instead of having to be queued as an RPC task and the and then timing
>> > > > > out. In our example client, the application would then get the -EIO
>> > > > > immediately on the next (and all subsequent) read() calls.
>> > > > >
>> > > > > There does seem to be some precedence for doing this (especially with
>> > > > > network file systems), as both cifs (CifsExiting) and ceph
>> > > > > (CEPH_MOUNT_SHUTDOWN) appear to implement this behavior (at least from
>> > > > > looking at the code. I haven't verified runtime behavior).
>> > > > >
>> > > > > Are there any pitfalls I'm oversimplifying?
>> > > >
>> > > > I don't know.
>> > > >
>> > > > In the hard case I don't think you'd want to do something like
>> > > > this--applications expect mounts to be stay pinned while they're using
>> > > > them, not to get -EIO.  In the soft case maybe an exception like this
>> > > > makes sense.
>> > >
>> > > Applications also expect to get responses to read() requests, and expect
>> > > fsync() to complete, but if the servers has melted down, that isn't
>> > > going to happen.  Sometimes unexpected errors are better than unexpected
>> > > infinite delays.
>> > >
>> > > I think we need a reliable way to unmount an NFS filesystem mounted from
>> > > a non-responsive server.  Maybe that just means fixing all the places
>> > > where use we use TASK_UNINTERRUTIBLE when waiting for the server.  That
>> > > would allow processes accessing the filesystem to be killed.  I don't
>> > > know if that would meet Joshua's needs.
>> > >
>> >
>> > I don't quite grok why rpc_kill on all of the RPCs doesn't do the right
>> > thing here. Are we ending up stuck because dirty pages remain after
>> > that has gone through?
>>
>> Simply because the caller might submit a new RPC that could then block.
>> I've (long ago) had experiences where I had to run "umount -f" several
>> times before the processes would die and the filesystem could be
>> unmounted.
>>
>
> Ok, makes sense, thanks, and I've seen that too.
>
>> >
>> > > Last time this came up, Trond didn't want to make MNT_FORCE too strong as
>> > > it only makes sense to be forceful on the final unmount, and we cannot
>> > > know if this is the "final" unmount (no other bind-mounts around) until
>> > > much later than ->umount_prepare.
>> >
>> > We can't know for sure that one won't race in while we're tearing things
>> > down, but do we really care so much? If the mount is stuck enough to
>> > require MNT_FORCE then it's likely that you'll end up stuck before you
>> > can do anything on that new bind mount anyway.
>>
>> I might be happy to wait for 5 seconds, you might be happy to wait for 5
>> minutes.
>> So it is fair for me to use MNT_FORCE on a filesystem that both of us
>> have mounted in different places?
>>
>
> Arguably, yes.
>
> MNT_FORCE is an administrative decision. It's only used if root (or the
> equivalent) requests it. root had better know what he's doing if he does
> request it.

I agree with this. If we are punting this complication to the the
system administator (which makes sense to me), can't the administrator
mark the NFS mount as unbindable if they don't want to deal with that
particular problem (Full disclosure, I'm not fully up to speed on what
MS_UNBINDABLE does, so I might be missing something here)? Some
documentation might be in order to suggest that?

I think there is some disparity about what MNT_FORCE applies to. Is it
intended to apply to the superblock, or the specific mount? As I
pointed out, CEPH and CIFS already have some sort of mechanism to deal
with MNT_FORCE that appears to prevent future accesses (as I am
proposing), and it appears to apply at the superblock level (at least
in a cursory reading, we need to dig further to figure out what it is
actually doing), so perhaps there is some precedent for this?

>
> That said, we don't make that part easy and it's very poorly documented.
> Seeing errors on a seemingly unrelated bind mount is probably rather
> surprising for most folks.
>
>> >
>> > Just to dream here for a minute...
>> >
>> > We could do a check for bind-mountedness during umount_begin. If it
>> > looks like there is one, we do a MNT_DETACH instead. If not, we flag the
>> > sb in such a way to block (or deny) any new bind mounts until we've had
>> > a chance to tear down the RPCs.
>>
>> MNT_DETACH mustn't be used when it isn't requested.
>> Without MNT_DETACH, umount checks for any open file descriptions
>> (including executables and cwd etc).  If it finds any, it fails.
>> With MNT_DETACH that check is skipped.  So they have very different
>> semantics.
>>
>> The point of MNT_FORCE (as I understand it), is to release processes
>> that are blocking in uninterruptible waits so they can respond to
>> signals (that have already been sent) and can close all fds and die, so
>> that there will be no more open-file-description on that mount
>> (different binds mounts have different sets of ofds) so that the
>> unmount can complete.
>>
>
> The manpage is a bit more vague in this regard (and the Only for NFS
> mounts comment is clearly wrong -- we should fix that):
>
>        MNT_FORCE (since Linux 2.1.116)
>               Force  unmount  even  if  busy.  This can cause data loss.
>               (Only for NFS mounts.)
>
> So it doesn't quite say anything about releasing processes, though I'm
> pretty sure that is how it has worked historically.
>
>> If we used TASK_KILLABLE everywhere so that any process blocked in NFS
>> could be killed, then we could move the handling of MNT_FORCE out of
>> umount_begin and into nfs_kill_super.... maybe.  Then
>> MNT_FORCE|MNT_DETACH might be able to make sense.  Maybe the MNT_FORCE

Somewhat of a side note: The userspace umount program won't let you
specify both --lazy and --force even though the kernel doesn't appear
to care:
see https://marc.info/?l=util-linux-ng&m=150939967502180&w=2

Anyone here happen to know why?

>> from the last unmount wins?  If it is set, then any dirty pages are
>> discarded.  If not set, we keep trying to write dirty pages. (though
>> with current code, dirty pages will stop nfs_kill_super() from even
>> being called).
>>
>> For Joshua's use case, he doesn't want to signal those processes but he
>> presumably trusts them to close file descriptors when they get EIO, and
>> maybe they never chdir into an NFS filesystem or exec a binary stored there.
>> So he really does want a persistent "kill all future rpcs".
>> I don't really think this is an "unmount" function at all.
>> Maybe it is more like "mount -o remount,soft,timeo=1,retrans=0"
>> Except that you cannot change any of those with a remount (and when you
>> try, mount doesn't tell you it failed, unless you use "-v").
>> I wonder if it is safe to allow them to change if nosharecache is
>> given. Or maybe even if it isn't but the nfs_client isn't shared.
>>
>> >
>> > I do realize that the mnt table locking is pretty hairy (we'll probably
>> > need Al Viro's help and support there), but it seems like that's where
>> > we should be aiming.
>> >
>> > >  Maybe umount is the wrong interface.
>> > > Maybe we should expose "struct nfs_client" (or maybe "struct
>> > > nfs_server") objects via sysfs so they can be marked "dead" (or similar)
>> > > meaning that all IO should fail.
>> > >
>> >
>> > Now that I've thought about it more, I rather like using umount with
>> > MNT_FORCE for this, really. It seems like that's what its intended use
>> > was, and the fact that it doesn't quite work that way has always been a
>> > point of confusion for users. It'd be nice if that magically started
>> > working more like they expect.
>>
>> So what, exactly, would you suggest be the semantics of MNT_FORCE?
>> How does it interact with MNT_DETACH?
>>
>
> My (admittedly half-baked) thinking was to "reinterpret" MNT_FORCE to
> act like MNT_DETACH in the case where there are other bind mounts
> present.
>
> It was all smoke and mirrors to get the thing detached from the tree,
> and hopefully to clean stuff up once it's detached. Now that you've
> pointed out the difficulties here though, I think it's not really
> sufficient.
>
> You could still have processes with open fd's, for instance and that
> doesn't really do anything about them. It's a pity that the revoke()
> syscall work never came to fruition, that could be useful here.

Yes revoke() was my first though for solving this.... I (and probably
the rest of the embedded Linux world) will owe whoever finally
implements that a beer.

> --
> Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-02  0:15           ` NeilBrown
@ 2017-11-02 19:46             ` Chuck Lever
  2017-11-02 21:51               ` NeilBrown
  0 siblings, 1 reply; 36+ messages in thread
From: Chuck Lever @ 2017-11-02 19:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jeff Layton, Bruce Fields, Joshua Watt, Linux NFS Mailing List


> On Nov 1, 2017, at 8:15 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Tue, Oct 31 2017, Chuck Lever wrote:
> 
>>> On Oct 31, 2017, at 8:53 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>>> Maybe I could just sweep the problem under the carpet and use lazy
>>> unmounts.  That hides some of the problem, but doesn't stop sync(2) from
>>> blocking indefinitely.  And once you have done the lazy unmount, there
>>> is no longer any opportunity to use MNT_FORCE.
>> 
>> IMO a partial answer could be data caching in local files. If
>> the client can't flush, then it can preserve the files until
>> after the umount and reboot (using, say, fscache). Multi-client
>> sharing is still hazardous, but that isn't a very frequent use
>> case.
> 
> What data is it, exactly, that we are worried about here?
> Data that an application has written, but that it hasn't called fsync()
> on?  Do it isn't really all that important.  It might just be scratch
> data.  It might be event logs.  It certainly isn't committed database
> data, or an incoming email message, or data saved by an editor, or
> really anything else important.
> It is data that would be lost if you kicked the powerplug out by
> mistake.
> It is data that we would rather save if we could, but data that is not
> worth bending over backwards to keep a copy of in a non-standard
> location just in case someone really cares.
> 
> That's how I see it anyway.

The assumption here is that any data loss or corruption when a
mount is forcibly removed is strictly the responsibility of
inadequate application design. Fair enough.

One of my concerns (and I suspect Jeff is also worried about this
use case) is what to do when tearing down containers that have
stuck NFS mounts. Here, the host system is not being shut down,
but the guests need to be shutdown cleanly so they can be destroyed.

These containers may be sharing page cache data or a transport
with other users on the host. In this case, we are trying to
avoid a host shutdown to recover the stuck resources, and we
do have the (possibly unnecessary) luxury of having a human
being present to ask what to do to complete the recovery.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-02 19:46             ` Chuck Lever
@ 2017-11-02 21:51               ` NeilBrown
  0 siblings, 0 replies; 36+ messages in thread
From: NeilBrown @ 2017-11-02 21:51 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Bruce Fields, Joshua Watt, Linux NFS Mailing List

[-- Attachment #1: Type: text/plain, Size: 2936 bytes --]

On Thu, Nov 02 2017, Chuck Lever wrote:

>> On Nov 1, 2017, at 8:15 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Tue, Oct 31 2017, Chuck Lever wrote:
>> 
>>>> On Oct 31, 2017, at 8:53 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>>> Maybe I could just sweep the problem under the carpet and use lazy
>>>> unmounts.  That hides some of the problem, but doesn't stop sync(2) from
>>>> blocking indefinitely.  And once you have done the lazy unmount, there
>>>> is no longer any opportunity to use MNT_FORCE.
>>> 
>>> IMO a partial answer could be data caching in local files. If
>>> the client can't flush, then it can preserve the files until
>>> after the umount and reboot (using, say, fscache). Multi-client
>>> sharing is still hazardous, but that isn't a very frequent use
>>> case.
>> 
>> What data is it, exactly, that we are worried about here?
>> Data that an application has written, but that it hasn't called fsync()
>> on?  Do it isn't really all that important.  It might just be scratch
>> data.  It might be event logs.  It certainly isn't committed database
>> data, or an incoming email message, or data saved by an editor, or
>> really anything else important.
>> It is data that would be lost if you kicked the powerplug out by
>> mistake.
>> It is data that we would rather save if we could, but data that is not
>> worth bending over backwards to keep a copy of in a non-standard
>> location just in case someone really cares.
>> 
>> That's how I see it anyway.
>
> The assumption here is that any data loss or corruption when a
> mount is forcibly removed is strictly the responsibility of
> inadequate application design. Fair enough.
>
> One of my concerns (and I suspect Jeff is also worried about this
> use case) is what to do when tearing down containers that have
> stuck NFS mounts. Here, the host system is not being shut down,
> but the guests need to be shutdown cleanly so they can be destroyed.

Container-shutdown is a strong argument that MNT_FORCE is not enough and
that we need SIGKILL to actually remove the process.

Currently if there is pending dirty data on the final unmount of an NFS
filesystem, the superblock hangs around until the data is gone.  This
is good for shutting down containers, but it is a little weird - totally
different to every other filesystem.
If NFS were changed so that unmount blocked on the last unmount when
there is dirty data - just like every other filesystem - then we
probably need a way to "force" that unmount even in a container.

Thanks,
NeilBrown


>
> These containers may be sharing page cache data or a transport
> with other users on the host. In this case, we are trying to
> avoid a host shutdown to recover the stuck resources, and we
> do have the (possibly unnecessary) luxury of having a human
> being present to ask what to do to complete the recovery.
>
>
> --
> Chuck Lever

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-02 14:54           ` Joshua Watt
@ 2017-11-08  3:30             ` NeilBrown
  2017-11-08 12:08               ` Jeff Layton
  2017-11-08 14:59             ` [RFC 0/4] " Joshua Watt
  1 sibling, 1 reply; 36+ messages in thread
From: NeilBrown @ 2017-11-08  3:30 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust
  Cc: J. Bruce Fields, Linux NFS Mailing List, Al Viro, David Howells

[-- Attachment #1: Type: text/plain, Size: 3721 bytes --]


What to people think of the following as an approach
to Joshua's need?

It isn't complete by itself: it needs a couple of changes to
nfs-utils so that it doesn't stat the mountpoint on remount,
and it might need another kernel change so that the "mount" system
call performs the same sort of careful lookup for remount as  the umount
system call does, but those are relatively small details.

This is the patch that you will either love of hate.

With this patch, Joshua (or any other sysadmin) could:

  mount -o remount,retrans=0,timeo=1 /path

and then new requests on any mountpoint from that server will timeout
quickly.
Then
  umount -f /path
  umount -f /path

should kill off any existing requests that use the old timeout. (I just
tested and I did need this twice to kill of an "ls -l".
The first was an 'open' systemcall, then next as 'newfstat'. I wonder
why the getattr for fstat didn't use the new timeout...)

Thoughts?
NeilBrown

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index c9d24bae3025..ced12fcec349 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2210,27 +2210,39 @@ static int nfs_validate_text_mount_data(void *options,
 		~(NFS_MOUNT_UNSHARED | NFS_MOUNT_NORESVPORT))
 
 static int
-nfs_compare_remount_data(struct nfs_server *nfss,
-			 struct nfs_parsed_mount_data *data)
+nfs_compare_and_set_remount_data(struct nfs_server *nfss,
+				 struct nfs_parsed_mount_data *data)
 {
 	if ((data->flags ^ nfss->flags) & NFS_REMOUNT_CMP_FLAGMASK ||
 	    data->rsize != nfss->rsize ||
 	    data->wsize != nfss->wsize ||
 	    data->version != nfss->nfs_client->rpc_ops->version ||
 	    data->minorversion != nfss->nfs_client->cl_minorversion ||
-	    data->retrans != nfss->client->cl_timeout->to_retries ||
 	    !nfs_auth_info_match(&data->auth_info, nfss->client->cl_auth->au_flavor) ||
 	    data->acregmin != nfss->acregmin / HZ ||
 	    data->acregmax != nfss->acregmax / HZ ||
 	    data->acdirmin != nfss->acdirmin / HZ ||
 	    data->acdirmax != nfss->acdirmax / HZ ||
-	    data->timeo != (10U * nfss->client->cl_timeout->to_initval / HZ) ||
 	    data->nfs_server.port != nfss->port ||
 	    data->nfs_server.addrlen != nfss->nfs_client->cl_addrlen ||
 	    !rpc_cmp_addr((struct sockaddr *)&data->nfs_server.address,
 			  (struct sockaddr *)&nfss->nfs_client->cl_addr))
 		return -EINVAL;
 
+	if (data->retrans != nfss->client->cl_timeout->to_retries ||
+	    data->timeo != (10U * nfss->client->cl_timeout->to_initval / HZ)) {
+		/* Note that this will affect all mounts from the same server,
+		 * that use the same protocol.  The timeouts are always forced
+		 * to be the same.
+		 */
+		struct rpc_clnt *cl = nfss->client;
+		if (cl->cl_timeout != &cl->cl_timeout_default)
+			memcpy(&cl->cl_timeout_default, cl->cl_timeout,
+			       sizeof(struct rpc_timeout));
+		cl->cl_timeout_default.to_retries = data->retrans;
+		cl->cl_timeout_default.to_initval = data->timeo * HZ / 10U;
+	}
+
 	return 0;
 }
 
@@ -2244,7 +2256,8 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data)
 	struct nfs4_mount_data *options4 = (struct nfs4_mount_data *)raw_data;
 	u32 nfsvers = nfss->nfs_client->rpc_ops->version;
 
-	sync_filesystem(sb);
+	if (sb->s_readonly_remount)
+		sync_filesystem(sb);
 
 	/*
 	 * Userspace mount programs that send binary options generally send
@@ -2295,7 +2308,7 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data)
 		*flags |= MS_SYNCHRONOUS;
 
 	/* compare new mount options with old ones */
-	error = nfs_compare_remount_data(nfss, data);
+	error = nfs_compare_and_set_remount_data(nfss, data);
 out:
 	kfree(data);
 	return error;

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-08  3:30             ` NeilBrown
@ 2017-11-08 12:08               ` Jeff Layton
  2017-11-08 15:52                 ` J. Bruce Fields
  0 siblings, 1 reply; 36+ messages in thread
From: Jeff Layton @ 2017-11-08 12:08 UTC (permalink / raw)
  To: NeilBrown, Joshua Watt, Trond Myklebust
  Cc: J. Bruce Fields, Linux NFS Mailing List, Al Viro, David Howells

On Wed, 2017-11-08 at 14:30 +1100, NeilBrown wrote:
> What to people think of the following as an approach
> to Joshua's need?
> 
> It isn't complete by itself: it needs a couple of changes to
> nfs-utils so that it doesn't stat the mountpoint on remount,
> and it might need another kernel change so that the "mount" system
> call performs the same sort of careful lookup for remount as  the umount
> system call does, but those are relatively small details.
> 

Yeah, that'd be good.

> This is the patch that you will either love of hate.
> 
> With this patch, Joshua (or any other sysadmin) could:
> 
>   mount -o remount,retrans=0,timeo=1 /path
> 
> and then new requests on any mountpoint from that server will timeout
> quickly.
> Then
>   umount -f /path
>   umount -f /path
> 
> should kill off any existing requests that use the old timeout. (I just
> tested and I did need this twice to kill of an "ls -l".
> The first was an 'open' systemcall, then next as 'newfstat'. I wonder
> why the getattr for fstat didn't use the new timeout...)
> 
> Thoughts?
> NeilBrown
> 
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index c9d24bae3025..ced12fcec349 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -2210,27 +2210,39 @@ static int nfs_validate_text_mount_data(void *options,
>  		~(NFS_MOUNT_UNSHARED | NFS_MOUNT_NORESVPORT))
>  
>  static int
> -nfs_compare_remount_data(struct nfs_server *nfss,
> -			 struct nfs_parsed_mount_data *data)
> +nfs_compare_and_set_remount_data(struct nfs_server *nfss,
> +				 struct nfs_parsed_mount_data *data)
>  {
>  	if ((data->flags ^ nfss->flags) & NFS_REMOUNT_CMP_FLAGMASK ||
>  	    data->rsize != nfss->rsize ||
>  	    data->wsize != nfss->wsize ||
>  	    data->version != nfss->nfs_client->rpc_ops->version ||
>  	    data->minorversion != nfss->nfs_client->cl_minorversion ||
> -	    data->retrans != nfss->client->cl_timeout->to_retries ||
>  	    !nfs_auth_info_match(&data->auth_info, nfss->client->cl_auth->au_flavor) ||
>  	    data->acregmin != nfss->acregmin / HZ ||
>  	    data->acregmax != nfss->acregmax / HZ ||
>  	    data->acdirmin != nfss->acdirmin / HZ ||
>  	    data->acdirmax != nfss->acdirmax / HZ ||
> -	    data->timeo != (10U * nfss->client->cl_timeout->to_initval / HZ) ||
>  	    data->nfs_server.port != nfss->port ||
>  	    data->nfs_server.addrlen != nfss->nfs_client->cl_addrlen ||
>  	    !rpc_cmp_addr((struct sockaddr *)&data->nfs_server.address,
>  			  (struct sockaddr *)&nfss->nfs_client->cl_addr))
>  		return -EINVAL;
>  
> +	if (data->retrans != nfss->client->cl_timeout->to_retries ||
> +	    data->timeo != (10U * nfss->client->cl_timeout->to_initval / HZ)) {
> +		/* Note that this will affect all mounts from the same server,
> +		 * that use the same protocol.  The timeouts are always forced
> +		 * to be the same.
> +		 */
> +		struct rpc_clnt *cl = nfss->client;
> +		if (cl->cl_timeout != &cl->cl_timeout_default)
> +			memcpy(&cl->cl_timeout_default, cl->cl_timeout,
> +			       sizeof(struct rpc_timeout));
> +		cl->cl_timeout_default.to_retries = data->retrans;
> +		cl->cl_timeout_default.to_initval = data->timeo * HZ / 10U;
> +	}
> +

No objection to allowing more mount options to be changed on remount, we
really ought to allow a lot more options to be changed on remount if we
can reasonably pull it off.

>  	return 0;
>  }
>  
> @@ -2244,7 +2256,8 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data)
>  	struct nfs4_mount_data *options4 = (struct nfs4_mount_data *)raw_data;
>  	u32 nfsvers = nfss->nfs_client->rpc_ops->version;
>  
> -	sync_filesystem(sb);
> +	if (sb->s_readonly_remount)
> +		sync_filesystem(sb);
>  
>  	/*
>  	 * Userspace mount programs that send binary options generally send
> @@ -2295,7 +2308,7 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data)
>  		*flags |= MS_SYNCHRONOUS;
>  
>  	/* compare new mount options with old ones */
> -	error = nfs_compare_remount_data(nfss, data);
> +	error = nfs_compare_and_set_remount_data(nfss, data);
>  out:
>  	kfree(data);
>  	return error;

Looks like a reasonable approach overall to preventing new RPCs from
being dispatched once the "force" umount runs.

I do wonder if this ought to be more automatic when you specify -f on
the umount. Having to manually do a remount first doesn't seem very
admin-friendly.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC 0/4] Re: NFS Force Unmounting
  2017-11-02 14:54           ` Joshua Watt
  2017-11-08  3:30             ` NeilBrown
@ 2017-11-08 14:59             ` Joshua Watt
  2017-11-08 14:59               ` [RFC 1/4] SUNRPC: Add flag to kill new tasks Joshua Watt
                                 ` (3 more replies)
  1 sibling, 4 replies; 36+ messages in thread
From: Joshua Watt @ 2017-11-08 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

This was my attempt at solving the force unmounting problem, which I
though I would post for discussion (as sort of a alternate to Neil's
patch).

I think there are a few advantages to this approach:
 1) "umount -f" pretty much always does what you want if you have
    "forcekill" in your mount options
 2) It is easy to separate out at an administrative level what mounts
    will be affected by "umount -f", because the "forcekill" mount
    option affects the decision as to whether the NFS superblock can be
    shared (e.g. only mounts to the same client with the same state of
    the forcekill flag are shared, and thus affected by umount -f)
 3) I think this will catch all the different types of RPCs
    (asynchronous, etc.) for a given client?

And a few disadvantages:
 1) It requires a new mount option (maybe there is another way? A mount
    option was all I could figure to prevent superblock sharing).
 2) It doesn't currently affect the behavior of the DESTROY_SESSION RPC
    call made by the nfs_client when the last superblock reference is
    removed. This means that this call must go through the normal
    recovery and timeout process. I am OK with this behavior (for my use
    case at least) because that is *only* going to block the umount
    caller for a (bounded) amount of time.
 3) Probably more I haven't though of...

I don't claim to be a Linux Kernel NFS or filesystem expert, so I will
leave the decisions to those who have a better understanding.

Also, this is my first (non-trivial) Linux kernel patch, so any sort of
review and comments are greatly appreciated.

Thanks!

Joshua Watt (4):
  SUNRPC: Add flag to kill new tasks
  SUNRPC: Kill client tasks from debugfs
  SUNRPC: Simplify client shutdown
  NFS: Add forcekill mount option

 fs/nfs/super.c                 | 17 ++++++++++++---
 include/linux/sunrpc/clnt.h    |  1 +
 include/linux/sunrpc/sched.h   |  2 +-
 include/uapi/linux/nfs_mount.h |  1 +
 net/sunrpc/clnt.c              | 13 +++++++++---
 net/sunrpc/debugfs.c           | 48 ++++++++++++++++++++++++++++++++++++++++++
 net/sunrpc/sched.c             |  7 ++++++
 7 files changed, 82 insertions(+), 7 deletions(-)

-- 
2.13.6


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC 1/4] SUNRPC: Add flag to kill new tasks
  2017-11-08 14:59             ` [RFC 0/4] " Joshua Watt
@ 2017-11-08 14:59               ` Joshua Watt
  2017-11-10  1:39                 ` NeilBrown
  2017-11-08 14:59               ` [RFC 2/4] SUNRPC: Kill client tasks from debugfs Joshua Watt
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Joshua Watt @ 2017-11-08 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

The new flag to rpc_killall_tasks() causes new tasks that are killed
after to call to be killed immediately instead of being executed. This
will all clients (particularly NFS) to prevents these task from delaying
shutdown of the RPC session longer than necessary.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/nfs/super.c               |  4 ++--
 include/linux/sunrpc/clnt.h  |  1 +
 include/linux/sunrpc/sched.h |  2 +-
 net/sunrpc/clnt.c            | 13 ++++++++++---
 net/sunrpc/sched.c           |  7 +++++++
 5 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 216f67d628b3..66fda2dcadd0 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -903,10 +903,10 @@ void nfs_umount_begin(struct super_block *sb)
 	/* -EIO all pending I/O */
 	rpc = server->client_acl;
 	if (!IS_ERR(rpc))
-		rpc_killall_tasks(rpc);
+		rpc_killall_tasks(rpc, 0);
 	rpc = server->client;
 	if (!IS_ERR(rpc))
-		rpc_killall_tasks(rpc);
+		rpc_killall_tasks(rpc, 0);
 }
 EXPORT_SYMBOL_GPL(nfs_umount_begin);
 
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 71c237e8240e..94f4e464de1b 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -54,6 +54,7 @@ struct rpc_clnt {
 				cl_noretranstimeo: 1,/* No retransmit timeouts */
 				cl_autobind : 1,/* use getport() */
 				cl_chatty   : 1;/* be verbose */
+	bool			cl_kill_new_tasks;	/* Kill all new tasks */
 
 	struct rpc_rtt *	cl_rtt;		/* RTO estimator data */
 	const struct rpc_timeout *cl_timeout;	/* Timeout strategy */
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index d96e74e114c0..9b8bebaee0f4 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -218,7 +218,7 @@ void		rpc_put_task_async(struct rpc_task *);
 void		rpc_exit_task(struct rpc_task *);
 void		rpc_exit(struct rpc_task *, int);
 void		rpc_release_calldata(const struct rpc_call_ops *, void *);
-void		rpc_killall_tasks(struct rpc_clnt *);
+void		rpc_killall_tasks(struct rpc_clnt *clnt, int kill_new_tasks);
 void		rpc_execute(struct rpc_task *);
 void		rpc_init_priority_wait_queue(struct rpc_wait_queue *, const char *);
 void		rpc_init_wait_queue(struct rpc_wait_queue *, const char *);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index df4ecb042ebe..70005252b32f 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -814,14 +814,21 @@ EXPORT_SYMBOL_GPL(rpc_clnt_iterate_for_each_xprt);
  * Kill all tasks for the given client.
  * XXX: kill their descendants as well?
  */
-void rpc_killall_tasks(struct rpc_clnt *clnt)
+void rpc_killall_tasks(struct rpc_clnt *clnt, int kill_new_tasks)
 {
 	struct rpc_task	*rovr;
 
+	dprintk("RPC:       killing all tasks for client %p\n", clnt);
+
+	if (kill_new_tasks) {
+		WRITE_ONCE(clnt->cl_kill_new_tasks, true);
+
+		/* Ensure the kill flag is set before checking the task list */
+		smp_mb();
+	}
 
 	if (list_empty(&clnt->cl_tasks))
 		return;
-	dprintk("RPC:       killing all tasks for client %p\n", clnt);
 	/*
 	 * Spin lock all_tasks to prevent changes...
 	 */
@@ -854,7 +861,7 @@ void rpc_shutdown_client(struct rpc_clnt *clnt)
 			rcu_dereference(clnt->cl_xprt)->servername);
 
 	while (!list_empty(&clnt->cl_tasks)) {
-		rpc_killall_tasks(clnt);
+		rpc_killall_tasks(clnt, 0);
 		wait_event_timeout(destroy_wait,
 			list_empty(&clnt->cl_tasks), 1*HZ);
 	}
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 0cc83839c13c..c9bf1ab9e941 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -748,6 +748,13 @@ static void __rpc_execute(struct rpc_task *task)
 	dprintk("RPC: %5u __rpc_execute flags=0x%x\n",
 			task->tk_pid, task->tk_flags);
 
+	/* Ensure the task is in the client task list before checking the kill
+	 * flag
+	 */
+	smp_mb();
+	if (READ_ONCE(task->tk_client->cl_kill_new_tasks))
+		rpc_exit(task, -EIO);
+
 	WARN_ON_ONCE(RPC_IS_QUEUED(task));
 	if (RPC_IS_QUEUED(task))
 		return;
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC 2/4] SUNRPC: Kill client tasks from debugfs
  2017-11-08 14:59             ` [RFC 0/4] " Joshua Watt
  2017-11-08 14:59               ` [RFC 1/4] SUNRPC: Add flag to kill new tasks Joshua Watt
@ 2017-11-08 14:59               ` Joshua Watt
  2017-11-10  1:47                 ` NeilBrown
  2017-11-08 14:59               ` [RFC 3/4] SUNRPC: Simplify client shutdown Joshua Watt
  2017-11-08 14:59               ` [RFC 4/4] NFS: Add forcekill mount option Joshua Watt
  3 siblings, 1 reply; 36+ messages in thread
From: Joshua Watt @ 2017-11-08 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

The "kill" client entry in debugfs can be used to kill client tasks.
Writing "0" causes only the currently pending tasks to be killed, while
writing "1" all currently pending, and any future pending tasks to be
killed.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 net/sunrpc/debugfs.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
index e980d2a493de..a2f33d168873 100644
--- a/net/sunrpc/debugfs.c
+++ b/net/sunrpc/debugfs.c
@@ -118,6 +118,50 @@ static const struct file_operations tasks_fops = {
 	.release	= tasks_release,
 };
 
+static int
+kill_set(void *data, u64 val)
+{
+	struct rpc_clnt *clnt = data;
+
+	rpc_killall_tasks(clnt, (int)val);
+	return 0;
+}
+
+static int
+kill_open(struct inode *inode, struct file *filp)
+{
+	int ret = simple_attr_open(inode, filp, NULL, kill_set, "%i");
+
+	if (!ret) {
+		struct rpc_clnt *clnt = inode->i_private;
+
+		if (!atomic_inc_not_zero(&clnt->cl_count)) {
+			simple_attr_release(inode, filp);
+			ret = -EINVAL;
+		}
+	}
+
+	return ret;
+}
+
+static int
+kill_release(struct inode *inode, struct file *filp)
+{
+	struct rpc_clnt *clnt = inode->i_private;
+
+	rpc_release_client(clnt);
+
+	return simple_attr_release(inode, filp);
+}
+
+static const struct file_operations kill_fops = {
+	.owner	 = THIS_MODULE,
+	.open	 = kill_open,
+	.release = kill_release,
+	.write	 = debugfs_attr_write,
+	.llseek  = no_llseek,
+};
+
 void
 rpc_clnt_debugfs_register(struct rpc_clnt *clnt)
 {
@@ -160,6 +204,10 @@ rpc_clnt_debugfs_register(struct rpc_clnt *clnt)
 	if (!debugfs_create_symlink("xprt", clnt->cl_debugfs, name))
 		goto out_err;
 
+	if (!debugfs_create_file("kill", S_IFREG | 0200, clnt->cl_debugfs,
+				 clnt, &kill_fops))
+		goto out_err;
+
 	return;
 out_err:
 	debugfs_remove_recursive(clnt->cl_debugfs);
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC 3/4] SUNRPC: Simplify client shutdown
  2017-11-08 14:59             ` [RFC 0/4] " Joshua Watt
  2017-11-08 14:59               ` [RFC 1/4] SUNRPC: Add flag to kill new tasks Joshua Watt
  2017-11-08 14:59               ` [RFC 2/4] SUNRPC: Kill client tasks from debugfs Joshua Watt
@ 2017-11-08 14:59               ` Joshua Watt
  2017-11-10  1:50                 ` NeilBrown
  2017-11-08 14:59               ` [RFC 4/4] NFS: Add forcekill mount option Joshua Watt
  3 siblings, 1 reply; 36+ messages in thread
From: Joshua Watt @ 2017-11-08 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

Use the flag to kill all new tasks when shutting down instead of
repeatedly killing all the pending tasks.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 net/sunrpc/clnt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 70005252b32f..8d51e0788a63 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -860,8 +860,8 @@ void rpc_shutdown_client(struct rpc_clnt *clnt)
 			clnt->cl_program->name,
 			rcu_dereference(clnt->cl_xprt)->servername);
 
+	rpc_killall_tasks(clnt, 1);
 	while (!list_empty(&clnt->cl_tasks)) {
-		rpc_killall_tasks(clnt, 0);
 		wait_event_timeout(destroy_wait,
 			list_empty(&clnt->cl_tasks), 1*HZ);
 	}
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC 4/4] NFS: Add forcekill mount option
  2017-11-08 14:59             ` [RFC 0/4] " Joshua Watt
                                 ` (2 preceding siblings ...)
  2017-11-08 14:59               ` [RFC 3/4] SUNRPC: Simplify client shutdown Joshua Watt
@ 2017-11-08 14:59               ` Joshua Watt
  2017-11-10  2:01                 ` NeilBrown
  3 siblings, 1 reply; 36+ messages in thread
From: Joshua Watt @ 2017-11-08 14:59 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

Specifying the forcekill mount option causes umount() with the MNT_FORCE
flag to not only kill all currently pending RPC tasks with -EIO, but
also all future RPC tasks. This prevents the long delays caused by tasks
queuing after rpc_killall_tasks() that can occur if the server drops off
the network.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/nfs/super.c                 | 17 ++++++++++++++---
 include/uapi/linux/nfs_mount.h |  1 +
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 66fda2dcadd0..d972f6289aca 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -90,6 +90,7 @@ enum {
 	Opt_resvport, Opt_noresvport,
 	Opt_fscache, Opt_nofscache,
 	Opt_migration, Opt_nomigration,
+	Opt_forcekill, Opt_noforcekill,
 
 	/* Mount options that take integer arguments */
 	Opt_port,
@@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
 	{ Opt_nofscache, "nofsc" },
 	{ Opt_migration, "migration" },
 	{ Opt_nomigration, "nomigration" },
+	{ Opt_forcekill, "forcekill" },
+	{ Opt_noforcekill, "noforcekill" },
 
 	{ Opt_port, "port=%s" },
 	{ Opt_rsize, "rsize=%s" },
@@ -637,6 +640,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
 		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
 		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
 		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
+		{ NFS_MOUNT_FORCEKILL, ",forcekill", ",noforcekill" },
 		{ 0, NULL, NULL }
 	};
 	const struct proc_nfs_info *nfs_infop;
@@ -896,17 +900,18 @@ EXPORT_SYMBOL_GPL(nfs_show_stats);
  */
 void nfs_umount_begin(struct super_block *sb)
 {
-	struct nfs_server *server;
+	struct nfs_server *server = NFS_SB(sb);
 	struct rpc_clnt *rpc;
+	int kill_new_tasks = !!(server->flags & NFS_MOUNT_FORCEKILL);
 
 	server = NFS_SB(sb);
 	/* -EIO all pending I/O */
 	rpc = server->client_acl;
 	if (!IS_ERR(rpc))
-		rpc_killall_tasks(rpc, 0);
+		rpc_killall_tasks(rpc, kill_new_tasks);
 	rpc = server->client;
 	if (!IS_ERR(rpc))
-		rpc_killall_tasks(rpc, 0);
+		rpc_killall_tasks(rpc, kill_new_tasks);
 }
 EXPORT_SYMBOL_GPL(nfs_umount_begin);
 
@@ -1334,6 +1339,12 @@ static int nfs_parse_mount_options(char *raw,
 		case Opt_nomigration:
 			mnt->options &= ~NFS_OPTION_MIGRATION;
 			break;
+		case Opt_forcekill:
+			mnt->flags |= NFS_MOUNT_FORCEKILL;
+			break;
+		case Opt_noforcekill:
+			mnt->flags &= ~NFS_MOUNT_FORCEKILL;
+			break;
 
 		/*
 		 * options that take numeric values
diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
index e44e00616ab5..66821542a38f 100644
--- a/include/uapi/linux/nfs_mount.h
+++ b/include/uapi/linux/nfs_mount.h
@@ -74,5 +74,6 @@ struct nfs_mount_data {
 
 #define NFS_MOUNT_LOCAL_FLOCK	0x100000
 #define NFS_MOUNT_LOCAL_FCNTL	0x200000
+#define NFS_MOUNT_FORCEKILL	0x400000
 
 #endif
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-08 12:08               ` Jeff Layton
@ 2017-11-08 15:52                 ` J. Bruce Fields
  2017-11-08 22:34                   ` NeilBrown
  0 siblings, 1 reply; 36+ messages in thread
From: J. Bruce Fields @ 2017-11-08 15:52 UTC (permalink / raw)
  To: Jeff Layton
  Cc: NeilBrown, Joshua Watt, Trond Myklebust, Linux NFS Mailing List,
	Al Viro, David Howells

On Wed, Nov 08, 2017 at 07:08:25AM -0500, Jeff Layton wrote:
> On Wed, 2017-11-08 at 14:30 +1100, NeilBrown wrote:
> > What to people think of the following as an approach
> > to Joshua's need?
> > 
> > It isn't complete by itself: it needs a couple of changes to
> > nfs-utils so that it doesn't stat the mountpoint on remount,
> > and it might need another kernel change so that the "mount" system
> > call performs the same sort of careful lookup for remount as  the umount
> > system call does, but those are relatively small details.
> > 
> 
> Yeah, that'd be good.
> 
> > This is the patch that you will either love of hate.
> > 
> > With this patch, Joshua (or any other sysadmin) could:
> > 
> >   mount -o remount,retrans=0,timeo=1 /path
> > 
> > and then new requests on any mountpoint from that server will timeout
> > quickly.
> > Then
> >   umount -f /path
> >   umount -f /path
...
> Looks like a reasonable approach overall to preventing new RPCs from
> being dispatched once the "force" umount runs.

I've lost track of the discussion--after this patch, how close are we to
a guaranteed force unmount?  I assume there are still a few obstacles.

> I do wonder if this ought to be more automatic when you specify -f on
> the umount. Having to manually do a remount first doesn't seem very
> admin-friendly.

It's an odd interface.  Maybe we could wrap it in something more
intuitive.

I'd be nervous about making "umount -f" do it.  I think administrators
could be unpleasantly surprised in some cases if an "umount -f" affects
other mounts of the same server.

--b.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-08 15:52                 ` J. Bruce Fields
@ 2017-11-08 22:34                   ` NeilBrown
  2017-11-08 23:52                     ` Trond Myklebust
  0 siblings, 1 reply; 36+ messages in thread
From: NeilBrown @ 2017-11-08 22:34 UTC (permalink / raw)
  To: J. Bruce Fields, Jeff Layton
  Cc: Joshua Watt, Trond Myklebust, Linux NFS Mailing List, Al Viro,
	David Howells

[-- Attachment #1: Type: text/plain, Size: 3219 bytes --]

On Wed, Nov 08 2017, J. Bruce Fields wrote:

> On Wed, Nov 08, 2017 at 07:08:25AM -0500, Jeff Layton wrote:
>> On Wed, 2017-11-08 at 14:30 +1100, NeilBrown wrote:
>> > What to people think of the following as an approach
>> > to Joshua's need?
>> > 
>> > It isn't complete by itself: it needs a couple of changes to
>> > nfs-utils so that it doesn't stat the mountpoint on remount,
>> > and it might need another kernel change so that the "mount" system
>> > call performs the same sort of careful lookup for remount as  the umount
>> > system call does, but those are relatively small details.
>> > 
>> 
>> Yeah, that'd be good.
>> 
>> > This is the patch that you will either love of hate.
>> > 
>> > With this patch, Joshua (or any other sysadmin) could:
>> > 
>> >   mount -o remount,retrans=0,timeo=1 /path
>> > 
>> > and then new requests on any mountpoint from that server will timeout
>> > quickly.
>> > Then
>> >   umount -f /path
>> >   umount -f /path
> ...
>> Looks like a reasonable approach overall to preventing new RPCs from
>> being dispatched once the "force" umount runs.
>
> I've lost track of the discussion--after this patch, how close are we to
> a guaranteed force unmount?  I assume there are still a few obstacles.

This isn't really about forced unmount.
The way forward to forced unmount it:
 - make all waits on NFS be TASK_KILLABLE
 - figure out what happens to dirty data when all processes have
   been killed.

This is about allowing processes to be told that the filesystem is dead
so that can respond (without blocking indefinitely) without
necessarily being killed.
With a local filesystem you can (in some cases) kill the underlying
device and all processes will start getting EIO.  This is providing
similar functionality for NFS.

>
>> I do wonder if this ought to be more automatic when you specify -f on
>> the umount. Having to manually do a remount first doesn't seem very
>> admin-friendly.
>
> It's an odd interface.  Maybe we could wrap it in something more
> intuitive.
>
> I'd be nervous about making "umount -f" do it.  I think administrators
> could be unpleasantly surprised in some cases if an "umount -f" affects
> other mounts of the same server.

I was all set to tell you that it already does, but then tested and
found it doesn't and ....

struct nfs_server (which sb->s_fs_info points to) contains

	struct nfs_client *	nfs_client;	/* shared client and NFS4 state */

which is shared between different mounts from the same server, and

	struct rpc_clnt *	client;		/* RPC client handle */

which isn't shared.
struct nfs_client contains
	struct rpc_clnt *	cl_rpcclient;

which server->client is clones from.

The timeouts that apply to a mount are the ones from server->client,
and so apply only to that mount (I thought they were shared, but that is
a thought from years ago, and maybe it was wrong at the time).
umount_begin aborts all rpcs associated with server->client.

So the 'remount,retrans=0,timeo=1' that I propose would only affect the
one superblock (all bind-mounts of course, included sharecache mounts).

The comment in my code was wrong.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-08 22:34                   ` NeilBrown
@ 2017-11-08 23:52                     ` Trond Myklebust
  2017-11-09 19:48                       ` Joshua Watt
  0 siblings, 1 reply; 36+ messages in thread
From: Trond Myklebust @ 2017-11-08 23:52 UTC (permalink / raw)
  To: bfields, jlayton, neilb; +Cc: jpewhacker, viro, linux-nfs, dhowells

T24gVGh1LCAyMDE3LTExLTA5IGF0IDA5OjM0ICsxMTAwLCBOZWlsQnJvd24gd3JvdGU6DQo+IE9u
IFdlZCwgTm92IDA4IDIwMTcsIEouIEJydWNlIEZpZWxkcyB3cm90ZToNCj4gDQo+ID4gT24gV2Vk
LCBOb3YgMDgsIDIwMTcgYXQgMDc6MDg6MjVBTSAtMDUwMCwgSmVmZiBMYXl0b24gd3JvdGU6DQo+
ID4gPiBPbiBXZWQsIDIwMTctMTEtMDggYXQgMTQ6MzAgKzExMDAsIE5laWxCcm93biB3cm90ZToN
Cj4gPiA+ID4gV2hhdCB0byBwZW9wbGUgdGhpbmsgb2YgdGhlIGZvbGxvd2luZyBhcyBhbiBhcHBy
b2FjaA0KPiA+ID4gPiB0byBKb3NodWEncyBuZWVkPw0KPiA+ID4gPiANCj4gPiA+ID4gSXQgaXNu
J3QgY29tcGxldGUgYnkgaXRzZWxmOiBpdCBuZWVkcyBhIGNvdXBsZSBvZiBjaGFuZ2VzIHRvDQo+
ID4gPiA+IG5mcy11dGlscyBzbyB0aGF0IGl0IGRvZXNuJ3Qgc3RhdCB0aGUgbW91bnRwb2ludCBv
biByZW1vdW50LA0KPiA+ID4gPiBhbmQgaXQgbWlnaHQgbmVlZCBhbm90aGVyIGtlcm5lbCBjaGFu
Z2Ugc28gdGhhdCB0aGUgIm1vdW50Ig0KPiA+ID4gPiBzeXN0ZW0NCj4gPiA+ID4gY2FsbCBwZXJm
b3JtcyB0aGUgc2FtZSBzb3J0IG9mIGNhcmVmdWwgbG9va3VwIGZvciByZW1vdW50DQo+ID4gPiA+
IGFzICB0aGUgdW1vdW50DQo+ID4gPiA+IHN5c3RlbSBjYWxsIGRvZXMsIGJ1dCB0aG9zZSBhcmUg
cmVsYXRpdmVseSBzbWFsbCBkZXRhaWxzLg0KPiA+ID4gPiANCj4gPiA+IA0KPiA+ID4gWWVhaCwg
dGhhdCdkIGJlIGdvb2QuDQo+ID4gPiANCj4gPiA+ID4gVGhpcyBpcyB0aGUgcGF0Y2ggdGhhdCB5
b3Ugd2lsbCBlaXRoZXIgbG92ZSBvZiBoYXRlLg0KPiA+ID4gPiANCj4gPiA+ID4gV2l0aCB0aGlz
IHBhdGNoLCBKb3NodWEgKG9yIGFueSBvdGhlciBzeXNhZG1pbikgY291bGQ6DQo+ID4gPiA+IA0K
PiA+ID4gPiAgIG1vdW50IC1vIHJlbW91bnQscmV0cmFucz0wLHRpbWVvPTEgL3BhdGgNCj4gPiA+
ID4gDQo+ID4gPiA+IGFuZCB0aGVuIG5ldyByZXF1ZXN0cyBvbiBhbnkgbW91bnRwb2ludCBmcm9t
IHRoYXQgc2VydmVyIHdpbGwNCj4gPiA+ID4gdGltZW91dA0KPiA+ID4gPiBxdWlja2x5Lg0KPiA+
ID4gPiBUaGVuDQo+ID4gPiA+ICAgdW1vdW50IC1mIC9wYXRoDQo+ID4gPiA+ICAgdW1vdW50IC1m
IC9wYXRoDQo+ID4gDQo+ID4gLi4uDQo+ID4gPiBMb29rcyBsaWtlIGEgcmVhc29uYWJsZSBhcHBy
b2FjaCBvdmVyYWxsIHRvIHByZXZlbnRpbmcgbmV3IFJQQ3MNCj4gPiA+IGZyb20NCj4gPiA+IGJl
aW5nIGRpc3BhdGNoZWQgb25jZSB0aGUgImZvcmNlIiB1bW91bnQgcnVucy4NCj4gPiANCj4gPiBJ
J3ZlIGxvc3QgdHJhY2sgb2YgdGhlIGRpc2N1c3Npb24tLWFmdGVyIHRoaXMgcGF0Y2gsIGhvdyBj
bG9zZSBhcmUNCj4gPiB3ZSB0bw0KPiA+IGEgZ3VhcmFudGVlZCBmb3JjZSB1bm1vdW50PyAgSSBh
c3N1bWUgdGhlcmUgYXJlIHN0aWxsIGEgZmV3DQo+ID4gb2JzdGFjbGVzLg0KPiANCj4gVGhpcyBp
c24ndCByZWFsbHkgYWJvdXQgZm9yY2VkIHVubW91bnQuDQo+IFRoZSB3YXkgZm9yd2FyZCB0byBm
b3JjZWQgdW5tb3VudCBpdDoNCj4gIC0gbWFrZSBhbGwgd2FpdHMgb24gTkZTIGJlIFRBU0tfS0lM
TEFCTEUNCj4gIC0gZmlndXJlIG91dCB3aGF0IGhhcHBlbnMgdG8gZGlydHkgZGF0YSB3aGVuIGFs
bCBwcm9jZXNzZXMgaGF2ZQ0KPiAgICBiZWVuIGtpbGxlZC4NCj4gDQo+IFRoaXMgaXMgYWJvdXQg
YWxsb3dpbmcgcHJvY2Vzc2VzIHRvIGJlIHRvbGQgdGhhdCB0aGUgZmlsZXN5c3RlbSBpcw0KPiBk
ZWFkDQo+IHNvIHRoYXQgY2FuIHJlc3BvbmQgKHdpdGhvdXQgYmxvY2tpbmcgaW5kZWZpbml0ZWx5
KSB3aXRob3V0DQo+IG5lY2Vzc2FyaWx5IGJlaW5nIGtpbGxlZC4NCj4gV2l0aCBhIGxvY2FsIGZp
bGVzeXN0ZW0geW91IGNhbiAoaW4gc29tZSBjYXNlcykga2lsbCB0aGUgdW5kZXJseWluZw0KPiBk
ZXZpY2UgYW5kIGFsbCBwcm9jZXNzZXMgd2lsbCBzdGFydCBnZXR0aW5nIEVJTy4gIFRoaXMgaXMg
cHJvdmlkaW5nDQo+IHNpbWlsYXIgZnVuY3Rpb25hbGl0eSBmb3IgTkZTLg0KPiANCj4gPiANCj4g
PiA+IEkgZG8gd29uZGVyIGlmIHRoaXMgb3VnaHQgdG8gYmUgbW9yZSBhdXRvbWF0aWMgd2hlbiB5
b3Ugc3BlY2lmeQ0KPiA+ID4gLWYgb24NCj4gPiA+IHRoZSB1bW91bnQuIEhhdmluZyB0byBtYW51
YWxseSBkbyBhIHJlbW91bnQgZmlyc3QgZG9lc24ndCBzZWVtDQo+ID4gPiB2ZXJ5DQo+ID4gPiBh
ZG1pbi1mcmllbmRseS4NCj4gPiANCj4gPiBJdCdzIGFuIG9kZCBpbnRlcmZhY2UuICBNYXliZSB3
ZSBjb3VsZCB3cmFwIGl0IGluIHNvbWV0aGluZyBtb3JlDQo+ID4gaW50dWl0aXZlLg0KPiA+IA0K
PiA+IEknZCBiZSBuZXJ2b3VzIGFib3V0IG1ha2luZyAidW1vdW50IC1mIiBkbyBpdC4gIEkgdGhp
bmsNCj4gPiBhZG1pbmlzdHJhdG9ycw0KPiA+IGNvdWxkIGJlIHVucGxlYXNhbnRseSBzdXJwcmlz
ZWQgaW4gc29tZSBjYXNlcyBpZiBhbiAidW1vdW50IC1mIg0KPiA+IGFmZmVjdHMNCj4gPiBvdGhl
ciBtb3VudHMgb2YgdGhlIHNhbWUgc2VydmVyLg0KPiANCj4gSSB3YXMgYWxsIHNldCB0byB0ZWxs
IHlvdSB0aGF0IGl0IGFscmVhZHkgZG9lcywgYnV0IHRoZW4gdGVzdGVkIGFuZA0KPiBmb3VuZCBp
dCBkb2Vzbid0IGFuZCAuLi4uDQo+IA0KPiBzdHJ1Y3QgbmZzX3NlcnZlciAod2hpY2ggc2ItPnNf
ZnNfaW5mbyBwb2ludHMgdG8pIGNvbnRhaW5zDQo+IA0KPiAJc3RydWN0IG5mc19jbGllbnQgKglu
ZnNfY2xpZW50OwkvKiBzaGFyZWQgY2xpZW50DQo+IGFuZCBORlM0IHN0YXRlICovDQo+IA0KPiB3
aGljaCBpcyBzaGFyZWQgYmV0d2VlbiBkaWZmZXJlbnQgbW91bnRzIGZyb20gdGhlIHNhbWUgc2Vy
dmVyLCBhbmQNCj4gDQo+IAlzdHJ1Y3QgcnBjX2NsbnQgKgljbGllbnQ7CQkvKiBSUEMgY2xpZW50
DQo+IGhhbmRsZSAqLw0KPiANCj4gd2hpY2ggaXNuJ3Qgc2hhcmVkLg0KPiBzdHJ1Y3QgbmZzX2Ns
aWVudCBjb250YWlucw0KPiAJc3RydWN0IHJwY19jbG50ICoJY2xfcnBjY2xpZW50Ow0KPiANCj4g
d2hpY2ggc2VydmVyLT5jbGllbnQgaXMgY2xvbmVzIGZyb20uDQo+IA0KPiBUaGUgdGltZW91dHMg
dGhhdCBhcHBseSB0byBhIG1vdW50IGFyZSB0aGUgb25lcyBmcm9tIHNlcnZlci0+Y2xpZW50LA0K
PiBhbmQgc28gYXBwbHkgb25seSB0byB0aGF0IG1vdW50IChJIHRob3VnaHQgdGhleSB3ZXJlIHNo
YXJlZCwgYnV0IHRoYXQNCj4gaXMNCj4gYSB0aG91Z2h0IGZyb20geWVhcnMgYWdvLCBhbmQgbWF5
YmUgaXQgd2FzIHdyb25nIGF0IHRoZSB0aW1lKS4NCj4gdW1vdW50X2JlZ2luIGFib3J0cyBhbGwg
cnBjcyBhc3NvY2lhdGVkIHdpdGggc2VydmVyLT5jbGllbnQuDQo+IA0KPiBTbyB0aGUgJ3JlbW91
bnQscmV0cmFucz0wLHRpbWVvPTEnIHRoYXQgSSBwcm9wb3NlIHdvdWxkIG9ubHkgYWZmZWN0DQo+
IHRoZQ0KPiBvbmUgc3VwZXJibG9jayAoYWxsIGJpbmQtbW91bnRzIG9mIGNvdXJzZSwgaW5jbHVk
ZWQgc2hhcmVjYWNoZQ0KPiBtb3VudHMpLg0KPiANCj4gVGhlIGNvbW1lbnQgaW4gbXkgY29kZSB3
YXMgd3JvbmcuDQo+IA0KDQpJIGJ5IGZhciBwcmVmZXIgYW4gb3BlcmF0aW9uIHRoYXQgY2hhbmdl
cyB0aGUgc3VwZXJibG9jayBzdGF0ZSB0byBiZQ0KZG9uZSB1c2luZyAnbW91bnQgLW8gcmVtb3Vu
dCcuIFRoZSBpZGVhIG9mIGEgJ3Vtb3VudCAtZicgdGhhdCBtYWtlcyB0aGUNCnN1cGVyYmxvY2sg
aXJyZXZlcnNpYmx5IGNoYW5nZSBzdGF0ZSBpcyBjbGVhcmx5IG5vdCBkZXNpcmFibGUgaW4gYW4N
CmVudmlyb25tZW50IHdoZXJlIHRoZSBzYW1lIHN1cGVyYmxvY2sgY291bGQgYmUgYmluZCBtb3Vu
dGVkIGFuZC9vcg0KbW91bnRlZCBpbiBtdWx0aXBsZSBwcml2YXRlIG5hbWVzcGFjZXMuDQoNCklP
VzogJ3JlbW91bnQscmV0cmFucz0wLHRpbWVvPTEnIHdvdWxkIGJlIHNsaWdodGx5IHByZWZlcmFi
bGUgdG8NCmhhY2tpbmcgdXAgInVtb3VudCAtZiIgYmVjYXVzZSBpdCBpcyByZXZlcnNpYmxlLg0K
DQpUaGF0IHNhaWQsIHRoZXJlIGlzIG5vIHJlYXNvbiB3aHkgd2UgY291bGRuJ3QgaW1wbGVtZW50
IGEgc2luZ2xlIG1vdW50DQpvcHRpb24gdGhhdCBpbXBsZW1lbnRzIHNvbWV0aGluZyBha2luIHRv
IHRoaXMgInVtb3VudCAtZiIgYmVoYXZpb3VyDQooYW5kIHRoYXQgY2FuIGJlIHJldmVyc2VkIGJ5
IHJlc2V0dGluZyBpdCB0aHJvdWdoIGEgc2Vjb25kICdyZW1vdW50JykuDQpJdCBzZWVtcyB0byBt
ZSB0aGF0IHRoZSByZXF1ZXN0ZWQgYmVoYXZpb3VyIGlzIGFscmVhZHkgcHJldHR5IGNsb3NlIHRv
DQp3aGF0IHdlIGFscmVhZHkgaW1wbGVtZW50IHdpdGggdGhlIFJQQ19UQVNLX1NPRlRDT05OIG9w
dGlvbi4NCg0KLS0gDQpUcm9uZCBNeWtsZWJ1c3QNCkxpbnV4IE5GUyBjbGllbnQgbWFpbnRhaW5l
ciwgUHJpbWFyeURhdGENCnRyb25kLm15a2xlYnVzdEBwcmltYXJ5ZGF0YS5jb20NCg==


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-08 23:52                     ` Trond Myklebust
@ 2017-11-09 19:48                       ` Joshua Watt
  2017-11-10  0:16                         ` NeilBrown
  0 siblings, 1 reply; 36+ messages in thread
From: Joshua Watt @ 2017-11-09 19:48 UTC (permalink / raw)
  To: Trond Myklebust, bfields, jlayton, neilb; +Cc: viro, linux-nfs, dhowells

On Wed, 2017-11-08 at 23:52 +0000, Trond Myklebust wrote:
> On Thu, 2017-11-09 at 09:34 +1100, NeilBrown wrote:
> > On Wed, Nov 08 2017, J. Bruce Fields wrote:
> > 
> > > On Wed, Nov 08, 2017 at 07:08:25AM -0500, Jeff Layton wrote:
> > > > On Wed, 2017-11-08 at 14:30 +1100, NeilBrown wrote:
> > > > > What to people think of the following as an approach
> > > > > to Joshua's need?
> > > > > 
> > > > > It isn't complete by itself: it needs a couple of changes to
> > > > > nfs-utils so that it doesn't stat the mountpoint on remount,
> > > > > and it might need another kernel change so that the "mount"
> > > > > system
> > > > > call performs the same sort of careful lookup for remount
> > > > > as  the umount
> > > > > system call does, but those are relatively small details.
> > > > > 
> > > > 
> > > > Yeah, that'd be good.
> > > > 
> > > > > This is the patch that you will either love of hate.
> > > > > 
> > > > > With this patch, Joshua (or any other sysadmin) could:
> > > > > 
> > > > >   mount -o remount,retrans=0,timeo=1 /path
> > > > > 
> > > > > and then new requests on any mountpoint from that server will
> > > > > timeout
> > > > > quickly.
> > > > > Then
> > > > >   umount -f /path
> > > > >   umount -f /path
> > > 
> > > ...
> > > > Looks like a reasonable approach overall to preventing new RPCs
> > > > from
> > > > being dispatched once the "force" umount runs.
> > > 
> > > I've lost track of the discussion--after this patch, how close
> > > are
> > > we to
> > > a guaranteed force unmount?  I assume there are still a few
> > > obstacles.
> > 
> > This isn't really about forced unmount.
> > The way forward to forced unmount it:
> >  - make all waits on NFS be TASK_KILLABLE
> >  - figure out what happens to dirty data when all processes have
> >    been killed.
> > 
> > This is about allowing processes to be told that the filesystem is
> > dead
> > so that can respond (without blocking indefinitely) without
> > necessarily being killed.
> > With a local filesystem you can (in some cases) kill the underlying
> > device and all processes will start getting EIO.  This is providing
> > similar functionality for NFS.
> > 
> > > 
> > > > I do wonder if this ought to be more automatic when you specify
> > > > -f on
> > > > the umount. Having to manually do a remount first doesn't seem
> > > > very
> > > > admin-friendly.
> > > 
> > > It's an odd interface.  Maybe we could wrap it in something more
> > > intuitive.
> > > 
> > > I'd be nervous about making "umount -f" do it.  I think
> > > administrators
> > > could be unpleasantly surprised in some cases if an "umount -f"
> > > affects
> > > other mounts of the same server.
> > 
> > I was all set to tell you that it already does, but then tested and
> > found it doesn't and ....

I tried mounting two different remote paths from an NFS4 server, and
when I did 'remount,retrans=0', it changed the parameter for both of
them meaning they are indeed shaing the struct nfs_server (which based
on my reading of the code is what I would have expected). What
procedure did you use to test this?

> > 
> > struct nfs_server (which sb->s_fs_info points to) contains
> > 
> > 	struct nfs_client *	nfs_client;	/* shared client
> > and NFS4 state */
> > 
> > which is shared between different mounts from the same server, and
> > 
> > 	struct rpc_clnt *	client;		/* RPC client
> > handle */
> > 
> > which isn't shared.
> > struct nfs_client contains
> > 	struct rpc_clnt *	cl_rpcclient;
> > 
> > which server->client is clones from.
> > 
> > The timeouts that apply to a mount are the ones from server-
> > >client,
> > and so apply only to that mount (I thought they were shared, but
> > that
> > is
> > a thought from years ago, and maybe it was wrong at the time).
> > umount_begin aborts all rpcs associated with server->client.
> > 
> > So the 'remount,retrans=0,timeo=1' that I propose would only affect
> > the
> > one superblock (all bind-mounts of course, included sharecache
> > mounts).
> > 
> > The comment in my code was wrong.
> > 
> 
> I by far prefer an operation that changes the superblock state to be
> done using 'mount -o remount'. The idea of a 'umount -f' that makes
> the
> superblock irreversibly change state is clearly not desirable in an
> environment where the same superblock could be bind mounted and/or
> mounted in multiple private namespaces.
> 
> IOW: 'remount,retrans=0,timeo=1' would be slightly preferable to
> hacking up "umount -f" because it is reversible.
> 
> That said, there is no reason why we couldn't implement a single
> mount
> option that implements something akin to this "umount -f" behaviour
> (and that can be reversed by resetting it through a second
> 'remount').
> It seems to me that the requested behaviour is already pretty close
> to
> what we already implement with the RPC_TASK_SOFTCONN option.

I *think* my patch series pretty much does just that (not via
RPC_TASK_SOFTCONN). I might have broken the thread by posting with the
wrong parent... anyway you find can find them here: http://www.spinics.
net/lists/linux-nfs/msg66348.html. 

I added support for toggling the forcekill flag via remount, which I
think might be close to what you are looking for, e.g. if you mount
without the forcekill option, umount -f does what it always has. You
can then "-o remount,forcekill" and umount -f will cause all RPC tasks
to fail with -EIO. The number of patches has grown, so I currently have
this on a GitHub branch (including Neil's patch for testing): https://g
ithub.com/JPEWdev/linux/tree/sunrpc-kill, but I can send them to the
list if that would be helpful.

Using a changable mount option is fine, but it does mean that the
option again applies to all mounts from the same server (at least as
far as I can understand: I could be wrong and Neil could be right - see
above). Choosing the behavior at mount time (and not allowing it to
change after) has the advantage that you can choose which mounts it
applies to and which it doesn't. Although, I suppose if you really
wanted that behavior, you should be mounting with "nosharecache"
anyway?

> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> trond.myklebust@primarydata.com

Thanks,
Joshua Watt

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: NFS Force Unmounting
  2017-11-09 19:48                       ` Joshua Watt
@ 2017-11-10  0:16                         ` NeilBrown
  0 siblings, 0 replies; 36+ messages in thread
From: NeilBrown @ 2017-11-10  0:16 UTC (permalink / raw)
  To: Joshua Watt, Trond Myklebust, bfields, jlayton; +Cc: viro, linux-nfs, dhowells

[-- Attachment #1: Type: text/plain, Size: 1304 bytes --]

On Thu, Nov 09 2017, Joshua Watt wrote:

> On Wed, 2017-11-08 at 23:52 +0000, Trond Myklebust wrote:
>> On Thu, 2017-11-09 at 09:34 +1100, NeilBrown wrote:
>> > On Wed, Nov 08 2017, J. Bruce Fields wrote:
>> > 
>> > > 
>> > > I'd be nervous about making "umount -f" do it.  I think
>> > > administrators
>> > > could be unpleasantly surprised in some cases if an "umount -f"
>> > > affects
>> > > other mounts of the same server.
>> > 
>> > I was all set to tell you that it already does, but then tested and
>> > found it doesn't and ....
>
> I tried mounting two different remote paths from an NFS4 server, and
> when I did 'remount,retrans=0', it changed the parameter for both of
> them meaning they are indeed shaing the struct nfs_server (which based
> on my reading of the code is what I would have expected). What
> procedure did you use to test this?

I was using nosharecache, because I was thinking "of course it will
affect all mounts that use sharecache, that is really the same as
a bind-mount".  But I should have been explicit.

With nosharecache, "umount -f" or remount only affects the one mount.
With sharecache (the default) or bind mounts, "umount -f" and remount
affects the underlying superblock which might be mounted at multiple places.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC 1/4] SUNRPC: Add flag to kill new tasks
  2017-11-08 14:59               ` [RFC 1/4] SUNRPC: Add flag to kill new tasks Joshua Watt
@ 2017-11-10  1:39                 ` NeilBrown
  0 siblings, 0 replies; 36+ messages in thread
From: NeilBrown @ 2017-11-10  1:39 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

[-- Attachment #1: Type: text/plain, Size: 5127 bytes --]

On Wed, Nov 08 2017, Joshua Watt wrote:

> The new flag to rpc_killall_tasks() causes new tasks that are killed
> after to call to be killed immediately instead of being executed. This
> will all clients (particularly NFS) to prevents these task from delaying
> shutdown of the RPC session longer than necessary.

I have trouble remembering to proof-read my commit messages too. :-)
(I count 3 typos).

>
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  fs/nfs/super.c               |  4 ++--
>  include/linux/sunrpc/clnt.h  |  1 +
>  include/linux/sunrpc/sched.h |  2 +-
>  net/sunrpc/clnt.c            | 13 ++++++++++---
>  net/sunrpc/sched.c           |  7 +++++++
>  5 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 216f67d628b3..66fda2dcadd0 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -903,10 +903,10 @@ void nfs_umount_begin(struct super_block *sb)
>  	/* -EIO all pending I/O */
>  	rpc = server->client_acl;
>  	if (!IS_ERR(rpc))
> -		rpc_killall_tasks(rpc);
> +		rpc_killall_tasks(rpc, 0);
>  	rpc = server->client;
>  	if (!IS_ERR(rpc))
> -		rpc_killall_tasks(rpc);
> +		rpc_killall_tasks(rpc, 0);
>  }
>  EXPORT_SYMBOL_GPL(nfs_umount_begin);
>  
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index 71c237e8240e..94f4e464de1b 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -54,6 +54,7 @@ struct rpc_clnt {
>  				cl_noretranstimeo: 1,/* No retransmit timeouts */
>  				cl_autobind : 1,/* use getport() */
>  				cl_chatty   : 1;/* be verbose */
> +	bool			cl_kill_new_tasks;	/* Kill all new tasks */
>  
>  	struct rpc_rtt *	cl_rtt;		/* RTO estimator data */
>  	const struct rpc_timeout *cl_timeout;	/* Timeout strategy */
> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
> index d96e74e114c0..9b8bebaee0f4 100644
> --- a/include/linux/sunrpc/sched.h
> +++ b/include/linux/sunrpc/sched.h
> @@ -218,7 +218,7 @@ void		rpc_put_task_async(struct rpc_task *);
>  void		rpc_exit_task(struct rpc_task *);
>  void		rpc_exit(struct rpc_task *, int);
>  void		rpc_release_calldata(const struct rpc_call_ops *, void *);
> -void		rpc_killall_tasks(struct rpc_clnt *);
> +void		rpc_killall_tasks(struct rpc_clnt *clnt, int kill_new_tasks);
>  void		rpc_execute(struct rpc_task *);
>  void		rpc_init_priority_wait_queue(struct rpc_wait_queue *, const char *);
>  void		rpc_init_wait_queue(struct rpc_wait_queue *, const char *);
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index df4ecb042ebe..70005252b32f 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -814,14 +814,21 @@ EXPORT_SYMBOL_GPL(rpc_clnt_iterate_for_each_xprt);
>   * Kill all tasks for the given client.
>   * XXX: kill their descendants as well?
>   */
> -void rpc_killall_tasks(struct rpc_clnt *clnt)
> +void rpc_killall_tasks(struct rpc_clnt *clnt, int kill_new_tasks)
>  {
>  	struct rpc_task	*rovr;
>  
> +	dprintk("RPC:       killing all tasks for client %p\n", clnt);
> +
> +	if (kill_new_tasks) {
> +		WRITE_ONCE(clnt->cl_kill_new_tasks, true);
> +
> +		/* Ensure the kill flag is set before checking the task list */
> +		smp_mb();
> +	}

I don't find this smp_mb() usage convincing.  It might be "right", but
to me it looks weird.
Partly, this is because there is a perfectly good spinlock that would
make the code obviously correct.

I would remove the short-circuit (which returns if list_empty()) and
always take the spinlock.  Then
   if (kill_new_tasks)
   	clnt->cl_kill_new_tasks = true;


In __rpc_execute() I would just have
  if (task->tk_client->cl_kill_new_tasks)
  	rpc_exit(task, -EIO);

As the task was added to the client list with the cl_lock spinlock
helds, there are no visibility issues to require a memory barrier.

Otherwise, the patch seems to make sense.

Thanks,
NeilBrown


>  
>  	if (list_empty(&clnt->cl_tasks))
>  		return;
> -	dprintk("RPC:       killing all tasks for client %p\n", clnt);
>  	/*
>  	 * Spin lock all_tasks to prevent changes...
>  	 */
> @@ -854,7 +861,7 @@ void rpc_shutdown_client(struct rpc_clnt *clnt)
>  			rcu_dereference(clnt->cl_xprt)->servername);
>  
>  	while (!list_empty(&clnt->cl_tasks)) {
> -		rpc_killall_tasks(clnt);
> +		rpc_killall_tasks(clnt, 0);
>  		wait_event_timeout(destroy_wait,
>  			list_empty(&clnt->cl_tasks), 1*HZ);
>  	}
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 0cc83839c13c..c9bf1ab9e941 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -748,6 +748,13 @@ static void __rpc_execute(struct rpc_task *task)
>  	dprintk("RPC: %5u __rpc_execute flags=0x%x\n",
>  			task->tk_pid, task->tk_flags);
>  
> +	/* Ensure the task is in the client task list before checking the kill
> +	 * flag
> +	 */
> +	smp_mb();
> +	if (READ_ONCE(task->tk_client->cl_kill_new_tasks))
> +		rpc_exit(task, -EIO);
> +
>  	WARN_ON_ONCE(RPC_IS_QUEUED(task));
>  	if (RPC_IS_QUEUED(task))
>  		return;
> -- 
> 2.13.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC 2/4] SUNRPC: Kill client tasks from debugfs
  2017-11-08 14:59               ` [RFC 2/4] SUNRPC: Kill client tasks from debugfs Joshua Watt
@ 2017-11-10  1:47                 ` NeilBrown
  2017-11-10 14:13                   ` Joshua Watt
  0 siblings, 1 reply; 36+ messages in thread
From: NeilBrown @ 2017-11-10  1:47 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

[-- Attachment #1: Type: text/plain, Size: 2556 bytes --]

On Wed, Nov 08 2017, Joshua Watt wrote:

> The "kill" client entry in debugfs can be used to kill client tasks.
> Writing "0" causes only the currently pending tasks to be killed, while
> writing "1" all currently pending, and any future pending tasks to be
> killed.
>
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  net/sunrpc/debugfs.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
>
> diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
> index e980d2a493de..a2f33d168873 100644
> --- a/net/sunrpc/debugfs.c
> +++ b/net/sunrpc/debugfs.c
> @@ -118,6 +118,50 @@ static const struct file_operations tasks_fops = {
>  	.release	= tasks_release,
>  };
>  
> +static int
> +kill_set(void *data, u64 val)
> +{
> +	struct rpc_clnt *clnt = data;
> +
> +	rpc_killall_tasks(clnt, (int)val);
> +	return 0;
> +}
> +
> +static int
> +kill_open(struct inode *inode, struct file *filp)
> +{
> +	int ret = simple_attr_open(inode, filp, NULL, kill_set, "%i");
> +
> +	if (!ret) {
> +		struct rpc_clnt *clnt = inode->i_private;
> +
> +		if (!atomic_inc_not_zero(&clnt->cl_count)) {
> +			simple_attr_release(inode, filp);
> +			ret = -EINVAL;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +static int
> +kill_release(struct inode *inode, struct file *filp)
> +{
> +	struct rpc_clnt *clnt = inode->i_private;
> +
> +	rpc_release_client(clnt);
> +
> +	return simple_attr_release(inode, filp);
> +}
> +
> +static const struct file_operations kill_fops = {
> +	.owner	 = THIS_MODULE,
> +	.open	 = kill_open,
> +	.release = kill_release,
> +	.write	 = debugfs_attr_write,
> +	.llseek  = no_llseek,
> +};
> +
>  void
>  rpc_clnt_debugfs_register(struct rpc_clnt *clnt)
>  {
> @@ -160,6 +204,10 @@ rpc_clnt_debugfs_register(struct rpc_clnt *clnt)
>  	if (!debugfs_create_symlink("xprt", clnt->cl_debugfs, name))
>  		goto out_err;
>  
> +	if (!debugfs_create_file("kill", S_IFREG | 0200, clnt->cl_debugfs,
> +				 clnt, &kill_fops))

S_IFREG is the default for debugfs_create_file, so it isn't needed.
There are about 1085 calls to debugfs_create_file() in the kernel of
which 85 specify S_IFREG.  Maybe we should fix them all...

And please use S_IWUSR rather than 0200.

I don't know if we really need this functionality if debugfs, but I
guess that is why we are putting it there - to see.

Thanks,
NeilBrown


> +		goto out_err;
> +
>  	return;
>  out_err:
>  	debugfs_remove_recursive(clnt->cl_debugfs);
> -- 
> 2.13.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC 3/4] SUNRPC: Simplify client shutdown
  2017-11-08 14:59               ` [RFC 3/4] SUNRPC: Simplify client shutdown Joshua Watt
@ 2017-11-10  1:50                 ` NeilBrown
  0 siblings, 0 replies; 36+ messages in thread
From: NeilBrown @ 2017-11-10  1:50 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

[-- Attachment #1: Type: text/plain, Size: 1082 bytes --]

On Wed, Nov 08 2017, Joshua Watt wrote:

> Use the flag to kill all new tasks when shutting down instead of
> repeatedly killing all the pending tasks.
>
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  net/sunrpc/clnt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index 70005252b32f..8d51e0788a63 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -860,8 +860,8 @@ void rpc_shutdown_client(struct rpc_clnt *clnt)
>  			clnt->cl_program->name,
>  			rcu_dereference(clnt->cl_xprt)->servername);
>  
> +	rpc_killall_tasks(clnt, 1);
>  	while (!list_empty(&clnt->cl_tasks)) {
> -		rpc_killall_tasks(clnt, 0);
>  		wait_event_timeout(destroy_wait,
>  			list_empty(&clnt->cl_tasks), 1*HZ);
>  	}

The previously called rpc_killall_tasks every second.
Now that it doesn't do that, it doesn't do anything every second.
so:
     rpc_killall_tasks(clnt, 1);
     wait_event(destroy_wait, list_empty(&clnt->cl_tasks));

should be sufficient.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC 4/4] NFS: Add forcekill mount option
  2017-11-08 14:59               ` [RFC 4/4] NFS: Add forcekill mount option Joshua Watt
@ 2017-11-10  2:01                 ` NeilBrown
  2017-11-10 14:16                   ` Joshua Watt
  0 siblings, 1 reply; 36+ messages in thread
From: NeilBrown @ 2017-11-10  2:01 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells, Joshua Watt

[-- Attachment #1: Type: text/plain, Size: 3768 bytes --]

On Wed, Nov 08 2017, Joshua Watt wrote:

> Specifying the forcekill mount option causes umount() with the MNT_FORCE
> flag to not only kill all currently pending RPC tasks with -EIO, but
> also all future RPC tasks. This prevents the long delays caused by tasks
> queuing after rpc_killall_tasks() that can occur if the server drops off
> the network.
>
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  fs/nfs/super.c                 | 17 ++++++++++++++---
>  include/uapi/linux/nfs_mount.h |  1 +
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 66fda2dcadd0..d972f6289aca 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -90,6 +90,7 @@ enum {
>  	Opt_resvport, Opt_noresvport,
>  	Opt_fscache, Opt_nofscache,
>  	Opt_migration, Opt_nomigration,
> +	Opt_forcekill, Opt_noforcekill,
>  
>  	/* Mount options that take integer arguments */
>  	Opt_port,
> @@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
>  	{ Opt_nofscache, "nofsc" },
>  	{ Opt_migration, "migration" },
>  	{ Opt_nomigration, "nomigration" },
> +	{ Opt_forcekill, "forcekill" },
> +	{ Opt_noforcekill, "noforcekill" },
>  
>  	{ Opt_port, "port=%s" },
>  	{ Opt_rsize, "rsize=%s" },
> @@ -637,6 +640,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
>  		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
>  		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
>  		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
> +		{ NFS_MOUNT_FORCEKILL, ",forcekill", ",noforcekill" },
>  		{ 0, NULL, NULL }
>  	};
>  	const struct proc_nfs_info *nfs_infop;
> @@ -896,17 +900,18 @@ EXPORT_SYMBOL_GPL(nfs_show_stats);
>   */
>  void nfs_umount_begin(struct super_block *sb)
>  {
> -	struct nfs_server *server;
> +	struct nfs_server *server = NFS_SB(sb);
>  	struct rpc_clnt *rpc;
> +	int kill_new_tasks = !!(server->flags & NFS_MOUNT_FORCEKILL);
>  
>  	server = NFS_SB(sb);

Now that you initialized 'server' at declaration, you should really
remove this (re)initialization.

I'm not sure what I think of this patchset...
You are setting a flag which causes "umount -f" to set a different flag.
Why not just have "mount -o remount,XX" set the flag that you actually
want to set.
e.g
   mount -o remount,serverfailed /mountpoint
causes all rpcs to fail, and
   mount -o remount,noserverfailed /mountpoint

causes all rpcs to be sent to the server.
(or pick a different name than 'serverfailed').

It isn't clear what the indirection buys us.

Thanks,
NeilBrown



>  	/* -EIO all pending I/O */
>  	rpc = server->client_acl;
>  	if (!IS_ERR(rpc))
> -		rpc_killall_tasks(rpc, 0);
> +		rpc_killall_tasks(rpc, kill_new_tasks);
>  	rpc = server->client;
>  	if (!IS_ERR(rpc))
> -		rpc_killall_tasks(rpc, 0);
> +		rpc_killall_tasks(rpc, kill_new_tasks);
>  }
>  EXPORT_SYMBOL_GPL(nfs_umount_begin);
>  
> @@ -1334,6 +1339,12 @@ static int nfs_parse_mount_options(char *raw,
>  		case Opt_nomigration:
>  			mnt->options &= ~NFS_OPTION_MIGRATION;
>  			break;
> +		case Opt_forcekill:
> +			mnt->flags |= NFS_MOUNT_FORCEKILL;
> +			break;
> +		case Opt_noforcekill:
> +			mnt->flags &= ~NFS_MOUNT_FORCEKILL;
> +			break;
>  
>  		/*
>  		 * options that take numeric values
> diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
> index e44e00616ab5..66821542a38f 100644
> --- a/include/uapi/linux/nfs_mount.h
> +++ b/include/uapi/linux/nfs_mount.h
> @@ -74,5 +74,6 @@ struct nfs_mount_data {
>  
>  #define NFS_MOUNT_LOCAL_FLOCK	0x100000
>  #define NFS_MOUNT_LOCAL_FCNTL	0x200000
> +#define NFS_MOUNT_FORCEKILL	0x400000
>  
>  #endif
> -- 
> 2.13.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC 2/4] SUNRPC: Kill client tasks from debugfs
  2017-11-10  1:47                 ` NeilBrown
@ 2017-11-10 14:13                   ` Joshua Watt
  0 siblings, 0 replies; 36+ messages in thread
From: Joshua Watt @ 2017-11-10 14:13 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells

On Fri, 2017-11-10 at 12:47 +1100, NeilBrown wrote:
> On Wed, Nov 08 2017, Joshua Watt wrote:
> 
> > The "kill" client entry in debugfs can be used to kill client
> > tasks.
> > Writing "0" causes only the currently pending tasks to be killed,
> > while
> > writing "1" all currently pending, and any future pending tasks to
> > be
> > killed.
> > 
> > Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> > ---
> >  net/sunrpc/debugfs.c | 48
> > ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 48 insertions(+)
> > 
> > diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
> > index e980d2a493de..a2f33d168873 100644
> > --- a/net/sunrpc/debugfs.c
> > +++ b/net/sunrpc/debugfs.c
> > @@ -118,6 +118,50 @@ static const struct file_operations tasks_fops
> > = {
> >  	.release	= tasks_release,
> >  };
> >  
> > +static int
> > +kill_set(void *data, u64 val)
> > +{
> > +	struct rpc_clnt *clnt = data;
> > +
> > +	rpc_killall_tasks(clnt, (int)val);
> > +	return 0;
> > +}
> > +
> > +static int
> > +kill_open(struct inode *inode, struct file *filp)
> > +{
> > +	int ret = simple_attr_open(inode, filp, NULL, kill_set,
> > "%i");
> > +
> > +	if (!ret) {
> > +		struct rpc_clnt *clnt = inode->i_private;
> > +
> > +		if (!atomic_inc_not_zero(&clnt->cl_count)) {
> > +			simple_attr_release(inode, filp);
> > +			ret = -EINVAL;
> > +		}
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static int
> > +kill_release(struct inode *inode, struct file *filp)
> > +{
> > +	struct rpc_clnt *clnt = inode->i_private;
> > +
> > +	rpc_release_client(clnt);
> > +
> > +	return simple_attr_release(inode, filp);
> > +}
> > +
> > +static const struct file_operations kill_fops = {
> > +	.owner	 = THIS_MODULE,
> > +	.open	 = kill_open,
> > +	.release = kill_release,
> > +	.write	 = debugfs_attr_write,
> > +	.llseek  = no_llseek,
> > +};
> > +
> >  void
> >  rpc_clnt_debugfs_register(struct rpc_clnt *clnt)
> >  {
> > @@ -160,6 +204,10 @@ rpc_clnt_debugfs_register(struct rpc_clnt
> > *clnt)
> >  	if (!debugfs_create_symlink("xprt", clnt->cl_debugfs,
> > name))
> >  		goto out_err;
> >  
> > +	if (!debugfs_create_file("kill", S_IFREG | 0200, clnt-
> > >cl_debugfs,
> > +				 clnt, &kill_fops))
> 
> S_IFREG is the default for debugfs_create_file, so it isn't needed.
> There are about 1085 calls to debugfs_create_file() in the kernel of
> which 85 specify S_IFREG.  Maybe we should fix them all...
> 
> And please use S_IWUSR rather than 0200.

Oddly enough, checkpatch.pl complains if you use S_IWUSR, stating that
you should use the octal instead:

  WARNING: Symbolic permissions 'S_IWUSR' are not preferred. Consider
using octal permissions '0200'.

I'm alright using S_IWUSR here since the rest of the file uses them
("when in Rome" and all that).

> 
> I don't know if we really need this functionality if debugfs, but I
> guess that is why we are putting it there - to see.

Yes, I purposely constructed this patch so it could be skipped if it is
unneeded/unwanted.

> 
> Thanks,
> NeilBrown
> 
> 
> > +		goto out_err;
> > +
> >  	return;
> >  out_err:
> >  	debugfs_remove_recursive(clnt->cl_debugfs);
> > -- 
> > 2.13.6

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC 4/4] NFS: Add forcekill mount option
  2017-11-10  2:01                 ` NeilBrown
@ 2017-11-10 14:16                   ` Joshua Watt
  0 siblings, 0 replies; 36+ messages in thread
From: Joshua Watt @ 2017-11-10 14:16 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust
  Cc: linux-nfs, Al Viro, J . Bruce Fields, David Howells

On Fri, 2017-11-10 at 13:01 +1100, NeilBrown wrote:
> On Wed, Nov 08 2017, Joshua Watt wrote:
> 
> > Specifying the forcekill mount option causes umount() with the
> > MNT_FORCE
> > flag to not only kill all currently pending RPC tasks with -EIO,
> > but
> > also all future RPC tasks. This prevents the long delays caused by
> > tasks
> > queuing after rpc_killall_tasks() that can occur if the server
> > drops off
> > the network.
> > 
> > Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> > ---
> >  fs/nfs/super.c                 | 17 ++++++++++++++---
> >  include/uapi/linux/nfs_mount.h |  1 +
> >  2 files changed, 15 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> > index 66fda2dcadd0..d972f6289aca 100644
> > --- a/fs/nfs/super.c
> > +++ b/fs/nfs/super.c
> > @@ -90,6 +90,7 @@ enum {
> >  	Opt_resvport, Opt_noresvport,
> >  	Opt_fscache, Opt_nofscache,
> >  	Opt_migration, Opt_nomigration,
> > +	Opt_forcekill, Opt_noforcekill,
> >  
> >  	/* Mount options that take integer arguments */
> >  	Opt_port,
> > @@ -151,6 +152,8 @@ static const match_table_t
> > nfs_mount_option_tokens = {
> >  	{ Opt_nofscache, "nofsc" },
> >  	{ Opt_migration, "migration" },
> >  	{ Opt_nomigration, "nomigration" },
> > +	{ Opt_forcekill, "forcekill" },
> > +	{ Opt_noforcekill, "noforcekill" },
> >  
> >  	{ Opt_port, "port=%s" },
> >  	{ Opt_rsize, "rsize=%s" },
> > @@ -637,6 +640,7 @@ static void nfs_show_mount_options(struct
> > seq_file *m, struct nfs_server *nfss,
> >  		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
> >  		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
> >  		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
> > +		{ NFS_MOUNT_FORCEKILL, ",forcekill",
> > ",noforcekill" },
> >  		{ 0, NULL, NULL }
> >  	};
> >  	const struct proc_nfs_info *nfs_infop;
> > @@ -896,17 +900,18 @@ EXPORT_SYMBOL_GPL(nfs_show_stats);
> >   */
> >  void nfs_umount_begin(struct super_block *sb)
> >  {
> > -	struct nfs_server *server;
> > +	struct nfs_server *server = NFS_SB(sb);
> >  	struct rpc_clnt *rpc;
> > +	int kill_new_tasks = !!(server->flags &
> > NFS_MOUNT_FORCEKILL);
> >  
> >  	server = NFS_SB(sb);
> 
> Now that you initialized 'server' at declaration, you should really
> remove this (re)initialization.

Oops.

> 
> I'm not sure what I think of this patchset...
> You are setting a flag which causes "umount -f" to set a different
> flag.
> Why not just have "mount -o remount,XX" set the flag that you
> actually
> want to set.
> e.g
>    mount -o remount,serverfailed /mountpoint
> causes all rpcs to fail, and
>    mount -o remount,noserverfailed /mountpoint
> 
> causes all rpcs to be sent to the server.
> (or pick a different name than 'serverfailed').

Ah. I think I get it now :) Yes I will do that.

> 
> It isn't clear what the indirection buys us.
> 
> Thanks,
> NeilBrown
> 
> 
> 
> >  	/* -EIO all pending I/O */
> >  	rpc = server->client_acl;
> >  	if (!IS_ERR(rpc))
> > -		rpc_killall_tasks(rpc, 0);
> > +		rpc_killall_tasks(rpc, kill_new_tasks);
> >  	rpc = server->client;
> >  	if (!IS_ERR(rpc))
> > -		rpc_killall_tasks(rpc, 0);
> > +		rpc_killall_tasks(rpc, kill_new_tasks);
> >  }
> >  EXPORT_SYMBOL_GPL(nfs_umount_begin);
> >  
> > @@ -1334,6 +1339,12 @@ static int nfs_parse_mount_options(char
> > *raw,
> >  		case Opt_nomigration:
> >  			mnt->options &= ~NFS_OPTION_MIGRATION;
> >  			break;
> > +		case Opt_forcekill:
> > +			mnt->flags |= NFS_MOUNT_FORCEKILL;
> > +			break;
> > +		case Opt_noforcekill:
> > +			mnt->flags &= ~NFS_MOUNT_FORCEKILL;
> > +			break;
> >  
> >  		/*
> >  		 * options that take numeric values
> > diff --git a/include/uapi/linux/nfs_mount.h
> > b/include/uapi/linux/nfs_mount.h
> > index e44e00616ab5..66821542a38f 100644
> > --- a/include/uapi/linux/nfs_mount.h
> > +++ b/include/uapi/linux/nfs_mount.h
> > @@ -74,5 +74,6 @@ struct nfs_mount_data {
> >  
> >  #define NFS_MOUNT_LOCAL_FLOCK	0x100000
> >  #define NFS_MOUNT_LOCAL_FCNTL	0x200000
> > +#define NFS_MOUNT_FORCEKILL	0x400000
> >  
> >  #endif
> > -- 
> > 2.13.6

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2017-11-10 14:16 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-25 17:11 NFS Force Unmounting Joshua Watt
2017-10-30 20:20 ` J. Bruce Fields
2017-10-30 21:04   ` Joshua Watt
2017-10-30 21:09   ` NeilBrown
2017-10-31 14:41     ` Jeff Layton
2017-10-31 14:55       ` Chuck Lever
2017-10-31 17:04         ` Joshua Watt
2017-10-31 19:46           ` Chuck Lever
2017-11-01  0:53       ` NeilBrown
2017-11-01  2:22         ` Chuck Lever
2017-11-01 14:38           ` Joshua Watt
2017-11-02  0:15           ` NeilBrown
2017-11-02 19:46             ` Chuck Lever
2017-11-02 21:51               ` NeilBrown
2017-11-01 17:24     ` Jeff Layton
2017-11-01 23:13       ` NeilBrown
2017-11-02 12:09         ` Jeff Layton
2017-11-02 14:54           ` Joshua Watt
2017-11-08  3:30             ` NeilBrown
2017-11-08 12:08               ` Jeff Layton
2017-11-08 15:52                 ` J. Bruce Fields
2017-11-08 22:34                   ` NeilBrown
2017-11-08 23:52                     ` Trond Myklebust
2017-11-09 19:48                       ` Joshua Watt
2017-11-10  0:16                         ` NeilBrown
2017-11-08 14:59             ` [RFC 0/4] " Joshua Watt
2017-11-08 14:59               ` [RFC 1/4] SUNRPC: Add flag to kill new tasks Joshua Watt
2017-11-10  1:39                 ` NeilBrown
2017-11-08 14:59               ` [RFC 2/4] SUNRPC: Kill client tasks from debugfs Joshua Watt
2017-11-10  1:47                 ` NeilBrown
2017-11-10 14:13                   ` Joshua Watt
2017-11-08 14:59               ` [RFC 3/4] SUNRPC: Simplify client shutdown Joshua Watt
2017-11-10  1:50                 ` NeilBrown
2017-11-08 14:59               ` [RFC 4/4] NFS: Add forcekill mount option Joshua Watt
2017-11-10  2:01                 ` NeilBrown
2017-11-10 14:16                   ` Joshua Watt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.