* [RFC v4 0/9] NFS Force Unmounting
@ 2017-11-17 17:45 Joshua Watt
  2017-11-17 17:45 ` [RFC v4 1/9] SUNRPC: Add flag to kill new tasks Joshua Watt
                   ` (9 more replies)
  0 siblings, 10 replies; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

Latest attempt at unifying the various constraints for forcibly
unmounting, in a timely manner, NFS servers that have disappeared. Namely:
  * umount2(..., MNT_FORCE) should not be made stronger, unless we know
    this is the final umount()
  * The "failed server" state should be reversible
  * The mechanism should be able to "unstick" a sync(2) that is stuck on
    an unresponsive server

I believe the proposal satisfies all of these concerns. There are a few
major components to this proposal:
 1) The umount_begin superblock operation now has a corresponding
    umount_end operation. This is invoked by umount() when MNT_FORCE is
    specified (like umount_begin()), but the actual unmount failed (i.e.
    the mount is busy).
 2) Instead of killing all the RPCs queued at a single point in time, the
    NFS mount now kills all queued RPCs and all RPCs that get queued
    between nfs_umount_begin() and nfs_umount_end(). I believe this is
    not a significant change in behavior because there were always races
    between queuing RPCs and killing them in nfs_umount_begin().
 3) nfs_umount_end() is *not* called when MNT_DETACH is specified. This
    is the indication that the user is done with this mount and all
    RPCs will be killed until the mount finally gets removed.
 4) The new "transient" mount option prevents sharing nfs_clients
    between multiple superblocks. The only exception to this is when the
    kernel does an automatic mount to cross a device boundary ("xdev"
    NFS mount). In this case, the existing code always shares the
    existing nfs_client from the parent superblock. The "transient" mount
    option implies "nosharecache", as it doesn't make sense to share
    superblocks if clients aren't shared.
 5) If the "transient" mount option is specified (and hence the
    nfs_client is not shared), MNT_FORCE kills all RPCs for the entire
    nfs_client (and all its nfs_servers). This effectively enables the
    "burn down the forest" option when combined with MNT_DETACH.

The choice to use MNT_FORCE as the mechanism for triggering this
behavior stems from the desire to unstick sync(2) calls that might be
blocked on a non-responsive NFS server. While it was previously
discussed to remount with a new mount option to enable this behavior,
this cannot release the blocked sync(2) call because both
sync_filesystem() and do_remount() lock the struct super_block->s_umount
reader-writer lock. As such, a remount will block until the sync(2)
finishes, which is undesirable. umount2() doesn't have this restriction
and can unblock the sync(2) call.
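
To illustrate (a simplified sketch of the call paths as I understand
them, not the literal source):

    sync(2) / syncfs(2)              mount -o remount,...
      down_read(&sb->s_umount)         do_remount()
      sync_filesystem(sb)                down_write(&sb->s_umount)
        ... hangs on the dead server       ... blocks until the sync
                                               finishes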

For the most part, all existing behavior is unchanged if the "transient"
option is not specified. umount -f continues to behave as it has, but
umount -fl (see note below) now does a more aggressive kill to get
everything out of there and allow unmounting in a more timely manner.
Note that there will probably be some communication with the server
still, as destruction of the NFS client ID and such will occur when the
last reference is removed. If the server is truly gone, this can result
in long blocks at that time.

If it is known at mount time that the server might be disappearing, it
should be mounted with "transient". Doing this will allow mount -fl to
do a more complete cleanup and prevent all communication with the
server, which should allow a timely cleanup in all cases.
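
For example, such a mount would be created with something like
"mount -o transient server:/export /mnt" and later torn down with
"umount -fl /mnt".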

Notes:

Until recently, libmount did not allow a detached and lazy mount at the
same time. This was recently fixed (see
https://marc.info/?l=util-linux-ng&m=151000714401929&w=2). If you want
to test this, you may need to write a simple test program that directly
calls umount2() with MNT_FORCE | MNT_DETACH.
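
A minimal sketch of such a program (error handling kept to the bare
minimum):

	#include <stdio.h>
	#include <sys/mount.h>

	int main(int argc, char **argv)
	{
		if (argc != 2) {
			fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
			return 1;
		}

		/* Kill queued and future RPCs, then detach the mount */
		if (umount2(argv[1], MNT_FORCE | MNT_DETACH)) {
			perror("umount2");
			return 1;
		}
		return 0;
	}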

Thank you all again for your time and comments,
Joshua Watt

Joshua Watt (9):
  SUNRPC: Add flag to kill new tasks
  SUNRPC: Expose kill_new_tasks in debugfs
  SUNRPC: Simplify client shutdown
  namespace: Add umount_end superblock operation
  NFS: Kill RPCs for the duration of umount
  NFS: Add debugfs for nfs_server and nfs_client
  NFS: Add transient mount option
  NFS: Don't share transient clients
  NFS: Kill all client RPCs if transient

 fs/namespace.c                 |  22 ++++++-
 fs/nfs/Makefile                |   2 +-
 fs/nfs/client.c                |  96 +++++++++++++++++++++++++--
 fs/nfs/debugfs.c               | 143 +++++++++++++++++++++++++++++++++++++++++
 fs/nfs/inode.c                 |   5 ++
 fs/nfs/internal.h              |  11 ++++
 fs/nfs/nfs3client.c            |   2 +
 fs/nfs/nfs4client.c            |   5 ++
 fs/nfs/nfs4super.c             |   1 +
 fs/nfs/super.c                 |  81 ++++++++++++++++++++---
 include/linux/fs.h             |   1 +
 include/linux/nfs_fs_sb.h      |   6 ++
 include/linux/sunrpc/clnt.h    |   1 +
 include/uapi/linux/nfs_mount.h |   1 +
 net/sunrpc/clnt.c              |  13 ++--
 net/sunrpc/debugfs.c           |   5 ++
 net/sunrpc/sched.c             |   3 +
 17 files changed, 372 insertions(+), 26 deletions(-)
 create mode 100644 fs/nfs/debugfs.c

-- 
2.13.6



* [RFC v4 1/9] SUNRPC: Add flag to kill new tasks
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-12-05 22:59   ` NeilBrown
  2017-11-17 17:45 ` [RFC v4 2/9] SUNRPC: Expose kill_new_tasks in debugfs Joshua Watt
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

The flag causes any new tasks that are queued to exit immediately with
-EIO instead of executing. This will allow clients (particularly NFS) to
prevent these tasks from delaying shutdown of the RPC session longer
than necessary.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 include/linux/sunrpc/clnt.h | 1 +
 net/sunrpc/clnt.c           | 5 ++---
 net/sunrpc/sched.c          | 3 +++
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 71c237e8240e..d36dc529fdf0 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -54,6 +54,7 @@ struct rpc_clnt {
 				cl_noretranstimeo: 1,/* No retransmit timeouts */
 				cl_autobind : 1,/* use getport() */
 				cl_chatty   : 1;/* be verbose */
+	atomic_t		cl_kill_new_tasks;	/* Kill all new tasks */
 
 	struct rpc_rtt *	cl_rtt;		/* RTO estimator data */
 	const struct rpc_timeout *cl_timeout;	/* Timeout strategy */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index df4ecb042ebe..d5eedabf0b4d 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -626,6 +626,8 @@ static struct rpc_clnt *__rpc_clone_client(struct rpc_create_args *args,
 	new->cl_noretranstimeo = clnt->cl_noretranstimeo;
 	new->cl_discrtry = clnt->cl_discrtry;
 	new->cl_chatty = clnt->cl_chatty;
+	atomic_set(&new->cl_kill_new_tasks,
+		   atomic_read(&clnt->cl_kill_new_tasks));
 	return new;
 
 out_err:
@@ -818,9 +820,6 @@ void rpc_killall_tasks(struct rpc_clnt *clnt)
 {
 	struct rpc_task	*rovr;
 
-
-	if (list_empty(&clnt->cl_tasks))
-		return;
 	dprintk("RPC:       killing all tasks for client %p\n", clnt);
 	/*
 	 * Spin lock all_tasks to prevent changes...
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 0cc83839c13c..e70034e19ccd 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -748,6 +748,9 @@ static void __rpc_execute(struct rpc_task *task)
 	dprintk("RPC: %5u __rpc_execute flags=0x%x\n",
 			task->tk_pid, task->tk_flags);
 
+	if (atomic_read(&task->tk_client->cl_kill_new_tasks))
+		rpc_exit(task, -EIO);
+
 	WARN_ON_ONCE(RPC_IS_QUEUED(task));
 	if (RPC_IS_QUEUED(task))
 		return;
-- 
2.13.6



* [RFC v4 2/9] SUNRPC: Expose kill_new_tasks in debugfs
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
  2017-11-17 17:45 ` [RFC v4 1/9] SUNRPC: Add flag to kill new tasks Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-11-17 17:45 ` [RFC v4 3/9] SUNRPC: Simplify client shutdown Joshua Watt
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt
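
Expose the new kill_new_tasks flag as a read/write atomic in each
rpc_clnt's debugfs directory so that its state can be inspected (and,
for testing, toggled) from userspace.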

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 net/sunrpc/debugfs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
index e980d2a493de..72d9d074abd2 100644
--- a/net/sunrpc/debugfs.c
+++ b/net/sunrpc/debugfs.c
@@ -160,6 +160,11 @@ rpc_clnt_debugfs_register(struct rpc_clnt *clnt)
 	if (!debugfs_create_symlink("xprt", clnt->cl_debugfs, name))
 		goto out_err;
 
+	if (!debugfs_create_atomic_t("kill_new_tasks", S_IRUSR | S_IWUSR,
+				     clnt->cl_debugfs,
+				     &clnt->cl_kill_new_tasks))
+		goto out_err;
+
 	return;
 out_err:
 	debugfs_remove_recursive(clnt->cl_debugfs);
-- 
2.13.6



* [RFC v4 3/9] SUNRPC: Simplify client shutdown
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
  2017-11-17 17:45 ` [RFC v4 1/9] SUNRPC: Add flag to kill new tasks Joshua Watt
  2017-11-17 17:45 ` [RFC v4 2/9] SUNRPC: Expose kill_new_tasks in debugfs Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-11-17 17:45 ` [RFC v4 4/9] namespace: Add umount_end superblock operation Joshua Watt
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

Use the flag to kill all new tasks when shutting down instead of
repeatedly killing all the pending tasks.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 net/sunrpc/clnt.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index d5eedabf0b4d..c3abcc800f25 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -852,11 +852,9 @@ void rpc_shutdown_client(struct rpc_clnt *clnt)
 			clnt->cl_program->name,
 			rcu_dereference(clnt->cl_xprt)->servername);
 
-	while (!list_empty(&clnt->cl_tasks)) {
-		rpc_killall_tasks(clnt);
-		wait_event_timeout(destroy_wait,
-			list_empty(&clnt->cl_tasks), 1*HZ);
-	}
+	atomic_inc(&clnt->cl_kill_new_tasks);
+	rpc_killall_tasks(clnt);
+	wait_event(destroy_wait, list_empty(&clnt->cl_tasks));
 
 	rpc_release_client(clnt);
 }
-- 
2.13.6



* [RFC v4 4/9] namespace: Add umount_end superblock operation
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
                   ` (2 preceding siblings ...)
  2017-11-17 17:45 ` [RFC v4 3/9] SUNRPC: Simplify client shutdown Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-12-06 11:54   ` Jeff Layton
  2017-12-06 12:14   ` Al Viro
  2017-11-17 17:45 ` [RFC v4 5/9] NFS: Kill RPCs for the duration of umount Joshua Watt
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

The umount_end operation allows cleaning up state set by umount_begin in
the event the filesystem doesn't actually get unmounted.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/namespace.c     | 22 ++++++++++++++++++++--
 include/linux/fs.h |  1 +
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d18deb4c410b..d2587be4d08b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1524,6 +1524,12 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 	}
 }
 
+static void umount_end(struct super_block *sb, int flags)
+{
+	if (flags & MNT_FORCE && sb->s_op->umount_end)
+		sb->s_op->umount_end(sb);
+}
+
 static void shrink_submounts(struct mount *mnt);
 
 static int do_umount(struct mount *mnt, int flags)
@@ -1589,12 +1595,16 @@ static int do_umount(struct mount *mnt, int flags)
 		 * Special case for "unmounting" root ...
 		 * we just try to remount it readonly.
 		 */
-		if (!capable(CAP_SYS_ADMIN))
-			return -EPERM;
+		if (!capable(CAP_SYS_ADMIN)) {
+			retval = -EPERM;
+			goto out_umount_end;
+		}
 		down_write(&sb->s_umount);
 		if (!sb_rdonly(sb))
 			retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
 		up_write(&sb->s_umount);
+		/* Still mounted. Always invoke the cleanup */
+		umount_end(sb, flags);
 		return retval;
 	}
 
@@ -1617,6 +1627,14 @@ static int do_umount(struct mount *mnt, int flags)
 	}
 	unlock_mount_hash();
 	namespace_unlock();
+
+out_umount_end:
+	/* If the umount failed and the file system is still mounted, allow the
+	 * driver to clean up any state it may have set up in umount_begin(). Note
+	 * that this is purposely *not* called when MNT_DETACH is specified.
+	 */
+	if (retval)
+		umount_end(sb, flags);
 	return retval;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 885266aae2d7..5443c22da18f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1816,6 +1816,7 @@ struct super_operations {
 	int (*statfs) (struct dentry *, struct kstatfs *);
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*umount_begin) (struct super_block *);
+	void (*umount_end)(struct super_block *);
 
 	int (*show_options)(struct seq_file *, struct dentry *);
 	int (*show_devname)(struct seq_file *, struct dentry *);
-- 
2.13.6



* [RFC v4 5/9] NFS: Kill RPCs for the duration of umount
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
                   ` (3 preceding siblings ...)
  2017-11-17 17:45 ` [RFC v4 4/9] namespace: Add umount_end superblock operation Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-12-05 23:07   ` NeilBrown
  2017-11-17 17:45 ` [RFC v4 6/9] NFS: Add debugfs for nfs_server and nfs_client Joshua Watt
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

RPCs are now killed for the duration of the umount operation when
MNT_FORCE is specified. Specifically, if the umount flags are MNT_FORCE
| MNT_DETACH, all future RPCs will be killed.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/nfs/internal.h  |  1 +
 fs/nfs/nfs4super.c |  1 +
 fs/nfs/super.c     | 36 +++++++++++++++++++++++++++++-------
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f9a4a5524bd5..3ec165368a08 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -461,6 +461,7 @@ extern void nfs_pageio_reset_read_mds(struct nfs_pageio_descriptor *pgio);
 
 /* super.c */
 void nfs_umount_begin(struct super_block *);
+void nfs_umount_end(struct super_block *sb);
 int  nfs_statfs(struct dentry *, struct kstatfs *);
 int  nfs_show_options(struct seq_file *, struct dentry *);
 int  nfs_show_devname(struct seq_file *, struct dentry *);
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 6fb7cb6b3f4b..e9788f9c3394 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -56,6 +56,7 @@ static const struct super_operations nfs4_sops = {
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs4_evict_inode,
 	.umount_begin	= nfs_umount_begin,
+	.umount_end	= nfs_umount_end,
 	.show_options	= nfs_show_options,
 	.show_devname	= nfs_show_devname,
 	.show_path	= nfs_show_path,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 216f67d628b3..71361ca6d6b4 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -315,6 +315,7 @@ const struct super_operations nfs_sops = {
 	.statfs		= nfs_statfs,
 	.evict_inode	= nfs_evict_inode,
 	.umount_begin	= nfs_umount_begin,
+	.umount_end	= nfs_umount_end,
 	.show_options	= nfs_show_options,
 	.show_devname	= nfs_show_devname,
 	.show_path	= nfs_show_path,
@@ -890,6 +891,15 @@ int nfs_show_stats(struct seq_file *m, struct dentry *root)
 }
 EXPORT_SYMBOL_GPL(nfs_show_stats);
 
+static void
+kill_rpc(struct rpc_clnt *rpc)
+{
+	if (!IS_ERR(rpc)) {
+		atomic_inc(&rpc->cl_kill_new_tasks);
+		rpc_killall_tasks(rpc);
+	}
+}
+
 /*
  * Begin unmount by attempting to remove all automounted mountpoints we added
  * in response to xdev traversals and referrals
@@ -897,19 +907,31 @@ EXPORT_SYMBOL_GPL(nfs_show_stats);
 void nfs_umount_begin(struct super_block *sb)
 {
 	struct nfs_server *server;
-	struct rpc_clnt *rpc;
 
 	server = NFS_SB(sb);
 	/* -EIO all pending I/O */
-	rpc = server->client_acl;
-	if (!IS_ERR(rpc))
-		rpc_killall_tasks(rpc);
-	rpc = server->client;
-	if (!IS_ERR(rpc))
-		rpc_killall_tasks(rpc);
+	kill_rpc(server->client_acl);
+	kill_rpc(server->client);
 }
 EXPORT_SYMBOL_GPL(nfs_umount_begin);
 
+static void
+restore_rpc(struct rpc_clnt *rpc)
+{
+	if (!IS_ERR(rpc))
+		__atomic_add_unless(&rpc->cl_kill_new_tasks, -1, 0);
+}
+
+void
+nfs_umount_end(struct super_block *sb)
+{
+	struct nfs_server *server = NFS_SB(sb);
+
+	restore_rpc(server->client_acl);
+	restore_rpc(server->client);
+}
+EXPORT_SYMBOL_GPL(nfs_umount_end);
+
 static struct nfs_parsed_mount_data *nfs_alloc_parsed_mount_data(void)
 {
 	struct nfs_parsed_mount_data *data;
-- 
2.13.6



* [RFC v4 6/9] NFS: Add debugfs for nfs_server and nfs_client
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
                   ` (4 preceding siblings ...)
  2017-11-17 17:45 ` [RFC v4 5/9] NFS: Kill RPCs for the duration of umount Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-11-17 17:45 ` [RFC v4 7/9] NFS: Add transient mount option Joshua Watt
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt
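
Add a debugfs directory for each nfs_server and nfs_client, with
symlinks to the corresponding rpc_clnt debugfs entries, so the state of
a mount can be inspected when debugging forced unmounts.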

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/nfs/Makefile           |   2 +-
 fs/nfs/client.c           |  87 ++++++++++++++++++++++++++--
 fs/nfs/debugfs.c          | 143 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/inode.c            |   5 ++
 fs/nfs/internal.h         |  10 ++++
 fs/nfs/nfs4client.c       |   1 +
 include/linux/nfs_fs_sb.h |   5 ++
 7 files changed, 247 insertions(+), 6 deletions(-)
 create mode 100644 fs/nfs/debugfs.c

diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index c587e3c4c6a6..9a0553888a28 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -6,7 +6,7 @@
 obj-$(CONFIG_NFS_FS) += nfs.o
 
 CFLAGS_nfstrace.o += -I$(src)
-nfs-y 			:= client.o dir.o file.o getroot.o inode.o super.o \
+nfs-y 			:= client.o debugfs.o dir.o file.o getroot.o inode.o super.o \
 			   io.o direct.o pagelist.o read.o symlink.o unlink.o \
 			   write.o namespace.o mount_clnt.o nfstrace.o export.o
 nfs-$(CONFIG_ROOT_NFS)	+= nfsroot.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 22880ef6d8dd..7bab47d0dd07 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -142,6 +142,32 @@ void unregister_nfs_version(struct nfs_subversion *nfs)
 }
 EXPORT_SYMBOL_GPL(unregister_nfs_version);
 
+static DEFINE_IDA(nfs_client_ids);
+
+void
+nfs_cleanup_client_ids(void)
+{
+	ida_destroy(&nfs_client_ids);
+}
+
+static int
+nfs_alloc_client_id(struct nfs_client *client)
+{
+	int id;
+
+	id = ida_simple_get(&nfs_client_ids, 0, 0, GFP_KERNEL);
+	if (id < 0)
+		return id;
+	client->cl_id = id;
+	return 0;
+}
+
+static void
+nfs_free_client_id(struct nfs_client *client)
+{
+	ida_simple_remove(&nfs_client_ids, client->cl_id);
+}
+
 /*
  * Allocate a shared client record
  *
@@ -161,6 +187,8 @@ struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_init)
 	if (!try_module_get(clp->cl_nfs_mod->owner))
 		goto error_dealloc;
 
+	nfs_alloc_client_id(clp);
+
 	clp->rpc_ops = clp->cl_nfs_mod->rpc_ops;
 
 	atomic_set(&clp->cl_count, 1);
@@ -249,8 +277,11 @@ void nfs_free_client(struct nfs_client *clp)
 	if (clp->cl_machine_cred != NULL)
 		put_rpccred(clp->cl_machine_cred);
 
+	nfs_client_debugfs_unregister(clp);
+
 	put_net(clp->cl_net);
 	put_nfs_version(clp->cl_nfs_mod);
+	nfs_free_client_id(clp);
 	kfree(clp->cl_hostname);
 	kfree(clp->cl_acceptor);
 	kfree(clp);
@@ -380,6 +411,17 @@ nfs_found_client(const struct nfs_client_initdata *cl_init,
 	return clp;
 }
 
+static struct nfs_client *
+init_client(struct nfs_client *new, const struct nfs_client_initdata *cl_init)
+{
+	struct nfs_client *ret =
+		cl_init->nfs_mod->rpc_ops->init_client(new, cl_init);
+
+	if (ret)
+		nfs_client_debugfs_register(new);
+	return ret;
+}
+
 /*
  * Look up a client by IP address and protocol version
  * - creates a new record if one doesn't yet exist
@@ -411,7 +453,7 @@ struct nfs_client *nfs_get_client(const struct nfs_client_initdata *cl_init)
 					&nn->nfs_client_list);
 			spin_unlock(&nn->nfs_client_lock);
 			new->cl_flags = cl_init->init_flags;
-			return rpc_ops->init_client(new, cl_init);
+			return init_client(new, cl_init);
 		}
 
 		spin_unlock(&nn->nfs_client_lock);
@@ -856,6 +898,32 @@ void nfs_server_remove_lists(struct nfs_server *server)
 }
 EXPORT_SYMBOL_GPL(nfs_server_remove_lists);
 
+static DEFINE_IDA(nfs_server_ids);
+
+void
+nfs_cleanup_server_ids(void)
+{
+	ida_destroy(&nfs_server_ids);
+}
+
+static int
+nfs_alloc_server_id(struct nfs_server *server)
+{
+	int id;
+
+	id = ida_simple_get(&nfs_server_ids, 0, 0, GFP_KERNEL);
+	if (id < 0)
+		return id;
+	server->id = id;
+	return 0;
+}
+
+static void
+nfs_free_server_id(struct nfs_server *server)
+{
+	ida_simple_remove(&nfs_server_ids, server->id);
+}
+
 /*
  * Allocate and initialise a server record
  */
@@ -867,6 +935,8 @@ struct nfs_server *nfs_alloc_server(void)
 	if (!server)
 		return NULL;
 
+	nfs_alloc_server_id(server);
+
 	server->client = server->client_acl = ERR_PTR(-EINVAL);
 
 	/* Zero out the NFS state stuff */
@@ -879,10 +949,8 @@ struct nfs_server *nfs_alloc_server(void)
 	atomic_set(&server->active, 0);
 
 	server->io_stats = nfs_alloc_iostats();
-	if (!server->io_stats) {
-		kfree(server);
-		return NULL;
-	}
+	if (!server->io_stats)
+		goto out_error;
 
 	ida_init(&server->openowner_id);
 	ida_init(&server->lockowner_id);
@@ -890,6 +958,10 @@ struct nfs_server *nfs_alloc_server(void)
 	rpc_init_wait_queue(&server->uoc_rpcwaitq, "NFS UOC");
 
 	return server;
+out_error:
+	nfs_free_server_id(server);
+	kfree(server);
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(nfs_alloc_server);
 
@@ -910,9 +982,12 @@ void nfs_free_server(struct nfs_server *server)
 
 	nfs_put_client(server->nfs_client);
 
+	nfs_server_debugfs_unregister(server);
+
 	ida_destroy(&server->lockowner_id);
 	ida_destroy(&server->openowner_id);
 	nfs_free_iostats(server->io_stats);
+	nfs_free_server_id(server);
 	kfree(server);
 	nfs_release_automount_timer();
 }
@@ -973,6 +1048,7 @@ struct nfs_server *nfs_create_server(struct nfs_mount_info *mount_info,
 	nfs_server_insert_lists(server);
 	server->mount_time = jiffies;
 	nfs_free_fattr(fattr);
+	nfs_server_debugfs_register(server);
 	return server;
 
 error:
@@ -1033,6 +1109,7 @@ struct nfs_server *nfs_clone_server(struct nfs_server *source,
 	server->mount_time = jiffies;
 
 	nfs_free_fattr(fattr_fsinfo);
+	nfs_server_debugfs_register(server);
 	return server;
 
 out_free_server:
diff --git a/fs/nfs/debugfs.c b/fs/nfs/debugfs.c
new file mode 100644
index 000000000000..0195431427c2
--- /dev/null
+++ b/fs/nfs/debugfs.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+/**
+ * debugfs interface for nfs
+ *
+ * (c) 2017 Garmin International
+ */
+
+#include <linux/debugfs.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/nfs_fs.h>
+
+#include "nfs4_fs.h"
+#include "internal.h"
+
+static struct dentry *topdir;
+static struct dentry *nfs_server_dir;
+static struct dentry *nfs_client_dir;
+
+static struct dentry*
+link_rpc_client(char const *name, struct rpc_clnt *client,
+		struct dentry *parent)
+{
+	int len;
+	char target[34]; /* "../../../sunrpc/rpc_clnt/" + 8 hex digits + '\0' */
+
+	if (IS_ERR(client) || !client->cl_debugfs)
+		return NULL;
+
+	len = snprintf(target, sizeof(target), "../../../sunrpc/rpc_clnt/%s",
+		       client->cl_debugfs->d_name.name);
+
+	if (len >= sizeof(target))
+		return NULL;
+
+	return debugfs_create_symlink(name, parent, target);
+}
+
+void
+nfs_server_debugfs_register(struct nfs_server *server)
+{
+	char name[26]; /* "../../nfs_client/" + 8 hex digits + '\0' */
+	int len;
+
+	if (server->debugfs || !nfs_server_dir)
+		return;
+
+	len = snprintf(name, sizeof(name), "%x", server->id);
+	if (len >= sizeof(name))
+		return;
+
+	server->debugfs = debugfs_create_dir(name, nfs_server_dir);
+	if (!server->debugfs)
+		return;
+
+	link_rpc_client("rpc_client", server->client, server->debugfs);
+	link_rpc_client("rpc_client_acl", server->client_acl, server->debugfs);
+
+	if (server->nfs_client->cl_debugfs) {
+		len = snprintf(name, sizeof(name), "../../nfs_client/%s",
+			       server->nfs_client->cl_debugfs->d_name.name);
+		if (len >= sizeof(name))
+			goto out_error;
+
+		if (!debugfs_create_symlink("nfs_client", server->debugfs,
+					    name))
+			goto out_error;
+	}
+
+	return;
+out_error:
+	debugfs_remove_recursive(server->debugfs);
+	server->debugfs = NULL;
+}
+EXPORT_SYMBOL_GPL(nfs_server_debugfs_register);
+
+void
+nfs_server_debugfs_unregister(struct nfs_server *server)
+{
+	debugfs_remove_recursive(server->debugfs);
+	server->debugfs = NULL;
+}
+
+void
+nfs_client_debugfs_register(struct nfs_client *client)
+{
+	char name[9]; /* 8 hex digits + '\0' */
+	int len;
+
+	if (client->cl_debugfs || !nfs_client_dir)
+		return;
+
+	len = snprintf(name, sizeof(name), "%x", client->cl_id);
+	if (len >= sizeof(name))
+		return;
+
+	client->cl_debugfs = debugfs_create_dir(name, nfs_client_dir);
+	if (!client->cl_debugfs)
+		return;
+
+	link_rpc_client("rpc_client", client->cl_rpcclient,
+			client->cl_debugfs);
+}
+
+void
+nfs_client_debugfs_unregister(struct nfs_client *client)
+{
+	debugfs_remove_recursive(client->cl_debugfs);
+	client->cl_debugfs = NULL;
+}
+
+void __exit
+nfs_debugfs_exit(void)
+{
+	debugfs_remove_recursive(topdir);
+	topdir = NULL;
+	nfs_client_dir = NULL;
+	nfs_server_dir = NULL;
+}
+
+void __init
+nfs_debugfs_init(void)
+{
+	topdir = debugfs_create_dir("nfs", NULL);
+	if (!topdir)
+		return;
+
+	nfs_server_dir = debugfs_create_dir("nfs_server", topdir);
+	if (!nfs_server_dir)
+		goto out_remove;
+
+	nfs_client_dir = debugfs_create_dir("nfs_client", topdir);
+	if (!nfs_client_dir)
+		goto out_remove;
+
+	return;
+out_remove:
+	debugfs_remove_recursive(topdir);
+	topdir = NULL;
+	nfs_server_dir = NULL;
+	nfs_client_dir = NULL;
+}
+
+
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 134d9f560240..28879236f949 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2144,6 +2144,8 @@ static int __init init_nfs_fs(void)
 	if (err)
 		goto out0;
 
+	nfs_debugfs_init();
+
 	return 0;
 out0:
 	rpc_proc_unregister(&init_net, "nfs");
@@ -2181,6 +2183,9 @@ static void __exit exit_nfs_fs(void)
 	unregister_nfs_fs();
 	nfs_fs_proc_exit();
 	nfsiod_stop();
+	nfs_cleanup_server_ids();
+	nfs_cleanup_client_ids();
+	nfs_debugfs_exit();
 }
 
 /* Not quite true; I just maintain it */
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 3ec165368a08..969958205cbd 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -156,6 +156,7 @@ extern void nfs_umount(const struct nfs_mount_request *info);
 /* client.c */
 extern const struct rpc_program nfs_program;
 extern void nfs_clients_init(struct net *net);
+void nfs_cleanup_client_ids(void);
 extern struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *);
 int nfs_create_rpc_client(struct nfs_client *, const struct nfs_client_initdata *, rpc_authflavor_t);
 struct nfs_client *nfs_get_client(const struct nfs_client_initdata *);
@@ -165,6 +166,7 @@ void nfs_server_remove_lists(struct nfs_server *);
 void nfs_init_timeout_values(struct rpc_timeout *to, int proto, int timeo, int retrans);
 int nfs_init_server_rpcclient(struct nfs_server *, const struct rpc_timeout *t,
 		rpc_authflavor_t);
+void nfs_cleanup_server_ids(void);
 struct nfs_server *nfs_alloc_server(void);
 void nfs_server_copy_userdata(struct nfs_server *, struct nfs_server *);
 
@@ -561,6 +563,14 @@ void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
 			      struct nfs_direct_req *dreq);
 extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
 
+/* debugfs.c */
+void nfs_server_debugfs_register(struct nfs_server *server);
+void nfs_server_debugfs_unregister(struct nfs_server *server);
+void nfs_client_debugfs_register(struct nfs_client *client);
+void nfs_client_debugfs_unregister(struct nfs_client *client);
+void __exit nfs_debugfs_exit(void);
+void __init nfs_debugfs_init(void);
+
 /* nfs4proc.c */
 extern struct nfs_client *nfs4_init_client(struct nfs_client *clp,
 			    const struct nfs_client_initdata *);
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index e9bea90dc017..ef725106e45c 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -1086,6 +1086,7 @@ struct nfs_server *nfs4_create_server(struct nfs_mount_info *mount_info,
 	if (error < 0)
 		goto error;
 
+	nfs_server_debugfs_register(server);
 	return server;
 
 error:
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 286b71c418b4..5dac856355f0 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -27,6 +27,7 @@ struct nfs41_impl_id;
 struct nfs_client {
 	atomic_t		cl_count;
 	atomic_t		cl_mds_count;
+	unsigned int		cl_id;		/* client id */
 	int			cl_cons_state;	/* current construction state (-ve: init error) */
 #define NFS_CS_READY		0		/* ready to be used */
 #define NFS_CS_INITING		1		/* busy initialising */
@@ -120,6 +121,7 @@ struct nfs_client {
 #endif
 
 	struct net		*cl_net;
+	struct dentry		*cl_debugfs;	/* debugfs entry */
 };
 
 /*
@@ -131,6 +133,7 @@ struct nfs_server {
 						 * that share the same client
 						 */
 	struct list_head	master_link;	/* link in master servers list */
+	unsigned int		id;		/* server id */
 	struct rpc_clnt *	client;		/* RPC client handle */
 	struct rpc_clnt *	client_acl;	/* ACL RPC client handle */
 	struct nlm_host		*nlm_host;	/* NLM client handle */
@@ -225,6 +228,8 @@ struct nfs_server {
 	unsigned short		mountd_port;
 	unsigned short		mountd_protocol;
 	struct rpc_wait_queue	uoc_rpcwaitq;
+
+	struct dentry		*debugfs;
 };
 
 /* Server capabilities */
-- 
2.13.6



* [RFC v4 7/9] NFS: Add transient mount option
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
                   ` (5 preceding siblings ...)
  2017-11-17 17:45 ` [RFC v4 6/9] NFS: Add debugfs for nfs_server and nfs_client Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-12-06 12:23   ` Jeff Layton
  2017-11-17 17:45 ` [RFC v4 8/9] NFS: Don't share transient clients Joshua Watt
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

The "transient" option is used to indicate that the server might
disappear unexpectedly, and a more aggressive behavior should be use to
force unmount it. It implies "nosharecache"

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/nfs/super.c                 | 13 ++++++++++++-
 include/uapi/linux/nfs_mount.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 71361ca6d6b4..09c413c098ef 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -90,6 +90,7 @@ enum {
 	Opt_resvport, Opt_noresvport,
 	Opt_fscache, Opt_nofscache,
 	Opt_migration, Opt_nomigration,
+	Opt_transient, Opt_notransient,
 
 	/* Mount options that take integer arguments */
 	Opt_port,
@@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
 	{ Opt_nofscache, "nofsc" },
 	{ Opt_migration, "migration" },
 	{ Opt_nomigration, "nomigration" },
+	{ Opt_transient, "transient" },
+	{ Opt_notransient, "notransient" },
 
 	{ Opt_port, "port=%s" },
 	{ Opt_rsize, "rsize=%s" },
@@ -638,6 +641,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
 		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
 		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
 		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
+		{ NFS_MOUNT_TRANSIENT, ",transient", "" },
 		{ 0, NULL, NULL }
 	};
 	const struct proc_nfs_info *nfs_infop;
@@ -1329,7 +1333,8 @@ static int nfs_parse_mount_options(char *raw,
 			mnt->flags |= NFS_MOUNT_NORDIRPLUS;
 			break;
 		case Opt_sharecache:
-			mnt->flags &= ~NFS_MOUNT_UNSHARED;
+			mnt->flags &= ~(NFS_MOUNT_UNSHARED |
+					NFS_MOUNT_TRANSIENT);
 			break;
 		case Opt_nosharecache:
 			mnt->flags |= NFS_MOUNT_UNSHARED;
@@ -1356,6 +1361,12 @@ static int nfs_parse_mount_options(char *raw,
 		case Opt_nomigration:
 			mnt->options &= ~NFS_OPTION_MIGRATION;
 			break;
+		case Opt_transient:
+			mnt->flags |= NFS_MOUNT_TRANSIENT | NFS_MOUNT_UNSHARED;
+			break;
+		case Opt_notransient:
+			mnt->flags &= ~NFS_MOUNT_TRANSIENT;
+			break;
 
 		/*
 		 * options that take numeric values
diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
index e44e00616ab5..6067a9449d9f 100644
--- a/include/uapi/linux/nfs_mount.h
+++ b/include/uapi/linux/nfs_mount.h
@@ -74,5 +74,6 @@ struct nfs_mount_data {
 
 #define NFS_MOUNT_LOCAL_FLOCK	0x100000
 #define NFS_MOUNT_LOCAL_FCNTL	0x200000
+#define NFS_MOUNT_TRANSIENT	0x400000
 
 #endif
-- 
2.13.6



* [RFC v4 8/9] NFS: Don't share transient clients
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
                   ` (6 preceding siblings ...)
  2017-11-17 17:45 ` [RFC v4 7/9] NFS: Add transient mount option Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-11-17 17:45 ` [RFC v4 9/9] NFS: Kill all client RPCs if transient Joshua Watt
  2017-12-04 14:36 ` [RFC v4 0/9] NFS Force Unmounting Joshua Watt
  9 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

If the transient mount option is set, the client should not be shared so
that operations done to it will not affect other mounts.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/nfs/client.c           | 9 +++++++--
 fs/nfs/nfs3client.c       | 2 ++
 fs/nfs/nfs4client.c       | 4 ++++
 include/linux/nfs_fs_sb.h | 1 +
 4 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 7bab47d0dd07..9d88dea4bc59 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -328,6 +328,10 @@ static struct nfs_client *nfs_match_client(const struct nfs_client_initdata *dat
 		if (clp->cl_cons_state < 0)
 			continue;
 
+		/* Don't match clients that don't want to be shared */
+		if (test_bit(NFS_CS_UNSHARED, &clp->cl_flags))
+			continue;
+
 		/* Different NFS versions cannot share the same nfs_client */
 		if (clp->rpc_ops != data->nfs_mod->rpc_ops)
 			continue;
@@ -428,7 +432,7 @@ init_client(struct nfs_client *new, const struct nfs_client_initdata *cl_init)
  */
 struct nfs_client *nfs_get_client(const struct nfs_client_initdata *cl_init)
 {
-	struct nfs_client *clp, *new = NULL;
+	struct nfs_client *clp = NULL, *new = NULL;
 	struct nfs_net *nn = net_generic(cl_init->net, nfs_net_id);
 	const struct nfs_rpc_ops *rpc_ops = cl_init->nfs_mod->rpc_ops;
 
@@ -441,7 +445,8 @@ struct nfs_client *nfs_get_client(const struct nfs_client_initdata *cl_init)
 	do {
 		spin_lock(&nn->nfs_client_lock);
 
-		clp = nfs_match_client(cl_init);
+		if (!test_bit(NFS_CS_UNSHARED, &cl_init->init_flags))
+			clp = nfs_match_client(cl_init);
 		if (clp) {
 			spin_unlock(&nn->nfs_client_lock);
 			if (new)
diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
index 7879f2a0fcfd..a627a925f210 100644
--- a/fs/nfs/nfs3client.c
+++ b/fs/nfs/nfs3client.c
@@ -102,6 +102,8 @@ struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv,
 
 	if (mds_srv->flags & NFS_MOUNT_NORESVPORT)
 		set_bit(NFS_CS_NORESVPORT, &cl_init.init_flags);
+	if (mds_srv->flags & NFS_MOUNT_TRANSIENT)
+		__set_bit(NFS_CS_UNSHARED, &cl_init.init_flags);
 
 	/* Use the MDS nfs_client cl_ipaddr. */
 	nfs_init_timeout_values(&ds_timeout, ds_proto, ds_timeo, ds_retrans);
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index ef725106e45c..ffbc1143a4ab 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -852,6 +852,8 @@ static int nfs4_set_client(struct nfs_server *server,
 		set_bit(NFS_CS_MIGRATION, &cl_init.init_flags);
 	if (test_bit(NFS_MIG_TSM_POSSIBLE, &server->mig_status))
 		set_bit(NFS_CS_TSM_POSSIBLE, &cl_init.init_flags);
+	if (server->flags & NFS_MOUNT_TRANSIENT)
+		__set_bit(NFS_CS_UNSHARED, &cl_init.init_flags);
 
 	/* Allocate or find a client reference we can use */
 	clp = nfs_get_client(&cl_init);
@@ -910,6 +912,8 @@ struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv,
 
 	if (mds_srv->flags & NFS_MOUNT_NORESVPORT)
 		__set_bit(NFS_CS_NORESVPORT, &cl_init.init_flags);
+	if (mds_srv->flags & NFS_MOUNT_TRANSIENT)
+		__set_bit(NFS_CS_UNSHARED, &cl_init.init_flags);
 
 	/*
 	 * Set an authflavor equual to the MDS value. Use the MDS nfs_client
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 5dac856355f0..0eb7093faea7 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -45,6 +45,7 @@ struct nfs_client {
 #define NFS_CS_INFINITE_SLOTS	3		/* - don't limit TCP slots */
 #define NFS_CS_NO_RETRANS_TIMEOUT	4	/* - Disable retransmit timeouts */
 #define NFS_CS_TSM_POSSIBLE	5		/* - Maybe state migration */
+#define NFS_CS_UNSHARED		6		/* - Client state not shared */
 	struct sockaddr_storage	cl_addr;	/* server identifier */
 	size_t			cl_addrlen;
 	char *			cl_hostname;	/* hostname of server */
-- 
2.13.6



* [RFC v4 9/9] NFS: Kill all client RPCs if transient
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
                   ` (7 preceding siblings ...)
  2017-11-17 17:45 ` [RFC v4 8/9] NFS: Don't share transient clients Joshua Watt
@ 2017-11-17 17:45 ` Joshua Watt
  2017-12-04 14:36 ` [RFC v4 0/9] NFS Force Unmounting Joshua Watt
  9 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-11-17 17:45 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt

When unmounting with the MNT_FORCE flag, all client RPCs will be killed
across all mount points if the mount is transient.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
---
 fs/nfs/super.c | 40 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 09c413c098ef..f97cec1a1dec 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -904,6 +904,18 @@ kill_rpc(struct rpc_clnt *rpc)
 	}
 }
 
+static void
+kill_client_rpc(struct nfs_client *client)
+{
+	struct nfs_server *server;
+
+	kill_rpc(client->cl_rpcclient);
+	list_for_each_entry_rcu(server, &client->cl_superblocks, client_link) {
+		kill_rpc(server->client_acl);
+		kill_rpc(server->client);
+	}
+}
+
 /*
  * Begin unmount by attempting to remove all automounted mountpoints we added
  * in response to xdev traversals and referrals
@@ -914,8 +926,12 @@ void nfs_umount_begin(struct super_block *sb)
 
 	server = NFS_SB(sb);
 	/* -EIO all pending I/O */
-	kill_rpc(server->client_acl);
-	kill_rpc(server->client);
+	if (server->flags & NFS_MOUNT_TRANSIENT) {
+		kill_client_rpc(server->nfs_client);
+	} else {
+		kill_rpc(server->client_acl);
+		kill_rpc(server->client);
+	}
 }
 EXPORT_SYMBOL_GPL(nfs_umount_begin);
 
@@ -926,13 +942,29 @@ restore_rpc(struct rpc_clnt *rpc)
 		__atomic_add_unless(&rpc->cl_kill_new_tasks, -1, 0);
 }
 
+static void
+restore_client_rpc(struct nfs_client *client)
+{
+	struct nfs_server *server;
+
+	restore_rpc(client->cl_rpcclient);
+	list_for_each_entry_rcu(server, &client->cl_superblocks, client_link) {
+		restore_rpc(server->client_acl);
+		restore_rpc(server->client);
+	}
+}
+
 void
 nfs_umount_end(struct super_block *sb)
 {
 	struct nfs_server *server = NFS_SB(sb);
 
-	restore_rpc(server->client_acl);
-	restore_rpc(server->client);
+	if (server->flags & NFS_MOUNT_TRANSIENT) {
+		restore_client_rpc(server->nfs_client);
+	} else {
+		restore_rpc(server->client_acl);
+		restore_rpc(server->client);
+	}
 }
 EXPORT_SYMBOL_GPL(nfs_umount_end);
 
-- 
2.13.6



* Re: [RFC v4 0/9] NFS Force Unmounting
  2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
                   ` (8 preceding siblings ...)
  2017-11-17 17:45 ` [RFC v4 9/9] NFS: Kill all client RPCs if transient Joshua Watt
@ 2017-12-04 14:36 ` Joshua Watt
  2017-12-05 23:34     ` NeilBrown
  9 siblings, 1 reply; 32+ messages in thread
From: Joshua Watt @ 2017-12-04 14:36 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells

On Fri, 2017-11-17 at 11:45 -0600, Joshua Watt wrote:
> Latest attempt at unifying the various constraints for forcibly
> unmounting, in a timely manner, NFS servers that have disappeared.
> Namely:
>   * umount2(..., MNT_FORCE) should not be made stronger, unless we
> know
>     this is the final umount()
>   * The "failed server" state should be reversible
>   * The mechanism should be able to "unstick" a sync(2) that is stuck
> on
>     an unresponsive server
> 
> I believe the proposal satisfies all of these concerns. There are a
> few
> major components to this proposal:
>  1) The umount_begin superblock operation now has a corresponding
>     umount_end operation. This is invoked by umount() when MNT_FORCE
> is
>     specified (like umount_begin()), but the actual unmount failed
> (i.e.
>     the mount is busy).
>  2) Instead of killing all the RPCs queued at a single point in time,
> the
>     NFS mount now kills all queued RPCs and all RPCs that get queued
>     between nfs_umount_begin() and nfs_umount_end(). I believe this
> is
>     not a significant change in behavior because there were always
> races
>     between queuing RPCs and killing them in nfs_umount_begin().
>  3) nfs_umount_end() is *not* called when MNT_DETACH is specified.
> This
>     is the indication that the user is done with this mount and all
>     RPCs will be killed until the mount finally gets removed.
>  4) The new "transient" mount option prevents sharing nfs_clients
>     between multiple superblocks. The only exception to this is when
> the
>     kernel does an automatic mount to cross a device boundary ("xdev"
>     NFS mount). In this case, the existing code always shares the
>     existing nfs_client from the parent superblock. The "transient" mount
>     option implies "nosharecache", as it doesn't make sense to share
>     superblocks if clients aren't shared.
>  5) If the "transient" mount option is specified (and hence the
>     nfs_client is not shared), MNT_FORCE kills all RPCs for the
> entire
>     nfs_client (and all its nfs_servers). This effectively enables
> the
>     "burn down the forest" option when combined with MNT_DETACH.
> 
> The choice to use MNT_FORCE as the mechanism for triggering this
> behavior stems from the desire to unstick sync(2) calls that might be
> blocked on a non-responsive NFS server. While it was previously
> discussed to remount with a new mount option to enable this behavior,
> this cannot release the blocked sync(2) call because both
> sync_filesystem() and do_remount() lock the struct super_block-
> >s_umount
> reader-writer lock. As such, a remount will block until the sync(2)
> finishes, which is undesirable. umount2() doesn't have this
> restriction
> and can unblock the sync(2) call.
> 
> For the most part, all existing behavior is unchanged if the
> "transient"
> option is not specified. umount -f continues to behave as it has, but
> umount -fl (see note below) now does a more aggressive kill to get
> everything out of there and allow unmounting in a more timely manner.
> Note that there will probably be some communication with the server
> still, as destruction of the NFS client ID and such will occur when
> the
> last reference is removed. If the server is truly gone, this can
> result
> in long blocks at that time.
> 
> If it is known at mount time that the server might be disappearing,
> it
> should be mounted with "transient". Doing this will allow umount -fl
> to
> do a more complete cleanup and prevent all communication with the
> server, which should allow a timely cleanup in all cases.
> 
> Notes:
> 
> Until recently, libmount did not allow a detached and lazy mount at
> the
> same time. This was recently fixed (see
> https://marc.info/?l=util-linux-ng&m=151000714401929&w=2). If you
> want
> to test this, you may need to write a simple test program that
> directly
> calls umount2() with MNT_FORCE | MNT_DETACH.
> 
> Thank you all again for your time and comments,
> Joshua Watt
> 
> Joshua Watt (9):
>   SUNRPC: Add flag to kill new tasks
>   SUNRPC: Expose kill_new_tasks in debugfs
>   SUNRPC: Simplify client shutdown
>   namespace: Add umount_end superblock operation
>   NFS: Kill RPCs for the duration of umount
>   NFS: Add debugfs for nfs_server and nfs_client
>   NFS: Add transient mount option
>   NFS: Don't share transient clients
>   NFS: Kill all client RPCs if transient
> 
>  fs/namespace.c                 |  22 ++++++-
>  fs/nfs/Makefile                |   2 +-
>  fs/nfs/client.c                |  96 +++++++++++++++++++++++++--
>  fs/nfs/debugfs.c               | 143
> +++++++++++++++++++++++++++++++++++++++++
>  fs/nfs/inode.c                 |   5 ++
>  fs/nfs/internal.h              |  11 ++++
>  fs/nfs/nfs3client.c            |   2 +
>  fs/nfs/nfs4client.c            |   5 ++
>  fs/nfs/nfs4super.c             |   1 +
>  fs/nfs/super.c                 |  81 ++++++++++++++++++++---
>  include/linux/fs.h             |   1 +
>  include/linux/nfs_fs_sb.h      |   6 ++
>  include/linux/sunrpc/clnt.h    |   1 +
>  include/uapi/linux/nfs_mount.h |   1 +
>  net/sunrpc/clnt.c              |  13 ++--
>  net/sunrpc/debugfs.c           |   5 ++
>  net/sunrpc/sched.c             |   3 +
>  17 files changed, 372 insertions(+), 26 deletions(-)
>  create mode 100644 fs/nfs/debugfs.c
> 

Anyone have any comments on this? Sorry for the churn, it took a few
tries to get this to where it would work. I also realize I should have
put all my RFC patchsets in-reply-to each other instead of starting a
new thread for each one.

Thanks for your time,
Joshua Watt


* Re: [RFC v4 1/9] SUNRPC: Add flag to kill new tasks
  2017-11-17 17:45 ` [RFC v4 1/9] SUNRPC: Add flag to kill new tasks Joshua Watt
@ 2017-12-05 22:59   ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2017-12-05 22:59 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt


On Fri, Nov 17 2017, Joshua Watt wrote:

> The flag causes any new tasks that are queued to exit immediately with
> -EIO instead of executing. This will allow clients (particularly NFS) to
> prevents these task from delaying shutdown of the RPC session longer
> than necessary.
>
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  include/linux/sunrpc/clnt.h | 1 +
>  net/sunrpc/clnt.c           | 5 ++---
>  net/sunrpc/sched.c          | 3 +++
>  3 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
> index 71c237e8240e..d36dc529fdf0 100644
> --- a/include/linux/sunrpc/clnt.h
> +++ b/include/linux/sunrpc/clnt.h
> @@ -54,6 +54,7 @@ struct rpc_clnt {
>  				cl_noretranstimeo: 1,/* No retransmit timeouts */
>  				cl_autobind : 1,/* use getport() */
>  				cl_chatty   : 1;/* be verbose */
> +	atomic_t		cl_kill_new_tasks;	/* Kill all new tasks */

It looks a little weird that this is an atomic_t.
Looking further ahead, it seems that it is a counter of threads that
have requested that new tasks be killed.
It might help to make that clear, at least in a comment.
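Maybe something like:

	atomic_t		cl_kill_new_tasks;	/* Number of requests to kill new tasks */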

>  
>  	struct rpc_rtt *	cl_rtt;		/* RTO estimator data */
>  	const struct rpc_timeout *cl_timeout;	/* Timeout strategy */
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index df4ecb042ebe..d5eedabf0b4d 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -626,6 +626,8 @@ static struct rpc_clnt *__rpc_clone_client(struct rpc_create_args *args,
>  	new->cl_noretranstimeo = clnt->cl_noretranstimeo;
>  	new->cl_discrtry = clnt->cl_discrtry;
>  	new->cl_chatty = clnt->cl_chatty;
> +	atomic_set(&new->cl_kill_new_tasks,
> +		   atomic_read(&clnt->cl_kill_new_tasks));

However ... as it is a counter, it isn't clear that copying the counter
makes sense, and the new counter will never get decremented.
Maybe __rpc_clone_client() should fail if ->cl_kill_new_tasks > 0 ??

Thanks,
NeilBrown


>  	return new;
>  
>  out_err:
> @@ -818,9 +820,6 @@ void rpc_killall_tasks(struct rpc_clnt *clnt)
>  {
>  	struct rpc_task	*rovr;
>  
> -
> -	if (list_empty(&clnt->cl_tasks))
> -		return;
>  	dprintk("RPC:       killing all tasks for client %p\n", clnt);
>  	/*
>  	 * Spin lock all_tasks to prevent changes...
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 0cc83839c13c..e70034e19ccd 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -748,6 +748,9 @@ static void __rpc_execute(struct rpc_task *task)
>  	dprintk("RPC: %5u __rpc_execute flags=0x%x\n",
>  			task->tk_pid, task->tk_flags);
>  
> +	if (atomic_read(&task->tk_client->cl_kill_new_tasks))
> +		rpc_exit(task, -EIO);
> +
>  	WARN_ON_ONCE(RPC_IS_QUEUED(task));
>  	if (RPC_IS_QUEUED(task))
>  		return;
> -- 
> 2.13.6



* Re: [RFC v4 5/9] NFS: Kill RPCs for the duration of umount
  2017-11-17 17:45 ` [RFC v4 5/9] NFS: Kill RPCs for the duration of umount Joshua Watt
@ 2017-12-05 23:07   ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2017-12-05 23:07 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, Joshua Watt


On Fri, Nov 17 2017, Joshua Watt wrote:

> RPCs are now killed for the duration of the umount operation when
> MNT_FORCE is specified. Specifically, if the umount flags are MNT_FORCE
> | MNT_DETACH, all future RPCs will be killed.
>
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  fs/nfs/internal.h  |  1 +
>  fs/nfs/nfs4super.c |  1 +
>  fs/nfs/super.c     | 36 +++++++++++++++++++++++++++++-------
>  3 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index f9a4a5524bd5..3ec165368a08 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -461,6 +461,7 @@ extern void nfs_pageio_reset_read_mds(struct nfs_pageio_descriptor *pgio);
>  
>  /* super.c */
>  void nfs_umount_begin(struct super_block *);
> +void nfs_umount_end(struct super_block *sb);
>  int  nfs_statfs(struct dentry *, struct kstatfs *);
>  int  nfs_show_options(struct seq_file *, struct dentry *);
>  int  nfs_show_devname(struct seq_file *, struct dentry *);
> diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
> index 6fb7cb6b3f4b..e9788f9c3394 100644
> --- a/fs/nfs/nfs4super.c
> +++ b/fs/nfs/nfs4super.c
> @@ -56,6 +56,7 @@ static const struct super_operations nfs4_sops = {
>  	.statfs		= nfs_statfs,
>  	.evict_inode	= nfs4_evict_inode,
>  	.umount_begin	= nfs_umount_begin,
> +	.umount_end	= nfs_umount_end,
>  	.show_options	= nfs_show_options,
>  	.show_devname	= nfs_show_devname,
>  	.show_path	= nfs_show_path,
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 216f67d628b3..71361ca6d6b4 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -315,6 +315,7 @@ const struct super_operations nfs_sops = {
>  	.statfs		= nfs_statfs,
>  	.evict_inode	= nfs_evict_inode,
>  	.umount_begin	= nfs_umount_begin,
> +	.umount_end	= nfs_umount_end,
>  	.show_options	= nfs_show_options,
>  	.show_devname	= nfs_show_devname,
>  	.show_path	= nfs_show_path,
> @@ -890,6 +891,15 @@ int nfs_show_stats(struct seq_file *m, struct dentry *root)
>  }
>  EXPORT_SYMBOL_GPL(nfs_show_stats);
>  
> +static void
> +kill_rpc(struct rpc_clnt *rpc)
> +{
> +	if (!IS_ERR(rpc)) {
> +		atomic_inc(&rpc->cl_kill_new_tasks);
> +		rpc_killall_tasks(rpc);
> +	}
> +}
> +
>  /*
>   * Begin unmount by attempting to remove all automounted mountpoints we added
>   * in response to xdev traversals and referrals
> @@ -897,19 +907,31 @@ EXPORT_SYMBOL_GPL(nfs_show_stats);
>  void nfs_umount_begin(struct super_block *sb)
>  {
>  	struct nfs_server *server;
> -	struct rpc_clnt *rpc;
>  
>  	server = NFS_SB(sb);
>  	/* -EIO all pending I/O */
> -	rpc = server->client_acl;
> -	if (!IS_ERR(rpc))
> -		rpc_killall_tasks(rpc);
> -	rpc = server->client;
> -	if (!IS_ERR(rpc))
> -		rpc_killall_tasks(rpc);
> +	kill_rpc(server->client_acl);
> +	kill_rpc(server->client);
>  }
>  EXPORT_SYMBOL_GPL(nfs_umount_begin);
>  
> +static void
> +restore_rpc(struct rpc_clnt *rpc)
> +{
> +	if (!IS_ERR(rpc))
> +		__atomic_add_unless(&rpc->cl_kill_new_tasks, -1, 0);

Why use __atomic_add_unless() instead of the more obvious
atomic_add_unless()?
The only difference is the return value, and as you ignore that, it
makes no difference here.

Thanks,
NeilBrown


> +}
> +
> +void
> +nfs_umount_end(struct super_block *sb)
> +{
> +	struct nfs_server *server = NFS_SB(sb);
> +
> +	restore_rpc(server->client_acl);
> +	restore_rpc(server->client);
> +}
> +EXPORT_SYMBOL_GPL(nfs_umount_end);
> +
>  static struct nfs_parsed_mount_data *nfs_alloc_parsed_mount_data(void)
>  {
>  	struct nfs_parsed_mount_data *data;
> -- 
> 2.13.6
>


* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-05 23:34     ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2017-12-05 23:34 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, linux-api


On Mon, Dec 04 2017, Joshua Watt wrote:

> On Fri, 2017-11-17 at 11:45 -0600, Joshua Watt wrote:
>> Latest attempt at unifying the various constraints for forcibly
>> unmounting, in a timely manner, NFS servers that have disappeared.
>> Namely:
>>   * umount2(..., MNT_FORCE) should not be made stronger, unless we
>> know
>>     this is the final umount()
>>   * The "failed server" state should be reversible
>>   * The mechanism should be able to "unstick" a sync(2) that is stuck
>> on
>>     an unresponsive server
>> 
>> I believe the proposal satisfies all of these concerns. There are a
>> few
>> major components to this proposal:
>>  1) The umount_begin superblock operation now has a corresponding
>>     umount_end operation. This is invoked by umount() when MNT_FORCE
>> is
>>     specified (like umount_begin()), but the actual unmount failed
>> (i.e.
>>     the mount is busy).
>>  2) Instead of killing all the RPCs queued at a single point in time,
>> the
>>     NFS mount now kills all queued RPCs and all RPCs that get queued
>>     between nfs_umount_begin() and nfs_umount_end(). I believe this
>> is
>>     not a significant change in behavior because there were always
>> races
>>     between queuing RPCs and killing them in nfs_umount_begin().
>>  3) nfs_umount_end() is *not* called when MNT_DETACH is specified.
>> This
>>     is the indication that the user is done with this mount and all
>>     RPCs will be killed until the mount finally gets removed.
>>  4) The new "transient" mount option prevents sharing nfs_clients
>>     between multiple superblocks. The only exception to this is when
>> the
>>     kernel does an automatic mount to cross a device boundary ("xdev"
>>     NFS mount). In this case, the existing code always shares the
>>     existing nfs_client from parent superblock. The "transient" mount
>>     option implies "nosharecache", as it doesn't make sense to share
>>     superblocks if clients aren't shared.
>>  5) If the "transient" mount option is specified (and hence the
>>     nfs_client is not shared), MNT_FORCE kills all RPCs for the
>> entire
>>     nfs_client (and all its nfs_servers). This effectively enables
>> the
>>     "burn down the forest" option when combined with MNT_DETACH.
>> 
>> The choice to use MNT_FORCE as the mechanism for triggering this
>> behavior stems from the desire to unstick sync(2) calls that might be
>> blocked on a non-responsive NFS server. While it was previously
>> discussed to remount with a new mount option to enable this behavior,
>> this cannot release the blocked sync(2) call because both
>> sync_fileystem() and do_remount() lock the struct superblock-
>> >s_umount
>> reader-writer lock. As such, a remount will block until the sync(2)
>> finishes, which is undesirable. umount2() doesn't have this
>> restriction
>> and can unblock the sync(2) call.
>> 
>> For the most part, all existing behavior is unchanged if the
>> "transient"
>> option is not specified. umount -f continues to behave as is has, but
>> umount -fl (see note below) now does a more aggressive kill to get
>> everything out of there and allow unmounting in a more timely manner.
>> Note that there will probably be some communication with the server
>> still, as destruction of the NFS client ID and such will occur when
>> the
>> last reference is removed. If the server is truly gone, this can
>> result
>> in long blocks at that time.
>> 
>> If it is known at mount time that the server might be disappearing,
>> it
>> should be mounted with "transient". Doing this will allow mount -fl
>> to
>> do a more complete cleanup and prevent all communication with the
>> server, which should allow a timely cleanup in all cases.
>> 
>> Notes:
>> 
>> Until recently, libmount did not allow a detached and lazy mount at
>> the
>> same time. This was recently fixed (see
>> https://marc.info/?l=util-linux-ng&m=151000714401929&w=2). If you
>> want
>> to test this, you may need to write a simple test program that
>> directly
>> calls umount2() with MNT_FORCE | MNT_DETACH.
>> 
>> Thank you all again for your time and comments,
>> Joshua Watt
>> 
>> Joshua Watt (9):
>>   SUNRPC: Add flag to kill new tasks
>>   SUNRPC: Expose kill_new_tasks in debugfs
>>   SUNRPC: Simplify client shutdown
>>   namespace: Add umount_end superblock operation
>>   NFS: Kill RPCs for the duration of umount
>>   NFS: Add debugfs for nfs_server and nfs_client
>>   NFS: Add transient mount option
>>   NFS: Don't shared transient clients
>>   NFS: Kill all client RPCs if transient
>> 
>>  fs/namespace.c                 |  22 ++++++-
>>  fs/nfs/Makefile                |   2 +-
>>  fs/nfs/client.c                |  96 +++++++++++++++++++++++++--
>>  fs/nfs/debugfs.c               | 143
>> +++++++++++++++++++++++++++++++++++++++++
>>  fs/nfs/inode.c                 |   5 ++
>>  fs/nfs/internal.h              |  11 ++++
>>  fs/nfs/nfs3client.c            |   2 +
>>  fs/nfs/nfs4client.c            |   5 ++
>>  fs/nfs/nfs4super.c             |   1 +
>>  fs/nfs/super.c                 |  81 ++++++++++++++++++++---
>>  include/linux/fs.h             |   1 +
>>  include/linux/nfs_fs_sb.h      |   6 ++
>>  include/linux/sunrpc/clnt.h    |   1 +
>>  include/uapi/linux/nfs_mount.h |   1 +
>>  net/sunrpc/clnt.c              |  13 ++--
>>  net/sunrpc/debugfs.c           |   5 ++
>>  net/sunrpc/sched.c             |   3 +
>>  17 files changed, 372 insertions(+), 26 deletions(-)
>>  create mode 100644 fs/nfs/debugfs.c
>> 
>
> Anyone have any comments on this? Sorry for the churn, it took a few
> tries to get this to where it would work. I also realize I should have
> put all my RFC patchsets in-reply-to each other instead of starting a
> new thread for each one.

The new semantic for MNT_DETACH|MNT_FORCE is interesting.
As it was never possible before (from /bin/umount), it should be safe to
add a new meaning.
The meaning is effectively "detach the filesystem from the namespace and
detach the transport from the filesystem", which sounds like it is
useful.
It is worth highlighting this, and maybe even cc:ing
linux-api@vger.kernel.org ... done that.
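
For anyone wanting to experiment with the new combination, a minimal
test program would be something like this (the mount point path is only
an example):

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* detach from the namespace and kill the transport */
		if (umount2("/mnt/nfs", MNT_FORCE | MNT_DETACH) != 0) {
			perror("umount2");
			return 1;
		}
		return 0;
	}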

The new umount_end to match umount_begin seems to make sense.  It can
replace a racy semantic with a more predictable one, and that is good.
As you say, we cannot do anything really useful here with remount as
sync can block it by holding ->s_umount.

I'm not sure about the new "transient" option.  It is probably a good
idea but I haven't yet wrapped my mind around exactly what it does so I
don't want to commit just yet.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 4/9] namespace: Add umount_end superblock operation
  2017-11-17 17:45 ` [RFC v4 4/9] namespace: Add umount_end superblock operation Joshua Watt
@ 2017-12-06 11:54   ` Jeff Layton
  2017-12-06 12:14   ` Al Viro
  1 sibling, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2017-12-06 11:54 UTC (permalink / raw)
  To: Joshua Watt, NeilBrown, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells

On Fri, 2017-11-17 at 11:45 -0600, Joshua Watt wrote:
> The umount_end operation allows cleaning of state set by umount_begin in
> the event the filesystem doesn't actually get unmounted.
> 
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  fs/namespace.c     | 22 ++++++++++++++++++++--
>  include/linux/fs.h |  1 +
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index d18deb4c410b..d2587be4d08b 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1524,6 +1524,12 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
>  	}
>  }
>  
> +static void umount_end(struct super_block *sb, int flags)
> +{
> +	if (flags & MNT_FORCE && sb->s_op->umount_end)

nit: this might draw compiler complaints about operator precedence.
I'd put parentheses around the flags & MNT_FORCE.
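
i.e., make it read:

	if ((flags & MNT_FORCE) && sb->s_op->umount_end)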

> +		sb->s_op->umount_end(sb);
> +}
> +
>  static void shrink_submounts(struct mount *mnt);
>  
>  static int do_umount(struct mount *mnt, int flags)
> @@ -1589,12 +1595,16 @@ static int do_umount(struct mount *mnt, int flags)
>  		 * Special case for "unmounting" root ...
>  		 * we just try to remount it readonly.
>  		 */
> -		if (!capable(CAP_SYS_ADMIN))
> -			return -EPERM;
> +		if (!capable(CAP_SYS_ADMIN)) {
> +			retval = -EPERM;
> +			goto out_umount_end;
> +		}
>  		down_write(&sb->s_umount);
>  		if (!sb_rdonly(sb))
>  			retval = do_remount_sb(sb, SB_RDONLY, NULL, 0);
>  		up_write(&sb->s_umount);
> +		/* Still mounted. Always invoke the cleanup */
> +		umount_end(sb, flags);
>  		return retval;
>  	}
>  
> @@ -1617,6 +1627,14 @@ static int do_umount(struct mount *mnt, int flags)
>  	}
>  	unlock_mount_hash();
>  	namespace_unlock();
> +
> +out_umount_end:
> +	/* If the umount failed and the file system is still mounted, allow the
> +	 * driver to cleanup any state it may have setup in umount_begin(). Note
> +	 * that this is purposely *not* called when MNT_DETACH is specified.
> +	 */
> +	if (retval)
> +		umount_end(sb, flags);
>  	return retval;
>  }
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 885266aae2d7..5443c22da18f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1816,6 +1816,7 @@ struct super_operations {
>  	int (*statfs) (struct dentry *, struct kstatfs *);
>  	int (*remount_fs) (struct super_block *, int *, char *);
>  	void (*umount_begin) (struct super_block *);
> +	void (*umount_end)(struct super_block *);
>  
>  	int (*show_options)(struct seq_file *, struct dentry *);
>  	int (*show_devname)(struct seq_file *, struct dentry *);

Since this involves vfs-level changes, please cc linux-fsdevel@vger.kern
el.org on further postings. Other filesystem devs may be interested in
what you're doing here.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 4/9] namespace: Add umount_end superblock operation
  2017-11-17 17:45 ` [RFC v4 4/9] namespace: Add umount_end superblock operation Joshua Watt
  2017-12-06 11:54   ` Jeff Layton
@ 2017-12-06 12:14   ` Al Viro
  2017-12-06 12:33     ` Al Viro
  1 sibling, 1 reply; 32+ messages in thread
From: Al Viro @ 2017-12-06 12:14 UTC (permalink / raw)
  To: Joshua Watt
  Cc: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields,
	linux-nfs, David Howells

On Fri, Nov 17, 2017 at 11:45:47AM -0600, Joshua Watt wrote:
> The umount_end operation allows cleaning of state set by umount_begin in
> the event the filesystem doesn't actually get unmounted.

The locking doesn't make any sense.  This thing can come at *any* moment -
one process does this force-unmount of yours, another comes and accesses
the damn thing just as you've decided that umount has failed and gone
to call that method.

In other words, the filesystem can be accessed before and during that call.
And that is not to mention the fact that in another namespace the same
super_block might remain mounted through all of that, giving overlapping
accesses during and after ->umount_begin(), so the latter can't do
anything other than "try and kill those waiting on that fs" anyway.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 7/9] NFS: Add transient mount option
  2017-11-17 17:45 ` [RFC v4 7/9] NFS: Add transient mount option Joshua Watt
@ 2017-12-06 12:23   ` Jeff Layton
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2017-12-06 12:23 UTC (permalink / raw)
  To: Joshua Watt, NeilBrown, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells

On Fri, 2017-11-17 at 11:45 -0600, Joshua Watt wrote:
> The "transient" option is used to indicate that the server might
> disappear unexpectedly, and a more aggressive behavior should be used to
> force unmount it. It implies "nosharecache".
> 
> Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
> ---
>  fs/nfs/super.c                 | 13 ++++++++++++-
>  include/uapi/linux/nfs_mount.h |  1 +
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 71361ca6d6b4..09c413c098ef 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -90,6 +90,7 @@ enum {
>  	Opt_resvport, Opt_noresvport,
>  	Opt_fscache, Opt_nofscache,
>  	Opt_migration, Opt_nomigration,
> +	Opt_transient, Opt_notransient,
>  
>  	/* Mount options that take integer arguments */
>  	Opt_port,
> @@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
>  	{ Opt_nofscache, "nofsc" },
>  	{ Opt_migration, "migration" },
>  	{ Opt_nomigration, "nomigration" },
> +	{ Opt_transient, "transient" },
> +	{ Opt_notransient, "notransient" },
>  
>  	{ Opt_port, "port=%s" },
>  	{ Opt_rsize, "rsize=%s" },
> @@ -638,6 +641,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
>  		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
>  		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
>  		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
> +		{ NFS_MOUNT_TRANSIENT, ",transient", "" },
>  		{ 0, NULL, NULL }
>  	};
>  	const struct proc_nfs_info *nfs_infop;
> @@ -1329,7 +1333,8 @@ static int nfs_parse_mount_options(char *raw,
>  			mnt->flags |= NFS_MOUNT_NORDIRPLUS;
>  			break;
>  		case Opt_sharecache:
> -			mnt->flags &= ~NFS_MOUNT_UNSHARED;
> +			mnt->flags &= ~(NFS_MOUNT_UNSHARED |
> +					NFS_MOUNT_TRANSIENT);
>  			break;
>  		case Opt_nosharecache:
>  			mnt->flags |= NFS_MOUNT_UNSHARED;
> @@ -1356,6 +1361,12 @@ static int nfs_parse_mount_options(char *raw,
>  		case Opt_nomigration:
>  			mnt->options &= ~NFS_OPTION_MIGRATION;
>  			break;
> +		case Opt_transient:
> +			mnt->flags |= NFS_MOUNT_TRANSIENT | NFS_MOUNT_UNSHARED;
> +			break;
> +		case Opt_notransient:
> +			mnt->flags &= ~NFS_MOUNT_TRANSIENT;
> +			break;
>  
>  		/*
>  		 * options that take numeric values
> diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
> index e44e00616ab5..6067a9449d9f 100644
> --- a/include/uapi/linux/nfs_mount.h
> +++ b/include/uapi/linux/nfs_mount.h
> @@ -74,5 +74,6 @@ struct nfs_mount_data {
>  
>  #define NFS_MOUNT_LOCAL_FLOCK	0x100000
>  #define NFS_MOUNT_LOCAL_FCNTL	0x200000
> +#define NFS_MOUNT_TRANSIENT	0x400000
>  
>  #endif


I can't say I'm a fan of this new mount option.

Under what circumstances would I, as an admin, use this thing? Note too
that nosharecache is not without cost, so this option will have costs as
well. That will need to be documented of course.

I tend to think that stuck mounts are not something one usually
anticipates before they occur. Requiring a mount option to give you an
out for it is probably not going to be as helpful as we'd like.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 4/9] namespace: Add umount_end superblock operation
  2017-12-06 12:14   ` Al Viro
@ 2017-12-06 12:33     ` Al Viro
  2017-12-06 15:41       ` Joshua Watt
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2017-12-06 12:33 UTC (permalink / raw)
  To: Joshua Watt
  Cc: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields,
	linux-nfs, David Howells

On Wed, Dec 06, 2017 at 12:14:41PM +0000, Al Viro wrote:
> On Fri, Nov 17, 2017 at 11:45:47AM -0600, Joshua Watt wrote:
> > The umount_end operation allows cleaning of state set by umount_begin in
> > the event the filesystem doesn't actually get unmounted.
> 
> The locking doesn't make any sense.  This thing can come at *any* moment -
> one process does this force-unmount of yours, another comes and accesses
> the damn thing just as you've decided that umount has failed and go
> to call that method.

Consider, BTW, the situation when another umount -f comes just as you've
dropped ->s_umount.  Now your ->umount_end() comes *after* ->umount_begin()
from the second call.
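
Spelled out, a possible interleaving (illustrative only):

	task A                             task B
	------                             ------
	umount2(..., MNT_FORCE)
	  ->umount_begin()
	  unmount fails; drops ->s_umount
	                                   umount2(..., MNT_FORCE)
	                                     ->umount_begin()
	  ->umount_end()   /* undoes state B's ->umount_begin()
	                      has just set up */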

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-06 13:03       ` Jeff Layton
  0 siblings, 0 replies; 32+ messages in thread
From: Jeff Layton @ 2017-12-06 13:03 UTC (permalink / raw)
  To: NeilBrown, Joshua Watt, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, linux-api

On Wed, 2017-12-06 at 10:34 +1100, NeilBrown wrote:
> On Mon, Dec 04 2017, Joshua Watt wrote:
> 
> > On Fri, 2017-11-17 at 11:45 -0600, Joshua Watt wrote:
> > > Latest attempt at unifying the various constraints for force
> > > umounting
> > > NFS servers that have disappeared in a timely manner. Namely:
> > >   * umount2(..., MNT_FORCE) should not be made stronger, unless we
> > > know
> > >     this is the final umount()
> > >   * The "failed server" state should be reversible
> > >   * The mechanism should be able to "unstick" a sync(2) that is stuck
> > > on
> > >     an unresponsive server
> > > 
> > > I believe the proposal satisfies all of these concerns. There are a
> > > few
> > > major components to this proposal:
> > >  1) The umount_begin superblock operation now has a corresponding
> > >     umount_end operation. This is invoked by umount() when MNT_FORCE
> > > is
> > >     specified (like umount_begin()), but the actual unmount failed
> > > (i.e.
> > >     the mount is busy).
> > >  2) Instead of killing all the RPC queued at a single point in time,
> > > the
> > >     NFS mount now kills all queue RPCs and all RPCs that get queued
> > >     between nfs_umount_begin() and nfs_umount_end(). I believe this
> > > is
> > >     not a significant change in behavior because there were always
> > > races
> > >     between queuing RPCs and killing them in nfs_umount_begin().
> > >  3) nfs_umount_end() is *not* called when MNT_DETACH is specified.
> > > This
> > >     is the indication that the user is done with this mount and all
> > >     RPCs will be killed until the mount finally gets removed.
> > >  4) The new "transient" mount option prevents sharing nfs_clients
> > >     between multiple superblocks. The only exception to this is when
> > > the
> > >     kernel does an automatic mount to cross a device boundary ("xdev"
> > >     NFS mount). In this case, the existing code always shares the
> > >     existing nfs_client from parent superblock. The "transient" mount
> > >     option implies "nosharecache", as it doesn't make sense to share
> > >     superblocks if clients aren't shared.
> > >  5) If the "transient" mount option is specified (and hence the
> > >     nfs_client is not shared), MNT_FORCE kills all RPCs for the
> > > entire
> > >     nfs_client (and all its nfs_servers). This effectively enables
> > > the
> > >     "burn down the forest" option when combined with MNT_DETACH.
> > > 
> > > The choice to use MNT_FORCE as the mechanism for triggering this
> > > behavior stems from the desire to unstick sync(2) calls that might be
> > > blocked on a non-responsive NFS server. While it was previously
> > > discussed to remount with a new mount option to enable this behavior,
> > > this cannot release the blocked sync(2) call because both
> > > sync_fileystem() and do_remount() lock the struct superblock-
> > > > s_umount
> > > 
> > > reader-writer lock. As such, a remount will block until the sync(2)
> > > finishes, which is undesirable. umount2() doesn't have this
> > > restriction
> > > and can unblock the sync(2) call.
> > > 
> > > For the most part, all existing behavior is unchanged if the
> > > "transient"
> > > option is not specified. umount -f continues to behave as is has, but
> > > umount -fl (see note below) now does a more aggressive kill to get
> > > everything out of there and allow unmounting in a more timely manner.
> > > Note that there will probably be some communication with the server
> > > still, as destruction of the NFS client ID and such will occur when
> > > the
> > > last reference is removed. If the server is truly gone, this can
> > > result
> > > in long blocks at that time.
> > > 
> > > If it is known at mount time that the server might be disappearing,
> > > it
> > > should be mounted with "transient". Doing this will allow mount -fl
> > > to
> > > do a more complete cleanup and prevent all communication with the
> > > server, which should allow a timely cleanup in all cases.
> > > 
> > > Notes:
> > > 
> > > Until recently, libmount did not allow a detached and lazy mount at
> > > the
> > > same time. This was recently fixed (see
> > > https://marc.info/?l=util-linux-ng&m=151000714401929&w=2). If you
> > > want
> > > to test this, you may need to write a simple test program that
> > > directly
> > > calls umount2() with MNT_FORCE | MNT_DETACH.
> > > 
> > > Thank you all again for your time and comments,
> > > Joshua Watt
> > > 
> > > Joshua Watt (9):
> > >   SUNRPC: Add flag to kill new tasks
> > >   SUNRPC: Expose kill_new_tasks in debugfs
> > >   SUNRPC: Simplify client shutdown
> > >   namespace: Add umount_end superblock operation
> > >   NFS: Kill RPCs for the duration of umount
> > >   NFS: Add debugfs for nfs_server and nfs_client
> > >   NFS: Add transient mount option
> > >   NFS: Don't shared transient clients
> > >   NFS: Kill all client RPCs if transient
> > > 
> > >  fs/namespace.c                 |  22 ++++++-
> > >  fs/nfs/Makefile                |   2 +-
> > >  fs/nfs/client.c                |  96 +++++++++++++++++++++++++--
> > >  fs/nfs/debugfs.c               | 143
> > > +++++++++++++++++++++++++++++++++++++++++
> > >  fs/nfs/inode.c                 |   5 ++
> > >  fs/nfs/internal.h              |  11 ++++
> > >  fs/nfs/nfs3client.c            |   2 +
> > >  fs/nfs/nfs4client.c            |   5 ++
> > >  fs/nfs/nfs4super.c             |   1 +
> > >  fs/nfs/super.c                 |  81 ++++++++++++++++++++---
> > >  include/linux/fs.h             |   1 +
> > >  include/linux/nfs_fs_sb.h      |   6 ++
> > >  include/linux/sunrpc/clnt.h    |   1 +
> > >  include/uapi/linux/nfs_mount.h |   1 +
> > >  net/sunrpc/clnt.c              |  13 ++--
> > >  net/sunrpc/debugfs.c           |   5 ++
> > >  net/sunrpc/sched.c             |   3 +
> > >  17 files changed, 372 insertions(+), 26 deletions(-)
> > >  create mode 100644 fs/nfs/debugfs.c
> > > 
> > 
> > Anyone have any comments on this? Sorry for the churn, it took a few
> > tries to get this to where it would work. I also realize I should have
> > put all my RFC patchsets in-reply-to each other instead of starting a
> > new thread for each one.
> 
> The new semantic for MNT_DETACH|MNT_FORCE is interesting.
> As it was never possible before (from /bin/umount), it should be safe to
> add a new meaning.
> The meaning is effectively "detach the filesystem from the namespace and
> detach the transport from the filesystem", which sounds like it is
> useful.
> It is worth highlighting this, and maybe even cc:ing
> linux-api@vger.kernel.org ... done that.
> 

I'm not thrilled with the new flag combo, personally. Given that we're
introducing new behavior here, I think it wouldn't hurt to add a new
UMOUNT_* flag for this (UMOUNT_NUKE_FROM_ORBIT?).

That should still give you a -EINVAL when you try to use it on kernels
that don't understand it, and cleanly separates this behavior from the
old flags. Userland umount program would need changes of course, but
that's not hard.
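
To sketch the -EINVAL point (the flag name and value here are purely
hypothetical): the umount(2) entry point already rejects unknown bits,
so an old kernel would fail the new flag cleanly while a patched kernel
simply extends the mask:

	#define UMOUNT_NUKE_FROM_ORBIT	0x00000010	/* hypothetical */

	/* in the umount(2) syscall entry; old kernels lack the new
	 * bit in this mask and so return -EINVAL for it: */
	if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE |
		      UMOUNT_NOFOLLOW | UMOUNT_NUKE_FROM_ORBIT))
		return -EINVAL;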

Fundamentally, I don't really agree with all the hand-wringing about
torpedoing other mounts that are using a client that you're going to
kill off. This is going to be a privileged operation, and I think we
just need to ensure that it's well-documented and that the admin can
figure out what mounts will be affected and how.

> The new umount_end to match umount_begin seems to make sense.  It can
> replace a racy semantic with a more predictable one, and that is good.
> As you say, we cannot do anything really useful here with remount as
> sync can block it by holding ->s_umount.
> 
> I'm not sure about the new "transient" option.  It is probably a good
> idea but I haven't yet wrapped my mind around exactly what it does so I
> don't want to commit just yet.

I think fixing this with new mount options is a losing strategy. By the
time you realize you need it, it's too late to add it. NFS has a huge
entrenched user-base and you'll never get them all to sprinkle this new
mount option in there.

That said, it might work if you can make it take effect on a remount.

My two lumps of copper and nickel.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 4/9] namespace: Add umount_end superblock operation
  2017-12-06 12:33     ` Al Viro
@ 2017-12-06 15:41       ` Joshua Watt
  0 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-12-06 15:41 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields,
	linux-nfs, David Howells, linux-fsdevel

On Wed, 2017-12-06 at 12:33 +0000, Al Viro wrote:
> On Wed, Dec 06, 2017 at 12:14:41PM +0000, Al Viro wrote:
> > On Fri, Nov 17, 2017 at 11:45:47AM -0600, Joshua Watt wrote:
> > > The umount_end operation allows cleaning of state set by
> > > umount_begin in
> > > the event the filesystem doesn't actually get unmounted.
> > 
> > The locking doesn't make any sense.  This thing can come at *any*
> > moment -
> > one process does this force-unmount of yours, another comes and
> > accesses
> > the damn thing just as you've decided that umount has failed and go
> > to call that method.
> 
> Consider, BTW, the situation when another umount -f comes just as
> you've
> dropped ->s_umount.  Now your ->umount_end() comes *after*
> ->umount_begin()
> from the second call.

Yes, I realised that was a possibility, which is why the NFS
implementation of this uses an atomic counter in ->umount_begin() and 
->umount_end(). My line of thinking was that most fs drivers are
probably going to ignore ->umount_end(), so it is only those that
implement it that need to actually account for that problem.

Put another way, we can punt that problem to the FS driver writers if
they choose to implement ->umount_end(). Maybe that isn't fair to the
fs driver writers? Or maybe it just needs better documentation of that
expectation?

cc'ing linux-fsdevel@vger.kernel.org, original message chain can be
found here:
http://www.spinics.net/lists/linux-nfs/msg66483.html

Thanks,
Joshua Watt

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-06 16:40         ` Joshua Watt
  0 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-12-06 16:40 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, linux-api

On Wed, 2017-12-06 at 08:03 -0500, Jeff Layton wrote:
> On Wed, 2017-12-06 at 10:34 +1100, NeilBrown wrote:
> > On Mon, Dec 04 2017, Joshua Watt wrote:
> > 
> > > On Fri, 2017-11-17 at 11:45 -0600, Joshua Watt wrote:
> > > > Latest attempt at unifying the various constraints for force
> > > > umounting
> > > > NFS servers that have disappeared in a timely manner. Namely:
> > > >   * umount2(..., MNT_FORCE) should not be made stronger, unless
> > > > we
> > > > know
> > > >     this is the final umount()
> > > >   * The "failed server" state should be reversible
> > > >   * The mechanism should be able to "unstick" a sync(2) that is
> > > > stuck
> > > > on
> > > >     an unresponsive server
> > > > 
> > > > I believe the proposal satisfies all of these concerns. There
> > > > are a
> > > > few
> > > > major components to this proposal:
> > > >  1) The umount_begin superblock operation now has a
> > > > corresponding
> > > >     umount_end operation. This is invoked by umount() when
> > > > MNT_FORCE
> > > > is
> > > >     specified (like umount_begin()), but the actual unmount
> > > > failed
> > > > (i.e.
> > > >     the mount is busy).
> > > >  2) Instead of killing all the RPC queued at a single point in
> > > > time,
> > > > the
> > > >     NFS mount now kills all queue RPCs and all RPCs that get
> > > > queued
> > > >     between nfs_umount_begin() and nfs_umount_end(). I believe
> > > > this
> > > > is
> > > >     not a significant change in behavior because there were
> > > > always
> > > > races
> > > >     between queuing RPCs and killing them in
> > > > nfs_umount_begin().
> > > >  3) nfs_umount_end() is *not* called when MNT_DETACH is
> > > > specified.
> > > > This
> > > >     is the indication that the user is done with this mount and
> > > > all
> > > >     RPCs will be killed until the mount finally gets removed.
> > > >  4) The new "transient" mount option prevents sharing
> > > > nfs_clients
> > > >     between multiple superblocks. The only exception to this is
> > > > when
> > > > the
> > > >     kernel does an automatic mount to cross a device boundary
> > > > ("xdev"
> > > >     NFS mount). In this case, the existing code always shares
> > > > the
> > > >     existing nfs_client from parent superblock. The "transient"
> > > > mount
> > > >     option implies "nosharecache", as it doesn't make sense to
> > > > share
> > > >     superblocks if clients aren't shared.
> > > >  5) If the "transient" mount option is specified (and hence the
> > > >     nfs_client is not shared), MNT_FORCE kills all RPCs for the
> > > > entire
> > > >     nfs_client (and all its nfs_servers). This effectively
> > > > enables
> > > > the
> > > >     "burn down the forest" option when combined with
> > > > MNT_DETACH.
> > > > 
> > > > The choice to use MNT_FORCE as the mechanism for triggering
> > > > this
> > > > behavior stems from the desire to unstick sync(2) calls that
> > > > might be
> > > > blocked on a non-responsive NFS server. While it was previously
> > > > discussed to remount with a new mount option to enable this
> > > > behavior,
> > > > this cannot release the blocked sync(2) call because both
> > > > sync_fileystem() and do_remount() lock the struct superblock-
> > > > > s_umount
> > > > 
> > > > reader-writer lock. As such, a remount will block until the
> > > > sync(2)
> > > > finishes, which is undesirable. umount2() doesn't have this
> > > > restriction
> > > > and can unblock the sync(2) call.
> > > > 
> > > > For the most part, all existing behavior is unchanged if the
> > > > "transient"
> > > > option is not specified. umount -f continues to behave as is
> > > > has, but
> > > > umount -fl (see note below) now does a more aggressive kill to
> > > > get
> > > > everything out of there and allow unmounting in a more timely
> > > > manner.
> > > > Note that there will probably be some communication with the
> > > > server
> > > > still, as destruction of the NFS client ID and such will occur
> > > > when
> > > > the
> > > > last reference is removed. If the server is truly gone, this
> > > > can
> > > > result
> > > > in long blocks at that time.
> > > > 
> > > > If it is known at mount time that the server might be
> > > > disappearing,
> > > > it
> > > > should be mounted with "transient". Doing this will allow mount
> > > > -fl
> > > > to
> > > > do a more complete cleanup and prevent all communication with
> > > > the
> > > > server, which should allow a timely cleanup in all cases.
> > > > 
> > > > Notes:
> > > > 
> > > > Until recently, libmount did not allow a detached and lazy
> > > > mount at
> > > > the
> > > > same time. This was recently fixed (see
> > > > https://marc.info/?l=util-linux-ng&m=151000714401929&w=2). If
> > > > you
> > > > want
> > > > to test this, you may need to write a simple test program that
> > > > directly
> > > > calls umount2() with MNT_FORCE | MNT_DETACH.
> > > > 
> > > > Thank you all again for your time and comments,
> > > > Joshua Watt
> > > > 
> > > > Joshua Watt (9):
> > > >   SUNRPC: Add flag to kill new tasks
> > > >   SUNRPC: Expose kill_new_tasks in debugfs
> > > >   SUNRPC: Simplify client shutdown
> > > >   namespace: Add umount_end superblock operation
> > > >   NFS: Kill RPCs for the duration of umount
> > > >   NFS: Add debugfs for nfs_server and nfs_client
> > > >   NFS: Add transient mount option
> > > >   NFS: Don't shared transient clients
> > > >   NFS: Kill all client RPCs if transient
> > > > 
> > > >  fs/namespace.c                 |  22 ++++++-
> > > >  fs/nfs/Makefile                |   2 +-
> > > >  fs/nfs/client.c                |  96
> > > > +++++++++++++++++++++++++--
> > > >  fs/nfs/debugfs.c               | 143
> > > > +++++++++++++++++++++++++++++++++++++++++
> > > >  fs/nfs/inode.c                 |   5 ++
> > > >  fs/nfs/internal.h              |  11 ++++
> > > >  fs/nfs/nfs3client.c            |   2 +
> > > >  fs/nfs/nfs4client.c            |   5 ++
> > > >  fs/nfs/nfs4super.c             |   1 +
> > > >  fs/nfs/super.c                 |  81 ++++++++++++++++++++---
> > > >  include/linux/fs.h             |   1 +
> > > >  include/linux/nfs_fs_sb.h      |   6 ++
> > > >  include/linux/sunrpc/clnt.h    |   1 +
> > > >  include/uapi/linux/nfs_mount.h |   1 +
> > > >  net/sunrpc/clnt.c              |  13 ++--
> > > >  net/sunrpc/debugfs.c           |   5 ++
> > > >  net/sunrpc/sched.c             |   3 +
> > > >  17 files changed, 372 insertions(+), 26 deletions(-)
> > > >  create mode 100644 fs/nfs/debugfs.c
> > > > 
> > > 
> > > Anyone have any comments on this? Sorry for the churn, it took a
> > > few
> > > tries to get this to where it would work. I also realize I should
> > > have
> > > put all my RFC patchsets in-reply-to each other instead of
> > > starting a
> > > new thread for each one.
> > 
> > The new semantic for MNT_DETACH|MNT_FORCE is interesting.
> > As it was never possible before (from /bin/umount), it should be
> > safe to
> > add a new meaning.
> > The meaning is effectively "detach the filesystem from the
> > namespace and
> > detach the transport from the filesystem", which sounds like it is
> > useful.
> > It is worth highlighting this, and maybe even cc:ing
> > linux-api@vger.kernel.org ... done that.

Full disclosure: it looks like the kernel umount2() syscall always
allowed both to be specified; it was only libmount's umount API (and by
extension the umount(8) command) that didn't allow it in userspace
(until I recently changed it). In fact, the umount(8) command in
busybox allows both to be specified, and IIRC there is a comment
expressing some confusion as to why util-linux didn't.

> > 
> 
> I'm not thrilled with the new flag combo, personally. Given that
> we're
> introducing new behavior here, I think it wouldn't hurt to add a new
> UMOUNT_* flag for this (UMOUNT_NUKE_FROM_ORBIT?).
> 
> That should still give you a -EINVAL when you try to use it on
> kernels
> that don't understand it, and cleanly separates this behavior from
> the
> old flags. Userland umount program would need changes of course, but
> that's not hard.
> 
> Fundamentally, I don't really agree with all the hand-wringing about
> torpedoing other mounts that are using a client that you're going to
> kill off. This is going to be a privileged operation, and I think we
> just need to ensure that it's well-documented and that the admin can
> figure out what mounts will be affected and how.
> 
> > The new umount_end to match umount_begin seems to make sense.  It
> > can
> > replace a racy semantic with a more predictable one, and that is
> > good.
> > As you say, we cannot do anything really useful here with remount
> > as
> > sync can block it by holding ->s_umount.
> > 
> > I'm not sure about the new "transient" option.  It is probably a
> > good
> > idea but I haven't yet wrapped my mind around exactly what it does
> > so I
> > don't want to commit just yet.
> 
> I think fixing this with new mount options is a losing strategy. By
> the
> time you realize you need it, it's too late to add it. NFS has a huge
> entrenched user-base and you'll never get them all to sprinkle this
> new
> mount option in there.

My use case fortunately has the convenience of knowing a priori that the
mount option is needed when the filesystem is mounted (*occasionally*,
embedded use cases do simplify things), but I obviously realize that my
use case isn't necessarily what is best for everyone.

> 
> That said, it might work if you can make it take effect on a remount.

I think it could... the thing that blocks remount is a "stuck" call
that is never going to finish while holding the struct
superblock->s_umount reader-writer lock (specifically sync(2), but
maybe others). You could actually use umount2(..., MNT_FORCE) to try
and recover from that like so:

 until mount -o remount,transient $MOUNT; do
     umount -f $MOUNT
 done
 umount -fl $MOUNT

I'll play a little devil's advocate to my own patch here and ask why
the above is any better than simply forgoing my patches entirely and
doing:

 until umount -f $MOUNT; do echo "Retry"; done

Technically speaking, the difference between these is that the first
only races processes calling sync(2) against the remount, while the
second races all NFS calls that make RPCs against the umount.
In theory a race is a race, but I would guess in practice the second is
much more likely to have to keep retrying than the first.

Honestly, if you are a system admin trying to remove a dead NFS mount,
the second option might be better in many cases. If the umount takes
too long, you can go in and kill processes, clean up resources, or
whatever might be required to get the mount shut down. However, there
are use cases (like mine) where having the system administrator step in
to clean up the system isn't an option. On my embedded system, I *know*
this mount can disappear on me at any time, and have decided that
keeping our system responsive is more important than attempting a long
recovery process (that's probably going to fail anyway) to try and
save the data, and we have confidence that our applications can handle
this without requiring drastic things like being killed. As such, I
mount with the "transient" option right away so that my system can
clean up in a timely manner when and if it becomes necessary. I've
effectively made that administrative decision as part of the system
design, but there is currently no mechanism in the kernel that allows
me to enforce it in a sane manner.

Thanks,
Joshua Watt

> 
> My two lumps of copper and nickel.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-06 16:40         ` Joshua Watt
  0 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-12-06 16:40 UTC (permalink / raw)
  To: Jeff Layton, NeilBrown, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA, Al Viro, David Howells,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed, 2017-12-06 at 08:03 -0500, Jeff Layton wrote:
> On Wed, 2017-12-06 at 10:34 +1100, NeilBrown wrote:
> > On Mon, Dec 04 2017, Joshua Watt wrote:
> > 
> > > On Fri, 2017-11-17 at 11:45 -0600, Joshua Watt wrote:
> > > > Latest attempt at unifying the various constraints for force
> > > > umounting
> > > > NFS servers that have disappeared in a timely manner. Namely:
> > > >   * umount2(..., MNT_FORCE) should not be made stronger, unless
> > > > we
> > > > know
> > > >     this is the final umount()
> > > >   * The "failed server" state should be reversible
> > > >   * The mechanism should be able to "unstick" a sync(2) that is
> > > > stuck
> > > > on
> > > >     an unresponsive server
> > > > 
> > > > I believe the proposal satisfies all of these concerns. There
> > > > are a
> > > > few
> > > > major components to this proposal:
> > > >  1) The umount_begin superblock operation now has a
> > > > corresponding
> > > >     umount_end operation. This is invoked by umount() when
> > > > MNT_FORCE
> > > > is
> > > >     specified (like umount_begin()), but the actual unmount
> > > > failed
> > > > (i.e.
> > > >     the mount is busy).
> > > >  2) Instead of killing all the RPC queued at a single point in
> > > > time,
> > > > the
> > > >     NFS mount now kills all queue RPCs and all RPCs that get
> > > > queued
> > > >     between nfs_umount_begin() and nfs_umount_end(). I believe
> > > > this
> > > > is
> > > >     not a significant change in behavior because there were
> > > > always
> > > > races
> > > >     between queuing RPCs and killing them in
> > > > nfs_umount_begin().
> > > >  3) nfs_umount_end() is *not* called when MNT_DETACH is
> > > > specified.
> > > > This
> > > >     is the indication that the user is done with this mount and
> > > > all
> > > >     RPCs will be killed until the mount finally gets removed.
> > > >  4) The new "transient" mount option prevents sharing
> > > > nfs_clients
> > > >     between multiple superblocks. The only exception to this is
> > > > when
> > > > the
> > > >     kernel does an automatic mount to cross a device boundary
> > > > ("xdev"
> > > >     NFS mount). In this case, the existing code always shares
> > > > the
> > > >     existing nfs_client from parent superblock. The "transient"
> > > > mount
> > > >     option implies "nosharecache", as it doesn't make sense to
> > > > share
> > > >     superblocks if clients aren't shared.
> > > >  5) If the "transient" mount option is specified (and hence the
> > > >     nfs_client is not shared), MNT_FORCE kills all RPCs for the
> > > > entire
> > > >     nfs_client (and all its nfs_servers). This effectively
> > > > enables
> > > > the
> > > >     "burn down the forest" option when combined with
> > > > MNT_DETACH.
> > > > 
> > > > The choice to use MNT_FORCE as the mechanism for triggering
> > > > this
> > > > behavior stems from the desire to unstick sync(2) calls that
> > > > might be
> > > > blocked on a non-responsive NFS server. While it was previously
> > > > discussed to remount with a new mount option to enable this
> > > > behavior,
> > > > this cannot release the blocked sync(2) call because both
> > > > sync_filesystem() and do_remount() lock the struct
> > > > superblock->s_umount
> > > > reader-writer lock. As such, a remount will block until the
> > > > sync(2)
> > > > finishes, which is undesirable. umount2() doesn't have this
> > > > restriction
> > > > and can unblock the sync(2) call.
> > > > 
> > > > For the most part, all existing behavior is unchanged if the
> > > > "transient"
> > > > option is not specified. umount -f continues to behave as it
> > > > has, but
> > > > umount -fl (see note below) now does a more aggressive kill to
> > > > get
> > > > everything out of there and allow unmounting in a more timely
> > > > manner.
> > > > Note that there will probably be some communication with the
> > > > server
> > > > still, as destruction of the NFS client ID and such will occur
> > > > when
> > > > the
> > > > last reference is removed. If the server is truly gone, this
> > > > can
> > > > result
> > > > in long blocks at that time.
> > > > 
> > > > If it is known at mount time that the server might be
> > > > disappearing,
> > > > it
> > > > should be mounted with "transient". Doing this will allow mount
> > > > -fl
> > > > to
> > > > do a more complete cleanup and prevent all communication with
> > > > the
> > > > server, which should allow a timely cleanup in all cases.
> > > > 
> > > > Notes:
> > > > 
> > > > Until recently, libmount did not allow a detached and lazy
> > > > mount at
> > > > the
> > > > same time. This was recently fixed (see
> > > > https://marc.info/?l=util-linux-ng&m=151000714401929&w=2). If
> > > > you
> > > > want
> > > > to test this, you may need to write a simple test program that
> > > > directly
> > > > calls umount2() with MNT_FORCE | MNT_DETACH.
> > > > 
> > > > Thank you all again for your time and comments,
> > > > Joshua Watt
> > > > 
> > > > Joshua Watt (9):
> > > >   SUNRPC: Add flag to kill new tasks
> > > >   SUNRPC: Expose kill_new_tasks in debugfs
> > > >   SUNRPC: Simplify client shutdown
> > > >   namespace: Add umount_end superblock operation
> > > >   NFS: Kill RPCs for the duration of umount
> > > >   NFS: Add debugfs for nfs_server and nfs_client
> > > >   NFS: Add transient mount option
> > > >   NFS: Don't shared transient clients
> > > >   NFS: Kill all client RPCs if transient
> > > > 
> > > >  fs/namespace.c                 |  22 ++++++-
> > > >  fs/nfs/Makefile                |   2 +-
> > > >  fs/nfs/client.c                |  96
> > > > +++++++++++++++++++++++++--
> > > >  fs/nfs/debugfs.c               | 143
> > > > +++++++++++++++++++++++++++++++++++++++++
> > > >  fs/nfs/inode.c                 |   5 ++
> > > >  fs/nfs/internal.h              |  11 ++++
> > > >  fs/nfs/nfs3client.c            |   2 +
> > > >  fs/nfs/nfs4client.c            |   5 ++
> > > >  fs/nfs/nfs4super.c             |   1 +
> > > >  fs/nfs/super.c                 |  81 ++++++++++++++++++++---
> > > >  include/linux/fs.h             |   1 +
> > > >  include/linux/nfs_fs_sb.h      |   6 ++
> > > >  include/linux/sunrpc/clnt.h    |   1 +
> > > >  include/uapi/linux/nfs_mount.h |   1 +
> > > >  net/sunrpc/clnt.c              |  13 ++--
> > > >  net/sunrpc/debugfs.c           |   5 ++
> > > >  net/sunrpc/sched.c             |   3 +
> > > >  17 files changed, 372 insertions(+), 26 deletions(-)
> > > >  create mode 100644 fs/nfs/debugfs.c
> > > > 
> > > 
> > > Anyone have any comments on this? Sorry for the churn, it took a
> > > few
> > > tries to get this to where it would work. I also realize I should
> > > have
> > > put all my RFC patchsets in-reply-to each other instead of
> > > starting a
> > > new thread for each one.
> > 
> > The new semantic for MNT_DETACH|MNT_FORCE is interesting.
> > As it was never possible before (from /bin/umount), it should be
> > safe to
> > add a new meaning.
> > The meaning is effectively "detach the filesystem from the
> > namespace and
> > detach the transport from the filesystem", which sounds like it is
> > useful.
> > It is worth highlighting this, and maybe even cc:ing
> > linux-api@vger.kernel.org ... done that.

Full disclosure: It looks like the kernel umount2() syscall always
allowed both to be specified; it was only libmount's umount API (and by
extension the umount(8) command) that didn't allow it in userspace
(until I recently changed it). In fact, the umount(8) command in
busybox allows both to be specified, and IIRC there is a comment there
expressing some confusion as to why util-linux didn't.
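
For reference, a minimal sketch of a test program along those lines
(nothing here beyond what the umount2(2) man page documents):

 /* force-detach.c: issue umount2() with MNT_FORCE | MNT_DETACH, the
  * combination umount(8) could not produce before the libmount fix. */
 #include <stdio.h>
 #include <sys/mount.h>

 int main(int argc, char **argv)
 {
         if (argc != 2) {
                 fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                 return 1;
         }
         if (umount2(argv[1], MNT_FORCE | MNT_DETACH)) {
                 perror("umount2");
                 return 1;
         }
         return 0;
 }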

> > 
> 
> I'm not thrilled with the new flag combo, personally. Given that
> we're
> introducing new behavior here, I think it wouldn't hurt to add a new
> UMOUNT_* flag for this (UMOUNT_NUKE_FROM_ORBIT?).
> 
> That should still give you a -EINVAL when you try to use it on
> kernels
> that don't understand it, and cleanly separates this behavior from
> the
> old flags. Userland umount program would need changes of course, but
> that's not hard.
> 
> Fundamentally, I don't really agree with all the hand-wringing about
> torpedoing other mounts that are using a client that you're going to
> kill off. This is going to be a privileged operation, and I think we
> just need to ensure that it's well-documented and that the admin can
> figure out what mounts will be affected and how.
> 
> > The new umount_end to match umount_begin seems to make sense.  It
> > can
> > replace a racy semantic with a more predictable one, and that is
> > good.
> > As you say, we cannot do anything really useful here with remount
> > as
> > sync can block it by holding ->s_umount.
> > 
> > I'm not sure about the new "transient" option.  It is probably a
> > good
> > idea but I haven't yet wrapped my mind around exactly what it does
> > so I
> > don't want to commit just yet.
> 
> I think fixing this with new mount options is a losing strategy. By
> the
> time you realize you need it, it's too late to add it. NFS has a huge
> entrenched user-base and you'll never get them all to sprinkle this
> new
> mount option in there.

My use case fortunately has the convenience of knowing a priori that the
mount option is needed when the filesystem is mounted (*occasionally*,
embedded use cases do simplify things), but I obviously realize that my
use case isn't necessarily what is best for everyone.

> 
> That said, it might work if you can make it take effect on a remount.

I think it could.... the thing that blocks remount is waiting on a
"stuck" call that is never going to finish and holds the struct
superblock->s_umount reader-writer lock (specifically sync(2), but maybe
others). You could actually use umount(..., MNT_FORCE) to try and
recover that like so:

 until mount -o remount,transient $MOUNT; do
     umount -f $MOUNT
 done
 umount -fl $MOUNT

I'll play a little devil's advocate to my own patch here and ask why
the above is any better than simply forgoing my patches entirely and
doing:

 until umount -f $MOUNT; do echo "Retry"; done

Technically speaking the difference between these is that the first
only races between processes calling sync(2) and the remount, while the
second races between all NFS calls that make RPC calls and the umount.
In theory a race is a race, but I would guess in practice the second is
much more likely to have to keep retrying than the first.

Honestly, if you are a system admin trying to remove a dead NFS mount,
the second option might be better in many cases. If the umount takes
too long, you can go in and kill processes, clean up resources, or
whatever might be required to get the mount shut down. However, there
are use cases (like mine) where having the system administrator step in
to clean up the system isn't an option. On my embedded system, I *know*
this mount can disappear on me at any time, and have decided that
keeping our system responsive is more important than attempting a long
recovery process (that's probably going to fail anyway) to try and
save the data, and we have confidence that our applications can handle
this without requiring drastic things like being killed. As such, I
mount with the "transient" option right away so that my system can
clean up in a timely manner when and if it becomes necessary. I've
effectively made that administrative decision as part of the system
design, but there is currently no mechanism in the kernel that allows
me to enforce it in a sane manner.

Thanks,
Joshua Watt

> 
> My two lumps of copper and nickel.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-08  2:10         ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2017-12-08  2:10 UTC (permalink / raw)
  To: Jeff Layton, Joshua Watt, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, linux-api

[-- Attachment #1: Type: text/plain, Size: 2268 bytes --]

On Wed, Dec 06 2017, Jeff Layton wrote:

> On Wed, 2017-12-06 at 10:34 +1100, NeilBrown wrote:
>> 
>> The new semantic for MNT_DETACH|MNT_FORCE is interesting.
>> As it was never possible before (from /bin/umount), it should be safe to
>> add a new meaning.
>> The meaning is effectively "detach the filesystem from the namespace and
>> detach the transport from the filesystem", which sounds like it is
>> useful.
>> It is worth highlighting this, and maybe even cc:ing
>> linux-api@vger.kernel.org ... done that.
>> 
>
> I'm not thrilled with the new flag combo, personally. Given that we're
> introducing new behavior here, I think it wouldn't hurt to add a new
> UMOUNT_* flag for this (UMOUNT_NUKE_FROM_ORBIT?).

Suppose we did... MNT_FORCE_PONIES. What would be the semantics of this
flag?  Once we had it, would anyone ever want to use MNT_FORCE again?

MNT_FORCE is already fairly heavy handed.  It aborts an arbitrary
collection of RPC requests being sent for the given filesystem, no
matter where else that filesystem might be mounted.
Is it ever safe to use this flag unless you have good reason to believe
that the server is not available and there is no point pretending any
more?
And if that is the case, why not use the new MNT_FORCE_PONIES which is
at least predictable and reliable?

We've been talking a lot about the one NFS filesystem being mounted in
multiple containers.  MNT_FORCE is already a problem for such mounts as
one container can kill requests generated from another container.  Maybe
MNT_FORCE needs to be restricted to "real" root.
Once we restrict it, do we need to keep it from being too harsh?

What would be really nice is a timeout for umount, and for sync.
The timeout only starts when the filesystem stops making progress for
writeback.  If it eventually does timeout, then the caller can fall back
to MNT_DETACH if they are in a container, or MNT_FORCE if not.
(Maybe MNT_FORCE should map to MNT_DETACH in a container??? or maybe
not).

There is a lot here that still isn't clear to me, but one thing does seem
to be becoming clear:  MNT_FORCE as it stands is nearly useless and it
would serve us well to find a semantic that is actually useful, and
impose that.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-14 18:22           ` Joshua Watt
  0 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-12-14 18:22 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, linux-api

On Fri, 2017-12-08 at 13:10 +1100, NeilBrown wrote:
> On Wed, Dec 06 2017, Jeff Layton wrote:
> 
> > On Wed, 2017-12-06 at 10:34 +1100, NeilBrown wrote:
> > > 
> > > The new semantic for MNT_DETACH|MNT_FORCE is interesting.
> > > As it was never possible before (from /bin/umount), it should be
> > > safe to
> > > add a new meaning.
> > > The meaning is effectively "detach the filesystem from the
> > > namespace and
> > > detach the transport from the filesystem", which sounds like it
> > > is
> > > useful.
> > > It is worth highlighting this, and maybe even cc:ing
> > > linux-api@vger.kernel.org ... done that.
> > > 
> > 
> > I'm not thrilled with the new flag combo, personally. Given that
> > we're
> > introducing new behavior here, I think it wouldn't hurt to add a
> > new
> > UMOUNT_* flag for this (UMOUNT_NUKE_FROM_ORBIT?).
> 
> Suppose we did... MNT_FORCE_PONIES. What would be the semantics of
> this
> flag?  Once we had it, would anyone ever want to use MNT_FORCE again?
> 
> MNT_FORCE is already fairly heavy handed.  It aborts an arbitrary
> collection of RPC requests being sent for the given filesystem,
> matter where else that filesystem might be mounted.
> Is it ever safe to use this flag unless you have good reason to
> believe
> that the server is not available and there is no point pretending any
> more?
> And if that is the case, why not use the new MNT_FORCE_PONIES which
> is
> at least predictable and reliable?
> 
> We've been talking a lot about the one NFS filesystem being mounted in
> multiple containers.  MNT_FORCE is already a problem for such mounts
> as
> one container can kill requests generated from another
> container.  Maybe
> MNT_FORCE needs to be restricted to "real" root.
> Once we restrict it, do we need to keep it from being too harsh?
> 
> What would be really nice is a timeout for umount, and for sync.
> The timeout only starts when the filesystem stops making progress for
> writeback.  If it eventually does timeout, then the caller can fall
> back
> to MNT_DETACH if they are in a container, or MNT_FORCE if not.
> (Maybe MNT_FORCE should map to MNT_DETACH in a container??? or maybe
> not).
> 
> There is a lot here that still isn't clear to me, but one thing does
> seem
> to be becoming clear:  MNT_FORCE as it stands is nearly useless and
> it
> would serve us well to find a semantic that is actually useful, and
> impose that.

Trying to keep the discussion going... does anyone else have thoughts
on this?

Thanks,
Joshua Watt
> 
> Thanks,
> NeilBrown

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-14 21:52             ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2017-12-14 21:52 UTC (permalink / raw)
  To: Joshua Watt, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, linux-api

[-- Attachment #1: Type: text/plain, Size: 5762 bytes --]

On Thu, Dec 14 2017, Joshua Watt wrote:

> On Fri, 2017-12-08 at 13:10 +1100, NeilBrown wrote:
>> On Wed, Dec 06 2017, Jeff Layton wrote:
>> 
>> > On Wed, 2017-12-06 at 10:34 +1100, NeilBrown wrote:
>> > > 
>> > > The new semantic for MNT_DETACH|MNT_FORCE is interesting.
>> > > As it was never possible before (from /bin/umount), it should be
>> > > safe to
>> > > add a new meaning.
>> > > The meaning is effectively "detach the filesystem from the
>> > > namespace and
>> > > detach the transport from the filesystem", which sounds like it
>> > > is
>> > > useful.
>> > > It is worth highlighting this, and maybe even cc:ing
>> > > linux-api@vger.kernel.org ... done that.
>> > > 
>> > 
>> > I'm not thrilled with the new flag combo, personally. Given that
>> > we're
>> > introducing new behavior here, I think it wouldn't hurt to add a
>> > new
>> > UMOUNT_* flag for this (UMOUNT_NUKE_FROM_ORBIT?).
>> 
>> Suppose we did... MNT_FORCE_PONIES. What would be the semantics of
>> this
>> flag?  Once we had it, would anyone ever want to use MNT_FORCE again?
>> 
>> MNT_FORCE is already fairly heavy handed.  It aborts an arbitrary
>> collection of RPC requests being sent for the given filesystem, no
>> matter where else that filesystem might be mounted.
>> Is it ever safe to use this flag unless you have good reason to
>> believe
>> that the server is not available and there is no point pretending any
>> more?
>> And if that is the case, why not use the new MNT_FORCE_PONIES which
>> is
>> at least predictable and reliable?
>> 
>> We've been talking a lot about the one NFS filesystem being mounted in
>> multiple containers.  MNT_FORCE is already a problem for such mounts
>> as
>> one container can kill requests generated from another
>> container.  Maybe
>> MNT_FORCE needs to be restricted to "real" root.
>> Once we restrict it, do we need to keep it from being too harsh?
>> 
>> What would be really nice is a timeout for umount, and for sync.
>> The timeout only starts when the filesystem stops making progress for
>> writeback.  If it eventually does timeout, then the caller can fall
>> back
>> to MNT_DETACH if they are in a container, or MNT_FORCE if not.
>> (Maybe MNT_FORCE should map to MNT_DETACH in a container??? or maybe
>> not).
>> 
>> There is a lot here that still isn't clear to me, but one thing does
>> seem
>> to be becoming clear:  MNT_FORCE as it stands is nearly useless and
>> it
>> would serve us well to find a semantic that is actually useful, and
>> impose that.
>
> Trying to keep the discussion going... does anyone else have thoughts
> on this?

It's a challenge, isn't it ... keeping people on-task to make forward
progress.
If only we could all meet in the canteen at 4pm every Friday and discuss
these things over drinks.  I don't suppose any of the video conference
tools support timeshifting, so we can each do 4pm in our own time
zone....

I would like to arrange that nothing can block indefinitely on
->s_umount.  This probably means that the various "flush data" calls
made under this lock need a timeout, or to be interruptible.
Then both umount and remount could be sure of getting ->s_umount
without undue delay.
Then I would like MNT_FORCE *not* to abort requests before trying to get
the lock, but instead to be passed down to ->kill_sb().
We probably cannot pass it explicitly, but could set a flag while
->s_umount is held.
This flag might be handled by generic_shutdown_super(), causing
it to purge any unwritten data, rather than call sync_filesystem().

This way, if the filesystem is mounted elsewhere, then the MNT_FORCE has
no effect.  If it is a final mount, then it cleans up properly.
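
To sketch that plumbing (SB_FORCE_SHUTDOWN is a name I'm making up
here, not an existing kernel flag, and the placement is only
illustrative):

 /* fs/namespace.c, do_umount(): record MNT_FORCE while holding the
  * lock instead of killing RPCs up front */
 down_write(&sb->s_umount);
 if (flags & MNT_FORCE)
         sb->s_iflags |= SB_FORCE_SHUTDOWN;      /* hypothetical flag */
 up_write(&sb->s_umount);

 /* fs/super.c, generic_shutdown_super(): honour the flag by skipping
  * writeback; dirty state is then discarded as inodes are evicted */
 if (!(sb->s_iflags & SB_FORCE_SHUTDOWN))
         sync_filesystem(sb);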

Your need to cause writes to start failing would be achieved by
performing a remount, either just setting "soft,retrans=0,timeo=1", or
by setting some special-purpose mount option.

In order for s_umount not to be held indefinitely, the generic things that
need to be fixed include:
 __writeback_inodes_wb() calls writeback_sb_inodes() under the lock.
    This needs to be interruptible
 Same for try_to_writeback_inodes_sb() -> __writeback_inodes_sb_nr()
 sys_sync and do_sync_work call iterate_supers() with various handlers,
 and these need to be interruptible.

and do_remount_sb needs to not block.

Finding a way to interrupt those writeback calls would be tricky,
especially as we need to trigger the interrupt without holding s_umount.

I really like the idea that an umount attempt would interrupt a sync().
Currently sync() can block indefinitely, which is occasionally
inconvenient.
If "umount" (or "umount -f" at least) on a filesystem would abort the
sync of that filesystem, take the lock and clean up more forcefully,
that would make for fairly clean shutdown processing.
1/ call sync() in a separate thread.
2/ wait until Dirty in /proc/meminfo stops changing
3/ umount -f every remaining filesystem.  Even if the
   umount fails, the sync will abort.
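
A userspace sketch of that sequence (the 5 second settle interval is
arbitrary, and step 3 aborting the sync is the proposed behaviour, not
what current kernels do):

 #include <pthread.h>
 #include <stdio.h>
 #include <unistd.h>
 #include <sys/mount.h>

 static void *sync_thread(void *arg)
 {
         sync();                 /* 1/ may block against a dead server */
         return arg;
 }

 static long dirty_kb(void)
 {
         char line[128];
         long kb = -1;
         FILE *f = fopen("/proc/meminfo", "r");

         if (!f)
                 return -1;
         while (fgets(line, sizeof(line), f))
                 if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
                         break;
         fclose(f);
         return kb;
 }

 int main(int argc, char **argv)
 {
         pthread_t t;
         long prev, cur;
         int i;

         if (argc < 2) {
                 fprintf(stderr, "usage: %s <mountpoint>...\n", argv[0]);
                 return 1;
         }
         pthread_create(&t, NULL, sync_thread, NULL);

         prev = dirty_kb();      /* 2/ wait for Dirty to settle */
         for (;;) {
                 sleep(5);
                 cur = dirty_kb();
                 if (cur == prev)
                         break;
                 prev = cur;
         }

         for (i = 1; i < argc; i++)      /* 3/ force-unmount the rest */
                 if (umount2(argv[i], MNT_FORCE))
                         perror(argv[i]);
         return 0;
 }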


Part of this effort would require making sure that SIGKILL really kills
processes blocked on filesystem IO.

So:
 1/ make sure all filesystem IO waits are TASK_KILLABLE (sketch below)
 2/ find a way to interrupt any write-back wait when there is a pending
    remount or umount.  Possibly the write-back thing would need to
    retry after the umount/remount, I'm not sure.
 3/ Cause MNT_FORCE to set a superblock flag, and have
    generic_shutdown_super() and/or ->kill_sb() interpret this flag to
    be very forceful
 4/ Possibly introduce new NFS mount option which causes all requests
    to fail
 5/ Teach NFS to support remount of this option, and of soft, retrans,
    timeo.
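
For 1/, the primitive already exists; the work is converting the waits,
along the lines of ("wq" and "done" stand in for the real wait queue
and condition):

 if (wait_event_killable(wq, done))
         return -ERESTARTSYS;    /* fatal signal (SIGKILL) pending */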

How does that sound?

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC v4 0/9] NFS Force Unmounting
@ 2017-12-18 21:48               ` Joshua Watt
  0 siblings, 0 replies; 32+ messages in thread
From: Joshua Watt @ 2017-12-18 21:48 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Trond Myklebust, J . Bruce Fields
  Cc: linux-nfs, Al Viro, David Howells, linux-api

On Fri, 2017-12-15 at 08:52 +1100, NeilBrown wrote:
> On Thu, Dec 14 2017, Joshua Watt wrote:
> 
> > On Fri, 2017-12-08 at 13:10 +1100, NeilBrown wrote:
> > > On Wed, Dec 06 2017, Jeff Layton wrote:
> > > 
> > > > On Wed, 2017-12-06 at 10:34 +1100, NeilBrown wrote:
> > > > > 
> > > > > The new semantic for MNT_DETACH|MNT_FORCE is interesting.
> > > > > As it was never possible before (from /bin/umount), it should
> > > > > be
> > > > > safe to
> > > > > add a new meaning.
> > > > > The meaning is effectively "detach the filesystem from the
> > > > > namespace and
> > > > > detach the transport from the filesystem", which sounds like
> > > > > it
> > > > > is
> > > > > useful.
> > > > > It is worth highlighting this, and maybe even cc:ing
> > > > > linux-api@vger.kernel.org ... done that.
> > > > > 
> > > > 
> > > > I'm not thrilled with the new flag combo, personally. Given
> > > > that
> > > > we're
> > > > introducing new behavior here, I think it wouldn't hurt to add
> > > > a
> > > > new
> > > > UMOUNT_* flag for this (UMOUNT_NUKE_FROM_ORBIT?).
> > > 
> > > Suppose we did... MNT_FORCE_PONIES. What would be the semantics
> > > of
> > > this
> > > flag?  Once we had it, would anyone ever want to use MNT_FORCE
> > > again?
> > > 
> > > MNT_FORCE is already fairly heavy handed.  It aborts an arbitrary
> > > collection of RPC requests being sent for the given filesystem,
> > > no
> > > matter where else that filesystem might be mounted.
> > > Is it ever safe to use this flag unless you have good reason to
> > > believe
> > > that the server is not available and there is no point pretending
> > > any
> > > more?
> > > And if that is the case, why not use the new MNT_FORCE_PONIES
> > > which
> > > is
> > > at least predictable and reliable?
> > > 
> > > We've been talking a lot about the one NFS filesystem being mounted in
> > > multiple containers.  MNT_FORCE is already a problem for such
> > > mounts
> > > as
> > > one container can kill requests generated from another
> > > container.  Maybe
> > > MNT_FORCE needs to be restricted to "real" root.
> > > Once we restrict it, do we need to keep it from being too harsh?
> > > 
> > > What would be really nice is a timeout for umount, and for sync.
> > > The timeout only starts when the filesystem stops making progress
> > > for
> > > writeback.  If it eventually does timeout, then the caller can
> > > fall
> > > back
> > > to MNT_DETACH if they are in a container, or MNT_FORCE if not.
> > > (Maybe MNT_FORCE should map to MNT_DETACH in a container??? or
> > > maybe
> > > not).
> > > 
> > > There is a lot here that still isn't clear to me, but one thing
> > > does
> > > seem
> > > to be becoming clear:  MNT_FORCE as it stands is nearly useless
> > > and
> > > it
> > > would serve us well to find a semantic that is actually useful,
> > > and
> > > impose that.
> > 
> > Trying to keep the discussion going... does anyone else have
> > thoughts
> > on this?
> 
> It's a challenge, isn't it ... keeping people on-task to make forward
> progress.
> If only we could all meet in the canteen at 4pm every Friday and
> discuss
> these things over drinks.  I don't suppose any of the video
> conference
> tools support timeshifting, so we can each do 4pm in our own time
> zone....

Well, you can all come to my house... it might be easier than building
a time machine. I'll provide the beer.

> 
> I would like to arrange that nothing can block indefinitely on
> ->s_umount.  This probably means that the various "flush data" calls
> made under this lock need a timeout, or to be interruptible.
> Then both umount and remount could be sure of getting ->s_umount
> without undue delay.
> Then I would like MNT_FORCE *not* to abort requests before trying to
> get
> the lock, but instead to be passed down to ->kill_sb().
> We probably cannot pass it explicitly, but could set a flag while
> ->s_umount is held.
> This flag might be handled by generic_shutdown_super(), causing
> it to purge any unwritten data, rather than call sync_filesystem().
> 
> This way, if the filesystem is mounted elsewhere, then the MNT_FORCE
> has
> no effect.  If it is a final mount, then it cleans up properly.

> Your need to cause writes to start failing would be achieved by
> performing a remount, either just setting "soft,retrans=0,timeo=1",
> or
> by setting some special-purpose mount option.

Hmm... If I have a mount option that does what I want and remount
doesn't block, what do I need MNT_FORCE for at all? All I would need
from userspace is:

 mount /mnt/nfs -o remount,serverdead
 umount -l /mnt/nfs
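
At the syscall level that is just the following ("serverdead" is the
hypothetical option from above; MS_REMOUNT and MNT_DETACH are the real
flags doing the work):

 #include <sys/mount.h>

 int kill_and_detach(const char *mnt)
 {
         if (mount(NULL, mnt, NULL, MS_REMOUNT, "serverdead"))
                 return -1;      /* relies on remount not blocking */
         return umount2(mnt, MNT_DETACH);
 }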

It does seem nice to know that this is the last place the superblock is
mounted in ->kill_sb()... But it seems like we replace MNT_FORCE having
consequences on a shared superblock with some mount options that
probably(?) have consequences on a shared superblock (and in the
process, make MNT_FORCE less useful?).

> In order for s_umount not to be held indefinitely, the generic things
> that
> need to be fixed include:
>  __writeback_inodes_wb() calls writeback_sb_inodes() under the lock.
>     This needs to be interruptible
>  Same for try_to_writeback_inodes_sb() -> __writeback_inodes_sb_nr()
>  sys_sync and do_sync_work call iterate_supers() with various
> handlers, and these need
>  to be interruptible.
> 
> and do_remount_sb needs to not block.
> 
> Finding a way to interrupt those writeback calls would be tricky,
> especially as we need to trigger the interrupt without holding
> s_umount.

I'll look into it more, but I don't think it would be terribly tricky.
AFAIK, rpc_killall_tasks() covers a lot of these cases.
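
For reference, a paraphrase of what nfs_umount_begin() in fs/nfs/super.c
does today, i.e. the one-shot kill in question:

 static void nfs_umount_begin(struct super_block *sb)
 {
         struct nfs_server *server = NFS_SB(sb);
         struct rpc_clnt *rpc;

         /* -EIO all pending RPCs, once, at this instant */
         rpc = server->client_acl;
         if (!IS_ERR(rpc))
                 rpc_killall_tasks(rpc);
         rpc = server->client;
         if (!IS_ERR(rpc))
                 rpc_killall_tasks(rpc);
 }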

> 
> I really like the idea that an umount attempt would interrupt a
> sync().
> Currently sync() can block indefinitely, which is occasionally
> inconvenient.
> If "umount" (or "umount -f" at least) on a filesystem would abort the
> sync of that filesystem, take the lock and clean up more forcefully,
> that would make for fairly clean shutdown processing.
> 1/ call sync() in a separate thread.
> 2/ wait until Dirty in /proc/meminfo stops changing
> 3/ umount -f every remaining filesystem.  Even if the
>    umount fails, the sync will abort.

Isn't that pretty close to the current behavior? I'm having a bit of
trouble reconciling this with "Then I would like MNT_FORCE *not* to
abort requests before trying to get the lock"... it sort of sounds like
here you *do* want umount (or maybe umount + MNT_FORCE) to abort a
sync()? Or are you trying to say that sync() should *only* be
interrupted in kill_sb()? Would you mind clarifying?

> 
> 
> Part of this effort would require making sure that SIGKILL really
> kills
> processes blocked on filesystem IO.

I get that making sure processes can be killed when blocked on I/O is a
really good thing, but I don't understand why it keeps getting rolled
into what I'm trying to do.... it seems like an orthogonal concern? You
can certainly do "better" force unmounting without having to SIGKILL
processes, and someone could surely make SIGKILL work for all I/O
without having to implement a full force unmount scheme?

I guess more directly.... SIGKILLing a process isn't necessary (or
really acceptable) for my use case. If I don't do that portion, is that
going to result in rejection of my patches, or am I missing something
important? 

> 
> So:
>  1/ make sure all filesystem IO waits are TASK_KILLABLE
>  2/ find a way to interrupt any write-back wait when there is a
> pending
>     remount or umount.  Possibly the write-back thing would need to
>     retry after the umount/remount, I'm not sure.
>  3/ Cause MNT_FORCE to set a superblock flag, and have
>     generic_shutdown_super() and/or ->kill_sb() interpret this flag
> to
>     be very forceful
>  4/ Possibly introduce new NFS mount option which causes all requests
>     to fail
>  5/ Teach NFS to support remount of this option, and of soft,
> retrans,
>     timeo.
> 
> How does that sound?

All right, let me reiterate where I think you are going with this:

Currently, MNT_FORCE is not particularly useful for actually force
unmounting, since it only cancels RPCs at a single point in time.
However, this would be a useful behavior to have during remount.
Specifically, do_remount_sb() should get this behavior so that any
"stuck" sync() calls that would block it from getting ->s_umount can be
cleaned up in a timely manner. If a sync() was canceled for this
reason, it might need to be restarted after the remount?

Once remount can be done without undue blocking, some combination of
mount options can be used to mark the server as "dead", thereby
preventing long blocking API calls going forward.

Finally, MNT_FORCE can be reworked. Instead of trying to clean things
up so that a mount *might* be able to be unmounted, wait until we
*know* the mount is getting removed (->kill_sb()) and then be aggressive
with cleaning up.



I'm not sure if this is really any different than MNT_FORCE_PONIES. Not
that I prefer one mechanism over the other... it just sort of seems
like they are all pretty similar. 

So far we have:

 1) umount with MNT_FORCE_PONIES - Only *real* root can do this, and it
destroys everything

 2) Remount (which won't block) with some "server dead" flag(s), then
umount with MNT_DETACH 

 3) Mount with a flag that optionally makes MNT_FORCE act like
MNT_FORCE_PONIES (e.g. my original proposal with the "transient" mount
flag).

Am I missing any? In all of these mechanisms, a superblock mounted in
multiple namespaces gets affected in some way. The real difference is
what that change looks like in the other namespaces. In all cases, the
admin doing these actions is going to have to be aware of how these
actions affect other namespaces... I don't know if we can save them
from that.

Thanks,
Joshua

> 
> NeilBrown

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-12-18 21:48 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-17 17:45 [RFC v4 0/9] NFS Force Unmounting Joshua Watt
2017-11-17 17:45 ` [RFC v4 1/9] SUNRPC: Add flag to kill new tasks Joshua Watt
2017-12-05 22:59   ` NeilBrown
2017-11-17 17:45 ` [RFC v4 2/9] SUNRPC: Expose kill_new_tasks in debugfs Joshua Watt
2017-11-17 17:45 ` [RFC v4 3/9] SUNRPC: Simplify client shutdown Joshua Watt
2017-11-17 17:45 ` [RFC v4 4/9] namespace: Add umount_end superblock operation Joshua Watt
2017-12-06 11:54   ` Jeff Layton
2017-12-06 12:14   ` Al Viro
2017-12-06 12:33     ` Al Viro
2017-12-06 15:41       ` Joshua Watt
2017-11-17 17:45 ` [RFC v4 5/9] NFS: Kill RPCs for the duration of umount Joshua Watt
2017-12-05 23:07   ` NeilBrown
2017-11-17 17:45 ` [RFC v4 6/9] NFS: Add debugfs for nfs_server and nfs_client Joshua Watt
2017-11-17 17:45 ` [RFC v4 7/9] NFS: Add transient mount option Joshua Watt
2017-12-06 12:23   ` Jeff Layton
2017-11-17 17:45 ` [RFC v4 8/9] NFS: Don't shared transient clients Joshua Watt
2017-11-17 17:45 ` [RFC v4 9/9] NFS: Kill all client RPCs if transient Joshua Watt
2017-12-04 14:36 ` [RFC v4 0/9] NFS Force Unmounting Joshua Watt
2017-12-05 23:34   ` NeilBrown
2017-12-06 13:03     ` Jeff Layton
2017-12-06 16:40       ` Joshua Watt
2017-12-08  2:10       ` NeilBrown
2017-12-14 18:22         ` Joshua Watt
2017-12-14 21:52           ` NeilBrown
2017-12-18 21:48             ` Joshua Watt
