All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
@ 2012-04-03 12:14 Jeff Layton
  2012-04-09 23:18 ` J. Bruce Fields
  2012-04-10 11:44 ` Stanislav Kinsbursky
  0 siblings, 2 replies; 29+ messages in thread
From: Jeff Layton @ 2012-04-03 12:14 UTC (permalink / raw)
  To: linux-nfs

The main reason for the grace period is to prevent the server from
allowing an operation that might otherwise be denied once the client has
reclaimed all of its stateful objects.

Currently, the grace period handling in the nfsd/lockd server code is
very simple. When the lock managers start, they stick an entry on a list
and set a timer. When the timers pop, then they remove the entry from
the list. The locks_in_grace check just looks to see if the list is
empty. If it is, then the grace period is considered to be over.

This is insufficient for a clustered filesystem that is being served
from multiple nodes at the same time. In such a configuration, the grace
period must be coordinated in some fashion, or else one node might hand
out stateful objects that conflict with those that have not yet been
reclaimed.

This patch paves the way for fixing this by adding a new export
operation called locks_in_grace that takes a superblock argument. The
existing locks_in_grace function is renamed to generic_locks_in_grace,
and a new locks_in_grace function that takes a superblock arg is added.
If a filesystem does not have a locks_in_grace export operation then the
generic version will be used.

Care has also been taken to reorder calls such that locks_in_grace is
called last in compound conditional statements. Handling this for
clustered filesystems may involve upcalls, so we don't want to call it
unnecessarily.

For now, this patch is just an RFC as I do not yet have any code that
overrides this function and am still specing out what that code should
look like.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/lockd/grace.c         |   23 +++++++++++++++++++++--
 fs/lockd/svc4proc.c      |   43 +++++++++++++++++++++----------------------
 fs/lockd/svclock.c       |    8 ++++----
 fs/lockd/svcproc.c       |   44 ++++++++++++++++++++++----------------------
 fs/nfsd/nfs4proc.c       |   12 +++++++-----
 fs/nfsd/nfs4state.c      |   32 ++++++++++++++++++--------------
 include/linux/exportfs.h |    6 ++++++
 include/linux/fs.h       |    3 ++-
 8 files changed, 101 insertions(+), 70 deletions(-)

diff --git a/fs/lockd/grace.c b/fs/lockd/grace.c
index 183cc1f..9faa613 100644
--- a/fs/lockd/grace.c
+++ b/fs/lockd/grace.c
@@ -4,6 +4,8 @@
 
 #include <linux/module.h>
 #include <linux/lockd/bind.h>
+#include <linux/fs.h>
+#include <linux/exportfs.h>
 
 static LIST_HEAD(grace_list);
 static DEFINE_SPINLOCK(grace_lock);
@@ -46,14 +48,31 @@ void locks_end_grace(struct lock_manager *lm)
 EXPORT_SYMBOL_GPL(locks_end_grace);
 
 /**
+ * generic_locks_in_grace
+ *
+ * Most filesystems don't require special handling for the grace period
+ * and just use this standard one.
+ */
+bool generic_locks_in_grace(void)
+{
+	return !list_empty(&grace_list);
+}
+EXPORT_SYMBOL_GPL(generic_locks_in_grace);
+
+/**
  * locks_in_grace
  *
  * Lock managers call this function to determine when it is OK for them
  * to answer ordinary lock requests, and when they should accept only
  * lock reclaims.
+ *
+ * Most filesystems won't define a locks_in_grace export op, but those
+ * that need special handling can.
  */
-int locks_in_grace(void)
+bool locks_in_grace(struct super_block *sb)
 {
-	return !list_empty(&grace_list);
+	if (sb->s_export_op && sb->s_export_op->locks_in_grace)
+		return sb->s_export_op->locks_in_grace(sb);
+	return generic_locks_in_grace();
 }
 EXPORT_SYMBOL_GPL(locks_in_grace);
diff --git a/fs/lockd/svc4proc.c b/fs/lockd/svc4proc.c
index 9a41fdc..6fe6fb7 100644
--- a/fs/lockd/svc4proc.c
+++ b/fs/lockd/svc4proc.c
@@ -150,16 +150,16 @@ nlm4svc_proc_cancel(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
+	/* Obtain client and file */
+	if ((resp->status = nlm4svc_retrieve_args(rqstp, argp, &host, &file)))
+		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
+
 	/* Don't accept requests during grace period */
-	if (locks_in_grace()) {
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
 		resp->status = nlm_lck_denied_grace_period;
 		return rpc_success;
 	}
 
-	/* Obtain client and file */
-	if ((resp->status = nlm4svc_retrieve_args(rqstp, argp, &host, &file)))
-		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
-
 	/* Try to cancel request. */
 	resp->status = nlmsvc_cancel_blocked(file, &argp->lock);
 
@@ -183,16 +183,16 @@ nlm4svc_proc_unlock(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
-	/* Don't accept new lock requests during grace period */
-	if (locks_in_grace()) {
-		resp->status = nlm_lck_denied_grace_period;
-		return rpc_success;
-	}
-
 	/* Obtain client and file */
 	if ((resp->status = nlm4svc_retrieve_args(rqstp, argp, &host, &file)))
 		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
 
+	/* Don't accept requests during grace period */
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
+		resp->status = nlm_lck_denied_grace_period;
+		return rpc_success;
+	}
+
 	/* Now try to remove the lock */
 	resp->status = nlmsvc_unlock(file, &argp->lock);
 
@@ -320,16 +320,15 @@ nlm4svc_proc_share(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
-	/* Don't accept new lock requests during grace period */
-	if (locks_in_grace() && !argp->reclaim) {
-		resp->status = nlm_lck_denied_grace_period;
-		return rpc_success;
-	}
-
 	/* Obtain client and file */
 	if ((resp->status = nlm4svc_retrieve_args(rqstp, argp, &host, &file)))
 		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
 
+	if (!argp->reclaim && locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
+		resp->status = nlm_lck_denied_grace_period;
+		return rpc_success;
+	}
+
 	/* Now try to create the share */
 	resp->status = nlmsvc_share_file(host, file, argp);
 
@@ -353,16 +352,16 @@ nlm4svc_proc_unshare(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
+	/* Obtain client and file */
+	if ((resp->status = nlm4svc_retrieve_args(rqstp, argp, &host, &file)))
+		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
+
 	/* Don't accept requests during grace period */
-	if (locks_in_grace()) {
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
 		resp->status = nlm_lck_denied_grace_period;
 		return rpc_success;
 	}
 
-	/* Obtain client and file */
-	if ((resp->status = nlm4svc_retrieve_args(rqstp, argp, &host, &file)))
-		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
-
 	/* Now try to lock the file */
 	resp->status = nlmsvc_unshare_file(host, file, argp);
 
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index e46353f..64d6c80 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -447,11 +447,11 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
 		goto out;
 	}
 
-	if (locks_in_grace() && !reclaim) {
+	if (!reclaim && locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
 		ret = nlm_lck_denied_grace_period;
 		goto out;
 	}
-	if (reclaim && !locks_in_grace()) {
+	if (reclaim && !locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
 		ret = nlm_lck_denied_grace_period;
 		goto out;
 	}
@@ -559,7 +559,7 @@ nlmsvc_testlock(struct svc_rqst *rqstp, struct nlm_file *file,
 		goto out;
 	}
 
-	if (locks_in_grace()) {
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
 		ret = nlm_lck_denied_grace_period;
 		goto out;
 	}
@@ -643,7 +643,7 @@ nlmsvc_cancel_blocked(struct nlm_file *file, struct nlm_lock *lock)
 				(long long)lock->fl.fl_start,
 				(long long)lock->fl.fl_end);
 
-	if (locks_in_grace())
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb))
 		return nlm_lck_denied_grace_period;
 
 	mutex_lock(&file->f_mutex);
diff --git a/fs/lockd/svcproc.c b/fs/lockd/svcproc.c
index d27aab1..17e03c5 100644
--- a/fs/lockd/svcproc.c
+++ b/fs/lockd/svcproc.c
@@ -180,16 +180,16 @@ nlmsvc_proc_cancel(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
+	/* Obtain client and file */
+	if ((resp->status = nlmsvc_retrieve_args(rqstp, argp, &host, &file)))
+		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
+
 	/* Don't accept requests during grace period */
-	if (locks_in_grace()) {
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
 		resp->status = nlm_lck_denied_grace_period;
 		return rpc_success;
 	}
 
-	/* Obtain client and file */
-	if ((resp->status = nlmsvc_retrieve_args(rqstp, argp, &host, &file)))
-		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
-
 	/* Try to cancel request. */
 	resp->status = cast_status(nlmsvc_cancel_blocked(file, &argp->lock));
 
@@ -213,16 +213,16 @@ nlmsvc_proc_unlock(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
-	/* Don't accept new lock requests during grace period */
-	if (locks_in_grace()) {
-		resp->status = nlm_lck_denied_grace_period;
-		return rpc_success;
-	}
-
 	/* Obtain client and file */
 	if ((resp->status = nlmsvc_retrieve_args(rqstp, argp, &host, &file)))
 		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
 
+	/* Don't accept requests during grace period */
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
+		resp->status = nlm_lck_denied_grace_period;
+		return rpc_success;
+	}
+
 	/* Now try to remove the lock */
 	resp->status = cast_status(nlmsvc_unlock(file, &argp->lock));
 
@@ -360,16 +360,16 @@ nlmsvc_proc_share(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
-	/* Don't accept new lock requests during grace period */
-	if (locks_in_grace() && !argp->reclaim) {
-		resp->status = nlm_lck_denied_grace_period;
-		return rpc_success;
-	}
-
 	/* Obtain client and file */
 	if ((resp->status = nlmsvc_retrieve_args(rqstp, argp, &host, &file)))
 		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
 
+	/* Don't accept requests during grace period */
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
+		resp->status = nlm_lck_denied_grace_period;
+		return rpc_success;
+	}
+
 	/* Now try to create the share */
 	resp->status = cast_status(nlmsvc_share_file(host, file, argp));
 
@@ -393,16 +393,16 @@ nlmsvc_proc_unshare(struct svc_rqst *rqstp, struct nlm_args *argp,
 
 	resp->cookie = argp->cookie;
 
+	/* Obtain client and file */
+	if ((resp->status = nlmsvc_retrieve_args(rqstp, argp, &host, &file)))
+		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
+
 	/* Don't accept requests during grace period */
-	if (locks_in_grace()) {
+	if (locks_in_grace(file->f_file->f_path.dentry->d_sb)) {
 		resp->status = nlm_lck_denied_grace_period;
 		return rpc_success;
 	}
 
-	/* Obtain client and file */
-	if ((resp->status = nlmsvc_retrieve_args(rqstp, argp, &host, &file)))
-		return resp->status == nlm_drop_reply ? rpc_drop_reply :rpc_success;
-
 	/* Now try to unshare the file */
 	resp->status = cast_status(nlmsvc_unshare_file(host, file, argp));
 
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 2ed14df..351f7e3 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -354,10 +354,12 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	/* Openowner is now set, so sequence id will get bumped.  Now we need
 	 * these checks before we do any creates: */
 	status = nfserr_grace;
-	if (locks_in_grace() && open->op_claim_type != NFS4_OPEN_CLAIM_PREVIOUS)
+	if (open->op_claim_type != NFS4_OPEN_CLAIM_PREVIOUS &&
+	    locks_in_grace(cstate->current_fh.fh_dentry->d_sb))
 		goto out;
 	status = nfserr_no_grace;
-	if (!locks_in_grace() && open->op_claim_type == NFS4_OPEN_CLAIM_PREVIOUS)
+	if (open->op_claim_type == NFS4_OPEN_CLAIM_PREVIOUS &&
+	    !locks_in_grace(cstate->current_fh.fh_dentry->d_sb))
 		goto out;
 
 	switch (open->op_claim_type) {
@@ -741,7 +743,7 @@ nfsd4_remove(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 {
 	__be32 status;
 
-	if (locks_in_grace())
+	if (locks_in_grace(cstate->current_fh.fh_dentry->d_sb))
 		return nfserr_grace;
 	status = nfsd_unlink(rqstp, &cstate->current_fh, 0,
 			     remove->rm_name, remove->rm_namelen);
@@ -760,8 +762,8 @@ nfsd4_rename(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 
 	if (!cstate->save_fh.fh_dentry)
 		return status;
-	if (locks_in_grace() && !(cstate->save_fh.fh_export->ex_flags
-					& NFSEXP_NOSUBTREECHECK))
+	if (!(cstate->save_fh.fh_export->ex_flags & NFSEXP_NOSUBTREECHECK) &&
+	    locks_in_grace(cstate->save_fh.fh_dentry->d_sb))
 		return nfserr_grace;
 	status = nfsd_rename(rqstp, &cstate->save_fh, rename->rn_sname,
 			     rename->rn_snamelen, &cstate->current_fh,
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index a822e31..3f3d9f0 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2938,7 +2938,7 @@ nfs4_open_delegation(struct svc_fh *fh, struct nfsd4_open *open, struct nfs4_ol_
 		case NFS4_OPEN_CLAIM_NULL:
 			/* Let's not give out any delegations till everyone's
 			 * had the chance to reclaim theirs.... */
-			if (locks_in_grace())
+			if (locks_in_grace(fh->fh_dentry->d_sb))
 				goto out;
 			if (!cb_up || !(oo->oo_flags & NFS4_OO_CONFIRMED))
 				goto out;
@@ -3183,7 +3183,7 @@ nfs4_laundromat(void)
 	nfs4_lock_state();
 
 	dprintk("NFSD: laundromat service - starting\n");
-	if (locks_in_grace())
+	if (generic_locks_in_grace())
 		nfsd4_end_grace();
 	INIT_LIST_HEAD(&reaplist);
 	spin_lock(&client_lock);
@@ -3312,7 +3312,7 @@ check_special_stateids(svc_fh *current_fh, stateid_t *stateid, int flags)
 {
 	if (ONE_STATEID(stateid) && (flags & RD_STATE))
 		return nfs_ok;
-	else if (locks_in_grace()) {
+	else if (locks_in_grace(current_fh->fh_dentry->d_sb)) {
 		/* Answer in remaining cases depends on existence of
 		 * conflicting state; so we must wait out the grace period. */
 		return nfserr_grace;
@@ -3331,7 +3331,7 @@ check_special_stateids(svc_fh *current_fh, stateid_t *stateid, int flags)
 static inline int
 grace_disallows_io(struct inode *inode)
 {
-	return locks_in_grace() && mandatory_lock(inode);
+	return mandatory_lock(inode) && locks_in_grace(inode->i_sb);
 }
 
 /* Returns true iff a is later than b: */
@@ -4128,13 +4128,6 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	if (status)
 		goto out;
 
-	status = nfserr_grace;
-	if (locks_in_grace() && !lock->lk_reclaim)
-		goto out;
-	status = nfserr_no_grace;
-	if (!locks_in_grace() && lock->lk_reclaim)
-		goto out;
-
 	locks_init_lock(&file_lock);
 	switch (lock->lk_type) {
 		case NFS4_READ_LT:
@@ -4159,6 +4152,14 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		status = nfserr_openmode;
 		goto out;
 	}
+
+	status = nfserr_grace;
+	if (!lock->lk_reclaim && locks_in_grace(filp->f_path.dentry->d_sb))
+		goto out;
+	status = nfserr_no_grace;
+	if (lock->lk_reclaim && !locks_in_grace(filp->f_path.dentry->d_sb))
+		goto out;
+
 	file_lock.fl_owner = (fl_owner_t)lock_sop;
 	file_lock.fl_pid = current->tgid;
 	file_lock.fl_file = filp;
@@ -4235,9 +4236,6 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	int error;
 	__be32 status;
 
-	if (locks_in_grace())
-		return nfserr_grace;
-
 	if (check_lock_length(lockt->lt_offset, lockt->lt_length))
 		 return nfserr_inval;
 
@@ -4251,6 +4249,12 @@ nfsd4_lockt(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		goto out;
 
 	inode = cstate->current_fh.fh_dentry->d_inode;
+
+	if (locks_in_grace(inode->i_sb)) {
+		status = nfserr_grace;
+		goto out;
+	}
+
 	locks_init_lock(&file_lock);
 	switch (lockt->lt_type) {
 		case NFS4_READ_LT:
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 3a4cef5..05c29b5 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -159,6 +159,11 @@ struct fid {
  * commit_metadata:
  *    @commit_metadata should commit metadata changes to stable storage.
  *
+ * locks_in_grace:
+ *    @locks_in_grace should return whether the grace period is active for the
+ *    given super_block. This function is optional and if not provided then
+ *    generic_locks_in_grace will be used.
+ *
  * Locking rules:
  *    get_parent is called with child->d_inode->i_mutex down
  *    get_name is not (which is possibly inconsistent)
@@ -175,6 +180,7 @@ struct export_operations {
 			struct dentry *child);
 	struct dentry * (*get_parent)(struct dentry *child);
 	int (*commit_metadata)(struct inode *inode);
+	bool (*locks_in_grace)(struct super_block *sb);
 };
 
 extern int exportfs_encode_fh(struct dentry *dentry, struct fid *fid,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 135693e..fcfa51a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1130,7 +1130,8 @@ struct lock_manager {
 
 void locks_start_grace(struct lock_manager *);
 void locks_end_grace(struct lock_manager *);
-int locks_in_grace(void);
+bool generic_locks_in_grace(void);
+bool locks_in_grace(struct super_block *sb);
 
 /* that will die - we need it for nfs_lock_info */
 #include <linux/nfs_fs_i.h>
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-03 12:14 [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg Jeff Layton
@ 2012-04-09 23:18 ` J. Bruce Fields
  2012-04-10 11:13   ` Jeff Layton
  2012-04-10 11:44 ` Stanislav Kinsbursky
  1 sibling, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-09 23:18 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

On Tue, Apr 03, 2012 at 08:14:39AM -0400, Jeff Layton wrote:
> The main reason for the grace period is to prevent the server from
> allowing an operation that might otherwise be denied once the client has
> reclaimed all of its stateful objects.
> 
> Currently, the grace period handling in the nfsd/lockd server code is
> very simple. When the lock managers start, they stick an entry on a list
> and set a timer. When the timers pop, then they remove the entry from
> the list. The locks_in_grace check just looks to see if the list is
> empty. If it is, then the grace period is considered to be over.
> 
> This is insufficient for a clustered filesystem that is being served
> from multiple nodes at the same time. In such a configuration, the grace
> period must be coordinated in some fashion, or else one node might hand
> out stateful objects that conflict with those that have not yet been
> reclaimed.
> 
> This patch paves the way for fixing this by adding a new export
> operation called locks_in_grace that takes a superblock argument. The
> existing locks_in_grace function is renamed to generic_locks_in_grace,
> and a new locks_in_grace function that takes a superblock arg is added.
> If a filesystem does not have a locks_in_grace export operation then the
> generic version will be used.

Looks more or less OK to me....

> Care has also been taken to reorder calls such that locks_in_grace is
> called last in compound conditional statements. Handling this for
> clustered filesystems may involve upcalls, so we don't want to call it
> unnecessarily.

Even if we're careful to do the check last, we potentially still have to
do it on every otherwise-succesful open and lock operation.

And really I don't think it's too much to ask that this be fast.

> @@ -3183,7 +3183,7 @@ nfs4_laundromat(void)
>  	nfs4_lock_state();
>  
>  	dprintk("NFSD: laundromat service - starting\n");
> -	if (locks_in_grace())
> +	if (generic_locks_in_grace())
>  		nfsd4_end_grace();

Looking at the code.... This is really just checking whether we've ended
our own grace period.  The laundromat's scheduled to run a grace period
after startup.  So I think we should just make this:

	static bool grace_ended = false;

	if (!grace_ended) {
		grace_ended = true;
		nfsd4_end_grace();
	}

or something.  No reason not to do that now.

(Hm, and maybe there's a reason to: locks_in_grace() could in theory
still return true on a second run of nfs4_laundromat(), but
nfsd4_end_grace() probably shouldn't really be run twice?)

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-09 23:18 ` J. Bruce Fields
@ 2012-04-10 11:13   ` Jeff Layton
  2012-04-10 13:18     ` J. Bruce Fields
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2012-04-10 11:13 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

On Mon, 9 Apr 2012 19:18:49 -0400
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Tue, Apr 03, 2012 at 08:14:39AM -0400, Jeff Layton wrote:
> > The main reason for the grace period is to prevent the server from
> > allowing an operation that might otherwise be denied once the client has
> > reclaimed all of its stateful objects.
> > 
> > Currently, the grace period handling in the nfsd/lockd server code is
> > very simple. When the lock managers start, they stick an entry on a list
> > and set a timer. When the timers pop, then they remove the entry from
> > the list. The locks_in_grace check just looks to see if the list is
> > empty. If it is, then the grace period is considered to be over.
> > 
> > This is insufficient for a clustered filesystem that is being served
> > from multiple nodes at the same time. In such a configuration, the grace
> > period must be coordinated in some fashion, or else one node might hand
> > out stateful objects that conflict with those that have not yet been
> > reclaimed.
> > 
> > This patch paves the way for fixing this by adding a new export
> > operation called locks_in_grace that takes a superblock argument. The
> > existing locks_in_grace function is renamed to generic_locks_in_grace,
> > and a new locks_in_grace function that takes a superblock arg is added.
> > If a filesystem does not have a locks_in_grace export operation then the
> > generic version will be used.
> 
> Looks more or less OK to me....
> 
> > Care has also been taken to reorder calls such that locks_in_grace is
> > called last in compound conditional statements. Handling this for
> > clustered filesystems may involve upcalls, so we don't want to call it
> > unnecessarily.
> 
> Even if we're careful to do the check last, we potentially still have to
> do it on every otherwise-succesful open and lock operation.
> 
> And really I don't think it's too much to ask that this be fast.
> 

Yes, FS implementers should expect that this could get called
frequently and ensure that it doesn't generate undue load.

I'd expect that any that do this via an upcall would ratelimit it in
some fashion during the grace period. They'd then set a flag or
something in the superblock afterward so they wouldn't need to upcall
anymore once it ends.

I'd rather push those smarts into the filesystems for now though in
order to allow for more flexibility. There are potential designs where
a fs could end up back in grace after initially leaving it and we
should allow for that.

> > @@ -3183,7 +3183,7 @@ nfs4_laundromat(void)
> >  	nfs4_lock_state();
> >  
> >  	dprintk("NFSD: laundromat service - starting\n");
> > -	if (locks_in_grace())
> > +	if (generic_locks_in_grace())
> >  		nfsd4_end_grace();
> 
> Looking at the code.... This is really just checking whether we've ended
> our own grace period.  The laundromat's scheduled to run a grace period
> after startup.  So I think we should just make this:
> 
> 	static bool grace_ended = false;
> 
> 	if (!grace_ended) {
> 		grace_ended = true;
> 		nfsd4_end_grace();
> 	}
> 
> or something.  No reason not to do that now.
> 
> (Hm, and maybe there's a reason to: locks_in_grace() could in theory
> still return true on a second run of nfs4_laundromat(), but
> nfsd4_end_grace() probably shouldn't really be run twice?)
> 

Most of the things that nfsd4_end_grace does should be safe to run
twice. The exception is nfsd4_recdir_purge_old which could be bad news.
So, doing what you suggest looks reasonable. I'll add that into the next
respin.

Thanks for having a look!
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-03 12:14 [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg Jeff Layton
  2012-04-09 23:18 ` J. Bruce Fields
@ 2012-04-10 11:44 ` Stanislav Kinsbursky
  2012-04-10 12:05   ` Jeff Layton
  2012-04-10 12:16   ` Jeff Layton
  1 sibling, 2 replies; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 11:44 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

03.04.2012 16:14, Jeff Layton пишет:
> The main reason for the grace period is to prevent the server from
> allowing an operation that might otherwise be denied once the client has
> reclaimed all of its stateful objects.
>
> Currently, the grace period handling in the nfsd/lockd server code is
> very simple. When the lock managers start, they stick an entry on a list
> and set a timer. When the timers pop, then they remove the entry from
> the list. The locks_in_grace check just looks to see if the list is
> empty. If it is, then the grace period is considered to be over.
>
> This is insufficient for a clustered filesystem that is being served
> from multiple nodes at the same time. In such a configuration, the grace
> period must be coordinated in some fashion, or else one node might hand
> out stateful objects that conflict with those that have not yet been
> reclaimed.
>
> This patch paves the way for fixing this by adding a new export
> operation called locks_in_grace that takes a superblock argument. The
> existing locks_in_grace function is renamed to generic_locks_in_grace,
> and a new locks_in_grace function that takes a superblock arg is added.
> If a filesystem does not have a locks_in_grace export operation then the
> generic version will be used.
>
> Care has also been taken to reorder calls such that locks_in_grace is
> called last in compound conditional statements. Handling this for
> clustered filesystems may involve upcalls, so we don't want to call it
> unnecessarily.
>
> For now, this patch is just an RFC as I do not yet have any code that
> overrides this function and am still specing out what that code should
> look like.
>

Oops, I've noticed your patch after I replied in "Grace period" thread.
This patch looks good, but doesn't explain, how this per-filesystem logic will 
work in case of sharing non-nested subdirectories with the same superblock.
This is a valid situation. But how to handle grace period in this case?
Also, don't we need to prevent of exporting the same file system parts but 
different servers always, but not only for grace period?

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 11:44 ` Stanislav Kinsbursky
@ 2012-04-10 12:05   ` Jeff Layton
  2012-04-10 12:18     ` Stanislav Kinsbursky
  2012-04-10 12:16   ` Jeff Layton
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2012-04-10 12:05 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: linux-nfs

On Tue, 10 Apr 2012 15:44:42 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 03.04.2012 16:14, Jeff Layton пишет:
> > The main reason for the grace period is to prevent the server from
> > allowing an operation that might otherwise be denied once the client has
> > reclaimed all of its stateful objects.
> >
> > Currently, the grace period handling in the nfsd/lockd server code is
> > very simple. When the lock managers start, they stick an entry on a list
> > and set a timer. When the timers pop, then they remove the entry from
> > the list. The locks_in_grace check just looks to see if the list is
> > empty. If it is, then the grace period is considered to be over.
> >
> > This is insufficient for a clustered filesystem that is being served
> > from multiple nodes at the same time. In such a configuration, the grace
> > period must be coordinated in some fashion, or else one node might hand
> > out stateful objects that conflict with those that have not yet been
> > reclaimed.
> >
> > This patch paves the way for fixing this by adding a new export
> > operation called locks_in_grace that takes a superblock argument. The
> > existing locks_in_grace function is renamed to generic_locks_in_grace,
> > and a new locks_in_grace function that takes a superblock arg is added.
> > If a filesystem does not have a locks_in_grace export operation then the
> > generic version will be used.
> >
> > Care has also been taken to reorder calls such that locks_in_grace is
> > called last in compound conditional statements. Handling this for
> > clustered filesystems may involve upcalls, so we don't want to call it
> > unnecessarily.
> >
> > For now, this patch is just an RFC as I do not yet have any code that
> > overrides this function and am still specing out what that code should
> > look like.
> >
> 
> Oops, I've noticed your patch after I replied in "Grace period" thread.
> This patch looks good, but doesn't explain, how this per-filesystem logic will 
> work in case of sharing non-nested subdirectories with the same superblock.
> This is a valid situation. But how to handle grace period in this case?


It's a valid situation but one that's discouraged.
> Also, don't we need to prevent of exporting the same file system parts but 
> different servers always, but not only for grace period?
> 


-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 11:44 ` Stanislav Kinsbursky
  2012-04-10 12:05   ` Jeff Layton
@ 2012-04-10 12:16   ` Jeff Layton
  2012-04-10 12:46     ` Stanislav Kinsbursky
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2012-04-10 12:16 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: linux-nfs

On Tue, 10 Apr 2012 15:44:42 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 03.04.2012 16:14, Jeff Layton пишет:
> > The main reason for the grace period is to prevent the server from
> > allowing an operation that might otherwise be denied once the client has
> > reclaimed all of its stateful objects.
> >
> > Currently, the grace period handling in the nfsd/lockd server code is
> > very simple. When the lock managers start, they stick an entry on a list
> > and set a timer. When the timers pop, then they remove the entry from
> > the list. The locks_in_grace check just looks to see if the list is
> > empty. If it is, then the grace period is considered to be over.
> >
> > This is insufficient for a clustered filesystem that is being served
> > from multiple nodes at the same time. In such a configuration, the grace
> > period must be coordinated in some fashion, or else one node might hand
> > out stateful objects that conflict with those that have not yet been
> > reclaimed.
> >
> > This patch paves the way for fixing this by adding a new export
> > operation called locks_in_grace that takes a superblock argument. The
> > existing locks_in_grace function is renamed to generic_locks_in_grace,
> > and a new locks_in_grace function that takes a superblock arg is added.
> > If a filesystem does not have a locks_in_grace export operation then the
> > generic version will be used.
> >
> > Care has also been taken to reorder calls such that locks_in_grace is
> > called last in compound conditional statements. Handling this for
> > clustered filesystems may involve upcalls, so we don't want to call it
> > unnecessarily.
> >
> > For now, this patch is just an RFC as I do not yet have any code that
> > overrides this function and am still specing out what that code should
> > look like.
> >
> 

(sorry about the earlier truncated reply, my MUA has a mind of its own
this morning)

> Oops, I've noticed your patch after I replied in "Grace period" thread.
> This patch looks good, but doesn't explain, how this per-filesystem logic will 
> work in case of sharing non-nested subdirectories with the same superblock.
> This is a valid situation. But how to handle grace period in this case?

TBH, I haven't considered that in depth. That is a valid situation, but
one that's discouraged. It's very difficult (and expensive) to
sequester off portions of a filesystem for serving.

A filehandle is somewhat analogous to a device/inode combination. When
the server gets a filehandle, it has to determine "is this within a
path that's exported to this host"? That process is called subtree
checking. It's expensive and difficult to handle. It's always better to
export along filesystem boundaries.

My suggestion would be to simply not deal with those cases in this
patch. Possibly we could force no_subtree_check when we export an fs
with a locks_in_grace option defined.

> Also, don't we need to prevent of exporting the same file system parts but 
> different servers always, but not only for grace period?
> 

I'm not sure I understand what you're asking here. Were you referring
to my suggestion earlier of not allowing the export of the same
filesystem from more than one container? If so, then yes that would
apply before and after the grace period ends.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 12:05   ` Jeff Layton
@ 2012-04-10 12:18     ` Stanislav Kinsbursky
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 12:18 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

10.04.2012 16:05, Jeff Layton пишет:
> On Tue, 10 Apr 2012 15:44:42 +0400
> Stanislav Kinsbursky<skinsbursky@parallels.com>  wrote:
>
>> 03.04.2012 16:14, Jeff Layton пишет:
>>> The main reason for the grace period is to prevent the server from
>>> allowing an operation that might otherwise be denied once the client has
>>> reclaimed all of its stateful objects.
>>>
>>> Currently, the grace period handling in the nfsd/lockd server code is
>>> very simple. When the lock managers start, they stick an entry on a list
>>> and set a timer. When the timers pop, then they remove the entry from
>>> the list. The locks_in_grace check just looks to see if the list is
>>> empty. If it is, then the grace period is considered to be over.
>>>
>>> This is insufficient for a clustered filesystem that is being served
>>> from multiple nodes at the same time. In such a configuration, the grace
>>> period must be coordinated in some fashion, or else one node might hand
>>> out stateful objects that conflict with those that have not yet been
>>> reclaimed.
>>>
>>> This patch paves the way for fixing this by adding a new export
>>> operation called locks_in_grace that takes a superblock argument. The
>>> existing locks_in_grace function is renamed to generic_locks_in_grace,
>>> and a new locks_in_grace function that takes a superblock arg is added.
>>> If a filesystem does not have a locks_in_grace export operation then the
>>> generic version will be used.
>>>
>>> Care has also been taken to reorder calls such that locks_in_grace is
>>> called last in compound conditional statements. Handling this for
>>> clustered filesystems may involve upcalls, so we don't want to call it
>>> unnecessarily.
>>>
>>> For now, this patch is just an RFC as I do not yet have any code that
>>> overrides this function and am still specing out what that code should
>>> look like.
>>>
>>
>> Oops, I've noticed your patch after I replied in "Grace period" thread.
>> This patch looks good, but doesn't explain, how this per-filesystem logic will
>> work in case of sharing non-nested subdirectories with the same superblock.
>> This is a valid situation. But how to handle grace period in this case?
>
>
> It's a valid situation but one that's discouraged.

But it looks like common case. In future we are going to get rid of using one 
mounted file system for more than one virtual environment, but currently in 
OpenVZ we do exactly this.
And it looks like LXC uses chroot for containers as well...

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 12:16   ` Jeff Layton
@ 2012-04-10 12:46     ` Stanislav Kinsbursky
  2012-04-10 13:39       ` Jeff Layton
  2012-04-10 20:22       ` J. Bruce Fields
  0 siblings, 2 replies; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 12:46 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

10.04.2012 16:16, Jeff Layton пишет:
> On Tue, 10 Apr 2012 15:44:42 +0400
>
> (sorry about the earlier truncated reply, my MUA has a mind of its own
> this morning)
>

OK then. Previous letter confused me a bit.

>
> TBH, I haven't considered that in depth. That is a valid situation, but
> one that's discouraged. It's very difficult (and expensive) to
> sequester off portions of a filesystem for serving.
>
> A filehandle is somewhat analogous to a device/inode combination. When
> the server gets a filehandle, it has to determine "is this within a
> path that's exported to this host"? That process is called subtree
> checking. It's expensive and difficult to handle. It's always better to
> export along filesystem boundaries.
>
> My suggestion would be to simply not deal with those cases in this
> patch. Possibly we could force no_subtree_check when we export an fs
> with a locks_in_grace option defined.
>

Sorry, but without dealing with those cases your patch looks a bit... Useless.
I.e. it changes nothing, it there will be no support from file systems, going to 
be exported.
But how are you going to push developers to implement these calls? Or, even if 
you'll try to implement them by yourself, how they will looks like?
Simple check only for superblock looks bad to me, because any other start of 
NFSd will lead to grace period for all other containers (which uses the same 
filesystem).

>> Also, don't we need to prevent of exporting the same file system parts but
>> different servers always, but not only for grace period?
>>
>
> I'm not sure I understand what you're asking here. Were you referring
> to my suggestion earlier of not allowing the export of the same
> filesystem from more than one container? If so, then yes that would
> apply before and after the grace period ends.
>

I was talking about preventing of exporting intersecting directories by 
different server.
IOW, exporting of the same file system by different NFS server is allowed, but 
only if their exporting directories doesn't intersect.
This check is expensive (as you mentioned), but have to be done only once on NFS 
server start.
With this solution, grace period can simple, and no support from exporting file 
system is required.
But the main problem here is that such intersections can be checked only in 
initial file system environment (containers with it's own roots, gained via 
chroot, can't handle this situation).
So, it means, that there have to be some daemon (kernel or user space), which 
will handle such requests from different NFS server instances... Which in turn 
means, that some way of communication between this daemon and NFS servers is 
required. And unix (any of them) sockets doesn't suits here, which makes this 
problem more difficult.

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 11:13   ` Jeff Layton
@ 2012-04-10 13:18     ` J. Bruce Fields
  0 siblings, 0 replies; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-10 13:18 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

On Tue, Apr 10, 2012 at 07:13:17AM -0400, Jeff Layton wrote:
> Yes, FS implementers should expect that this could get called
> frequently and ensure that it doesn't generate undue load.
> 
> I'd expect that any that do this via an upcall would ratelimit it in
> some fashion during the grace period. They'd then set a flag or
> something in the superblock afterward so they wouldn't need to upcall
> anymore once it ends.
> 
> I'd rather push those smarts into the filesystems for now though in
> order to allow for more flexibility. There are potential designs where
> a fs could end up back in grace after initially leaving it and we
> should allow for that.

Even then a grace period transition should be rare, so I'd think they'd
want to notify the kernel on the transition rather than polling?

> 
> > > @@ -3183,7 +3183,7 @@ nfs4_laundromat(void)
> > >  	nfs4_lock_state();
> > >  
> > >  	dprintk("NFSD: laundromat service - starting\n");
> > > -	if (locks_in_grace())
> > > +	if (generic_locks_in_grace())
> > >  		nfsd4_end_grace();
> > 
> > Looking at the code.... This is really just checking whether we've ended
> > our own grace period.  The laundromat's scheduled to run a grace period
> > after startup.  So I think we should just make this:
> > 
> > 	static bool grace_ended = false;
> > 
> > 	if (!grace_ended) {
> > 		grace_ended = true;
> > 		nfsd4_end_grace();
> > 	}
> > 
> > or something.  No reason not to do that now.
> > 
> > (Hm, and maybe there's a reason to: locks_in_grace() could in theory
> > still return true on a second run of nfs4_laundromat(), but
> > nfsd4_end_grace() probably shouldn't really be run twice?)
> > 
> 
> Most of the things that nfsd4_end_grace does should be safe to run
> twice. The exception is nfsd4_recdir_purge_old which could be bad news.
> So, doing what you suggest looks reasonable. I'll add that into the next
> respin.

Thanks; that at least I can merge whenever it's ready.

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 12:46     ` Stanislav Kinsbursky
@ 2012-04-10 13:39       ` Jeff Layton
  2012-04-10 14:52         ` Stanislav Kinsbursky
  2012-04-10 20:22       ` J. Bruce Fields
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2012-04-10 13:39 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: linux-nfs

On Tue, 10 Apr 2012 16:46:38 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 10.04.2012 16:16, Jeff Layton пишет:
> > On Tue, 10 Apr 2012 15:44:42 +0400
> >
> > (sorry about the earlier truncated reply, my MUA has a mind of its own
> > this morning)
> >
> 
> OK then. Previous letter confused me a bit.
> 
> >
> > TBH, I haven't considered that in depth. That is a valid situation, but
> > one that's discouraged. It's very difficult (and expensive) to
> > sequester off portions of a filesystem for serving.
> >
> > A filehandle is somewhat analogous to a device/inode combination. When
> > the server gets a filehandle, it has to determine "is this within a
> > path that's exported to this host"? That process is called subtree
> > checking. It's expensive and difficult to handle. It's always better to
> > export along filesystem boundaries.
> >
> > My suggestion would be to simply not deal with those cases in this
> > patch. Possibly we could force no_subtree_check when we export an fs
> > with a locks_in_grace option defined.
> >
> 
> Sorry, but without dealing with those cases your patch looks a bit... Useless.
> I.e. it changes nothing, it there will be no support from file systems, going to 
> be exported.
> But how are you going to push developers to implement these calls? Or, even if 
> you'll try to implement them by yourself, how they will looks like?
> Simple check only for superblock looks bad to me, because any other start of 
> NFSd will lead to grace period for all other containers (which uses the same 
> filesystem).
> 

Changing nothing was sort of the point. The idea was to allow
filesystems to override this if they choose. The main impetus here was
to allow clustered filesystems to handle this in a different fashion to
allow them to do active/active serving from multiple nodes. I wasn't
considering the container use-case when I spun this up last week...

Now that said, we probably can accommodate containers with this too.
Perhaps we could consider passing in a sb+namespace tuple eventually?

> >> Also, don't we need to prevent of exporting the same file system parts but
> >> different servers always, but not only for grace period?
> >>
> >
> > I'm not sure I understand what you're asking here. Were you referring
> > to my suggestion earlier of not allowing the export of the same
> > filesystem from more than one container? If so, then yes that would
> > apply before and after the grace period ends.
> >
> 
> I was talking about preventing of exporting intersecting directories by 
> different server.
> IOW, exporting of the same file system by different NFS server is allowed, but 
> only if their exporting directories doesn't intersect.

Doesn't that require that the containers are aware of each other to
some degree? Or are you considering doing this in the kernel?

If the latter, then there's another problem. The export table is kept
in userspace (in mountd) and the kernel only upcalls for it as needed.

You'll need to change that overall design if you want the kernel to do
this enforcement.

> This check is expensive (as you mentioned), but have to be done only once on NFS 
> server start.

Well, no. The subtree check happens every time nfsd processes a
filehandle -- see nfsd_acceptable().

Basically we have to turn the filehandle into a dentry and then walk
back up to the directory that's exported to verify that it is within
the correct subtree. If that fails, then we might have to do it more
than once if it's a hardlinked file.

> With this solution, grace period can simple, and no support from exporting file 
> system is required.
> But the main problem here is that such intersections can be checked only in 
> initial file system environment (containers with it's own roots, gained via 
> chroot, can't handle this situation).
> So, it means, that there have to be some daemon (kernel or user space), which 
> will handle such requests from different NFS server instances... Which in turn 
> means, that some way of communication between this daemon and NFS servers is 
> required. And unix (any of them) sockets doesn't suits here, which makes this 
> problem more difficult.
> 

This is a truly ugly problem, and unfortunately parts of the nfsd
codebase are very old and crusty. We've got a lot of cleanup work ahead
of us no matter what design we settle on.

This is really a lot bigger than the grace period. I think we ought to
step back a bit and consider this more "holistically" first. Do you
have a pointer to an overall design document or something?

One thing that puzzles me at the moment. We have two namespaces to deal
with -- the network and the mount namespace. With nfs client code,
everything is keyed off of the net namespace. That's not really the
case here since we have to deal with a local fs tree as well.

When an nfsd running in a container receives an RPC, how does it
determine what mount namespace it should do its operations in?

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 13:39       ` Jeff Layton
@ 2012-04-10 14:52         ` Stanislav Kinsbursky
  2012-04-10 18:45           ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-10 14:52 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

10.04.2012 17:39, Jeff Layton пишет:
> On Tue, 10 Apr 2012 16:46:38 +0400
> Stanislav Kinsbursky<skinsbursky@parallels.com>  wrote:
>
>> 10.04.2012 16:16, Jeff Layton пишет:
>>> On Tue, 10 Apr 2012 15:44:42 +0400
>>>
>>> (sorry about the earlier truncated reply, my MUA has a mind of its own
>>> this morning)
>>>
>>
>> OK then. Previous letter confused me a bit.
>>
>>>
>>> TBH, I haven't considered that in depth. That is a valid situation, but
>>> one that's discouraged. It's very difficult (and expensive) to
>>> sequester off portions of a filesystem for serving.
>>>
>>> A filehandle is somewhat analogous to a device/inode combination. When
>>> the server gets a filehandle, it has to determine "is this within a
>>> path that's exported to this host"? That process is called subtree
>>> checking. It's expensive and difficult to handle. It's always better to
>>> export along filesystem boundaries.
>>>
>>> My suggestion would be to simply not deal with those cases in this
>>> patch. Possibly we could force no_subtree_check when we export an fs
>>> with a locks_in_grace option defined.
>>>
>>
>> Sorry, but without dealing with those cases your patch looks a bit... Useless.
>> I.e. it changes nothing, it there will be no support from file systems, going to
>> be exported.
>> But how are you going to push developers to implement these calls? Or, even if
>> you'll try to implement them by yourself, how they will looks like?
>> Simple check only for superblock looks bad to me, because any other start of
>> NFSd will lead to grace period for all other containers (which uses the same
>> filesystem).
>>
>
> Changing nothing was sort of the point. The idea was to allow
> filesystems to override this if they choose. The main impetus here was
> to allow clustered filesystems to handle this in a different fashion to
> allow them to do active/active serving from multiple nodes. I wasn't
> considering the container use-case when I spun this up last week...
>

Sorry, I didn't notice, that this patch was sent a week ago (thought, that you 
wrote it yesterday).

> Now that said, we probably can accommodate containers with this too.
> Perhaps we could consider passing in a sb+namespace tuple eventually?
>

We can, of course. But it looks like the problem with different NFSd on the same 
file system won't be solved.

>>>> Also, don't we need to prevent of exporting the same file system parts but
>>>> different servers always, but not only for grace period?
>>>>
>>>
>>> I'm not sure I understand what you're asking here. Were you referring
>>> to my suggestion earlier of not allowing the export of the same
>>> filesystem from more than one container? If so, then yes that would
>>> apply before and after the grace period ends.
>>>
>>
>> I was talking about preventing of exporting intersecting directories by
>> different server.
>> IOW, exporting of the same file system by different NFS server is allowed, but
>> only if their exporting directories doesn't intersect.
>
> Doesn't that require that the containers are aware of each other to
> some degree? Or are you considering doing this in the kernel?
>
> If the latter, then there's another problem. The export table is kept
> in userspace (in mountd) and the kernel only upcalls for it as needed.
>
> You'll need to change that overall design if you want the kernel to do
> this enforcement.
>

Hmm, I see...
Yes, I was thinking about doing it in kernel.
In theory (I'm just thinking and writing simultaneously - this is not a solid 
idea) this could be a kernel thread (this gives desired fs access). And most 
probably this thread have to be launched on nfsd module insertion.
There should be some way to add a job for it on NFSd start and a way to wait for 
the job to be done. This is the easy part.
But I forgot about cross mounts...

>> This check is expensive (as you mentioned), but have to be done only once on NFS
>> server start.
>
> Well, no. The subtree check happens every time nfsd processes a
> filehandle -- see nfsd_acceptable().
>
> Basically we have to turn the filehandle into a dentry and then walk
> back up to the directory that's exported to verify that it is within
> the correct subtree. If that fails, then we might have to do it more
> than once if it's a hardlinked file.
>

Wait. Looks like I'm missing something.
This subtree check has nothing with my proposal (if I'm not mistaken).
This option and it's logic remains the same.
My proposal was to check directories, desired to be exported, on NFS server 
start. And if any of passed exports intersects with any of exports, already 
shared by another NFSd - then shutdown NFSd and print error message.
Am I missing the point here?

>> With this solution, grace period can simple, and no support from exporting file
>> system is required.
>> But the main problem here is that such intersections can be checked only in
>> initial file system environment (containers with it's own roots, gained via
>> chroot, can't handle this situation).
>> So, it means, that there have to be some daemon (kernel or user space), which
>> will handle such requests from different NFS server instances... Which in turn
>> means, that some way of communication between this daemon and NFS servers is
>> required. And unix (any of them) sockets doesn't suits here, which makes this
>> problem more difficult.
>>
>
> This is a truly ugly problem, and unfortunately parts of the nfsd
> codebase are very old and crusty. We've got a lot of cleanup work ahead
> of us no matter what design we settle on.
>
> This is really a lot bigger than the grace period. I think we ought to
> step back a bit and consider this more "holistically" first. Do you
> have a pointer to an overall design document or something?
>

What exactly you are asking about? Overall design of containerization?

> One thing that puzzles me at the moment. We have two namespaces to deal
> with -- the network and the mount namespace. With nfs client code,
> everything is keyed off of the net namespace. That's not really the
> case here since we have to deal with a local fs tree as well.
>
> When an nfsd running in a container receives an RPC, how does it
> determine what mount namespace it should do its operations in?
>

We don't use mount namespaces, so that's why I wasn't thinking about it...
But if we have 2 types of namespaces, then we have to tie  mount namesapce to 
network. I.e we can get desired mount namespace from per-net NFSd data.

But, please, don't ask me, what will be, if two or more NFS servers shares the 
same mount namespace... Looks like this case should be forbidden.

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 14:52         ` Stanislav Kinsbursky
@ 2012-04-10 18:45           ` Jeff Layton
  2012-04-11 10:09             ` Stanislav Kinsbursky
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2012-04-10 18:45 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: linux-nfs

On Tue, 10 Apr 2012 18:52:44 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 10.04.2012 17:39, Jeff Layton пишет:
> > On Tue, 10 Apr 2012 16:46:38 +0400
> > Stanislav Kinsbursky<skinsbursky@parallels.com>  wrote:
> >
> >> 10.04.2012 16:16, Jeff Layton пишет:
> >>> On Tue, 10 Apr 2012 15:44:42 +0400
> >>>
> >>> (sorry about the earlier truncated reply, my MUA has a mind of its own
> >>> this morning)
> >>>
> >>
> >> OK then. Previous letter confused me a bit.
> >>
> >>>
> >>> TBH, I haven't considered that in depth. That is a valid situation, but
> >>> one that's discouraged. It's very difficult (and expensive) to
> >>> sequester off portions of a filesystem for serving.
> >>>
> >>> A filehandle is somewhat analogous to a device/inode combination. When
> >>> the server gets a filehandle, it has to determine "is this within a
> >>> path that's exported to this host"? That process is called subtree
> >>> checking. It's expensive and difficult to handle. It's always better to
> >>> export along filesystem boundaries.
> >>>
> >>> My suggestion would be to simply not deal with those cases in this
> >>> patch. Possibly we could force no_subtree_check when we export an fs
> >>> with a locks_in_grace option defined.
> >>>
> >>
> >> Sorry, but without dealing with those cases your patch looks a bit... Useless.
> >> I.e. it changes nothing, it there will be no support from file systems, going to
> >> be exported.
> >> But how are you going to push developers to implement these calls? Or, even if
> >> you'll try to implement them by yourself, how they will looks like?
> >> Simple check only for superblock looks bad to me, because any other start of
> >> NFSd will lead to grace period for all other containers (which uses the same
> >> filesystem).
> >>
> >
> > Changing nothing was sort of the point. The idea was to allow
> > filesystems to override this if they choose. The main impetus here was
> > to allow clustered filesystems to handle this in a different fashion to
> > allow them to do active/active serving from multiple nodes. I wasn't
> > considering the container use-case when I spun this up last week...
> >
> 
> Sorry, I didn't notice, that this patch was sent a week ago (thought, that you 
> wrote it yesterday).
> 
> > Now that said, we probably can accommodate containers with this too.
> > Perhaps we could consider passing in a sb+namespace tuple eventually?
> >
> 
> We can, of course. But it looks like the problem with different NFSd on the same 
> file system won't be solved.
> 

Probably not. I think the only way to solve that is to coordinate grace
periods for filesystems exported from multiple containers.

What may be a lot easier initially is to only allow a fs to be exported
from one container. You could always lift that restriction later if you
come up with a way to handle it safely.

We will probably need to re-think the current design of mountd and
exportfs in order to enforce that however.

> >>>> Also, don't we need to prevent of exporting the same file system parts but
> >>>> different servers always, but not only for grace period?
> >>>>
> >>>
> >>> I'm not sure I understand what you're asking here. Were you referring
> >>> to my suggestion earlier of not allowing the export of the same
> >>> filesystem from more than one container? If so, then yes that would
> >>> apply before and after the grace period ends.
> >>>
> >>
> >> I was talking about preventing of exporting intersecting directories by
> >> different server.
> >> IOW, exporting of the same file system by different NFS server is allowed, but
> >> only if their exporting directories doesn't intersect.
> >
> > Doesn't that require that the containers are aware of each other to
> > some degree? Or are you considering doing this in the kernel?
> >
> > If the latter, then there's another problem. The export table is kept
> > in userspace (in mountd) and the kernel only upcalls for it as needed.
> >
> > You'll need to change that overall design if you want the kernel to do
> > this enforcement.
> >
> 
> Hmm, I see...
> Yes, I was thinking about doing it in kernel.
> In theory (I'm just thinking and writing simultaneously - this is not a solid 
> idea) this could be a kernel thread (this gives desired fs access). And most 
> probably this thread have to be launched on nfsd module insertion.
> There should be some way to add a job for it on NFSd start and a way to wait for 
> the job to be done. This is the easy part.
> But I forgot about cross mounts...
>
> >> This check is expensive (as you mentioned), but have to be done only once on NFS
> >> server start.
> >
> > Well, no. The subtree check happens every time nfsd processes a
> > filehandle -- see nfsd_acceptable().
> >
> > Basically we have to turn the filehandle into a dentry and then walk
> > back up to the directory that's exported to verify that it is within
> > the correct subtree. If that fails, then we might have to do it more
> > than once if it's a hardlinked file.
> >
> 
> Wait. Looks like I'm missing something.
> This subtree check has nothing with my proposal (if I'm not mistaken).
> This option and it's logic remains the same.
> My proposal was to check directories, desired to be exported, on NFS server 
> start. And if any of passed exports intersects with any of exports, already 
> shared by another NFSd - then shutdown NFSd and print error message.
> Am I missing the point here?
> 

Sorry I got confused with the discussion. You will need to do
something similar to what subtree checking does in order to handle
your proposal however.

> >> With this solution, grace period can simple, and no support from exporting file
> >> system is required.
> >> But the main problem here is that such intersections can be checked only in
> >> initial file system environment (containers with it's own roots, gained via
> >> chroot, can't handle this situation).
> >> So, it means, that there have to be some daemon (kernel or user space), which
> >> will handle such requests from different NFS server instances... Which in turn
> >> means, that some way of communication between this daemon and NFS servers is
> >> required. And unix (any of them) sockets doesn't suits here, which makes this
> >> problem more difficult.
> >>
> >
> > This is a truly ugly problem, and unfortunately parts of the nfsd
> > codebase are very old and crusty. We've got a lot of cleanup work ahead
> > of us no matter what design we settle on.
> >
> > This is really a lot bigger than the grace period. I think we ought to
> > step back a bit and consider this more "holistically" first. Do you
> > have a pointer to an overall design document or something?
> >
> 
> What exactly you are asking about? Overall design of containerization?
> 

I meant containerization of nfsd in particular.

> > One thing that puzzles me at the moment. We have two namespaces to deal
> > with -- the network and the mount namespace. With nfs client code,
> > everything is keyed off of the net namespace. That's not really the
> > case here since we have to deal with a local fs tree as well.
> >
> > When an nfsd running in a container receives an RPC, how does it
> > determine what mount namespace it should do its operations in?
> >
> 
> We don't use mount namespaces, so that's why I wasn't thinking about it...
> But if we have 2 types of namespaces, then we have to tie  mount namesapce to 
> network. I.e we can get desired mount namespace from per-net NFSd data.
> 

One thing that Bruce mentioned to me privately is that we could plan to
use whatever mount namespace mountd is using within a particular net
namespace. That makes some sense since mountd is the final arbiter of
who gets access to what.

> But, please, don't ask me, what will be, if two or more NFS servers shares the 
> same mount namespace... Looks like this case should be forbidden.
> 

I'm not sure we need to forbid sharing the mount namespace. They might
be exporting completely different filesystems after all, in which case
we'd be forbidding it for no good reason.

Note that it is quite easy to get lost in the weeds with this. I've been
struggling to get a working design for a clustered nfsv4 server for the
last several months and have had some time to wrestle with these
issues. It's anything but trivial.

What you may need to do in order to make progress is to start with some
valid use-cases for this stuff, and get those working while disallowing
or ignoring other use cases. We'll never get anywhere if we try to solve
all of these problems at once...

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 12:46     ` Stanislav Kinsbursky
  2012-04-10 13:39       ` Jeff Layton
@ 2012-04-10 20:22       ` J. Bruce Fields
  2012-04-11 10:34         ` Stanislav Kinsbursky
  1 sibling, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-10 20:22 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Jeff Layton, linux-nfs

On Tue, Apr 10, 2012 at 04:46:38PM +0400, Stanislav Kinsbursky wrote:
> 10.04.2012 16:16, Jeff Layton пишет:
> >On Tue, 10 Apr 2012 15:44:42 +0400
> >
> >(sorry about the earlier truncated reply, my MUA has a mind of its own
> >this morning)
> >
> 
> OK then. Previous letter confused me a bit.
> 
> >
> >TBH, I haven't considered that in depth. That is a valid situation, but
> >one that's discouraged. It's very difficult (and expensive) to
> >sequester off portions of a filesystem for serving.
> >
> >A filehandle is somewhat analogous to a device/inode combination. When
> >the server gets a filehandle, it has to determine "is this within a
> >path that's exported to this host"? That process is called subtree
> >checking. It's expensive and difficult to handle. It's always better to
> >export along filesystem boundaries.
> >
> >My suggestion would be to simply not deal with those cases in this
> >patch. Possibly we could force no_subtree_check when we export an fs
> >with a locks_in_grace option defined.
> >
> 
> Sorry, but without dealing with those cases your patch looks a bit... Useless.
> I.e. it changes nothing, it there will be no support from file
> systems, going to be exported.
> But how are you going to push developers to implement these calls?
> Or, even if you'll try to implement them by yourself, how they will
> looks like?
> Simple check only for superblock looks bad to me, because any other
> start of NFSd will lead to grace period for all other containers
> (which uses the same filesystem).

That's the correct behavior, and it sounds simple to implement.  Let's
just do that.

If somebody doesn't like the grace period from another container
intruding on their use of the same filesystem, they should either
arrange to export different filesystems (not just different subtrees)
from their containers, or arrange to start all their containers at the
same time so their grace periods overlap.

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 18:45           ` Jeff Layton
@ 2012-04-11 10:09             ` Stanislav Kinsbursky
  2012-04-11 11:48               ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 10:09 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

10.04.2012 22:45, Jeff Layton пишет:
>>>> This check is expensive (as you mentioned), but have to be done only once on NFS
>>>> server start.
>>>
>>> Well, no. The subtree check happens every time nfsd processes a
>>> filehandle -- see nfsd_acceptable().
>>>
>>> Basically we have to turn the filehandle into a dentry and then walk
>>> back up to the directory that's exported to verify that it is within
>>> the correct subtree. If that fails, then we might have to do it more
>>> than once if it's a hardlinked file.
>>>
>>
>> Wait. Looks like I'm missing something.
>> This subtree check has nothing with my proposal (if I'm not mistaken).
>> This option and it's logic remains the same.
>> My proposal was to check directories, desired to be exported, on NFS server
>> start. And if any of passed exports intersects with any of exports, already
>> shared by another NFSd - then shutdown NFSd and print error message.
>> Am I missing the point here?
>>
>
> Sorry I got confused with the discussion. You will need to do
> something similar to what subtree checking does in order to handle
> your proposal however.
>

Agreed. But this check should be performed only once on NFS server start (not 
every fh lookup.

>>>> With this solution, grace period can simple, and no support from exporting file
>>>> system is required.
>>>> But the main problem here is that such intersections can be checked only in
>>>> initial file system environment (containers with it's own roots, gained via
>>>> chroot, can't handle this situation).
>>>> So, it means, that there have to be some daemon (kernel or user space), which
>>>> will handle such requests from different NFS server instances... Which in turn
>>>> means, that some way of communication between this daemon and NFS servers is
>>>> required. And unix (any of them) sockets doesn't suits here, which makes this
>>>> problem more difficult.
>>>>
>>>
>>> This is a truly ugly problem, and unfortunately parts of the nfsd
>>> codebase are very old and crusty. We've got a lot of cleanup work ahead
>>> of us no matter what design we settle on.
>>>
>>> This is really a lot bigger than the grace period. I think we ought to
>>> step back a bit and consider this more "holistically" first. Do you
>>> have a pointer to an overall design document or something?
>>>
>>
>> What exactly you are asking about? Overall design of containerization?
>>
>
> I meant containerization of nfsd in particular.
>

If you are asking about some kind of white paper, then I don't have it.
But here are main visible targets:
1) Move all network-related resources to per-net data (caches, grace period, 
lockd calls, transports, your tracking engine).
2) make nfsd filesystem superblock per network namespace.
3) service itself will be controlled like Lockd done (one pool for all, per-net 
resources allocated on service start).

>>> One thing that puzzles me at the moment. We have two namespaces to deal
>>> with -- the network and the mount namespace. With nfs client code,
>>> everything is keyed off of the net namespace. That's not really the
>>> case here since we have to deal with a local fs tree as well.
>>>
>>> When an nfsd running in a container receives an RPC, how does it
>>> determine what mount namespace it should do its operations in?
>>>
>>
>> We don't use mount namespaces, so that's why I wasn't thinking about it...
>> But if we have 2 types of namespaces, then we have to tie  mount namesapce to
>> network. I.e we can get desired mount namespace from per-net NFSd data.
>>
>
> One thing that Bruce mentioned to me privately is that we could plan to
> use whatever mount namespace mountd is using within a particular net
> namespace. That makes some sense since mountd is the final arbiter of
> who gets access to what.
>

Could you, please, give some examples? I don't get the idea.

>> But, please, don't ask me, what will be, if two or more NFS servers shares the
>> same mount namespace... Looks like this case should be forbidden.
>>
>
> I'm not sure we need to forbid sharing the mount namespace. They might
> be exporting completely different filesystems after all, in which case
> we'd be forbidding it for no good reason.
>

Actually, if we will make file system responsible for grace period control, then 
yes, no reason for forbidding of shared mount namespace.

> Note that it is quite easy to get lost in the weeds with this. I've been
> struggling to get a working design for a clustered nfsv4 server for the
> last several months and have had some time to wrestle with these
> issues. It's anything but trivial.
>
> What you may need to do in order to make progress is to start with some
> valid use-cases for this stuff, and get those working while disallowing
> or ignoring other use cases. We'll never get anywhere if we try to solve
> all of these problems at once...
>

Agreed.
So, my current understanding of the situation can be summarized as follows:

1) The idea of making grace period (and int internals) per networks namespace 
stays the same. But it's implementation affect only current "generic grace 
period" code.

2) Your idea of making grace period per file system looks reasonable. And maybe 
this approach (using of filesystem's export operations if available) have to be 
used by default.
But I suggest to add new option to exports (say, "no_fs_grace"), which will 
disable this new functionality. With this option system administrator becomes 
responsible for any problems with shared file system.

Any objections?

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-10 20:22       ` J. Bruce Fields
@ 2012-04-11 10:34         ` Stanislav Kinsbursky
  2012-04-11 17:20           ` J. Bruce Fields
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 10:34 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

11.04.2012 00:22, J. Bruce Fields пишет:
> On Tue, Apr 10, 2012 at 04:46:38PM +0400, Stanislav Kinsbursky wrote:
>> 10.04.2012 16:16, Jeff Layton пишет:
>>> On Tue, 10 Apr 2012 15:44:42 +0400
>>>
>>> (sorry about the earlier truncated reply, my MUA has a mind of its own
>>> this morning)
>>>
>>
>> OK then. Previous letter confused me a bit.
>>
>>>
>>> TBH, I haven't considered that in depth. That is a valid situation, but
>>> one that's discouraged. It's very difficult (and expensive) to
>>> sequester off portions of a filesystem for serving.
>>>
>>> A filehandle is somewhat analogous to a device/inode combination. When
>>> the server gets a filehandle, it has to determine "is this within a
>>> path that's exported to this host"? That process is called subtree
>>> checking. It's expensive and difficult to handle. It's always better to
>>> export along filesystem boundaries.
>>>
>>> My suggestion would be to simply not deal with those cases in this
>>> patch. Possibly we could force no_subtree_check when we export an fs
>>> with a locks_in_grace option defined.
>>>
>>
>> Sorry, but without dealing with those cases your patch looks a bit... Useless.
>> I.e. it changes nothing, it there will be no support from file
>> systems, going to be exported.
>> But how are you going to push developers to implement these calls?
>> Or, even if you'll try to implement them by yourself, how they will
>> looks like?
>> Simple check only for superblock looks bad to me, because any other
>> start of NFSd will lead to grace period for all other containers
>> (which uses the same filesystem).
>
> That's the correct behavior, and it sounds simple to implement.  Let's
> just do that.
>
> If somebody doesn't like the grace period from another container
> intruding on their use of the same filesystem, they should either
> arrange to export different filesystems (not just different subtrees)
> from their containers, or arrange to start all their containers at the
> same time so their grace periods overlap.
>

Starting all at once is not a very good solution.
When you start 100 containers simultaneously - then you can't predict, when the 
process as a whole will succeed (it will produce heavy load on all subsystems). 
Moreover, there is also  server restart...
Anyway, I agree with the idea of this patch.

Please, have a look at new export option I mentioned in "Grace period" thread.

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 10:09             ` Stanislav Kinsbursky
@ 2012-04-11 11:48               ` Jeff Layton
  2012-04-11 13:08                 ` Stanislav Kinsbursky
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2012-04-11 11:48 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: linux-nfs

On Wed, 11 Apr 2012 14:09:40 +0400
Stanislav Kinsbursky <skinsbursky@parallels.com> wrote:

> 10.04.2012 22:45, Jeff Layton пишет:
> >>>> This check is expensive (as you mentioned), but have to be done only once on NFS
> >>>> server start.
> >>>
> >>> Well, no. The subtree check happens every time nfsd processes a
> >>> filehandle -- see nfsd_acceptable().
> >>>
> >>> Basically we have to turn the filehandle into a dentry and then walk
> >>> back up to the directory that's exported to verify that it is within
> >>> the correct subtree. If that fails, then we might have to do it more
> >>> than once if it's a hardlinked file.
> >>>
> >>
> >> Wait. Looks like I'm missing something.
> >> This subtree check has nothing with my proposal (if I'm not mistaken).
> >> This option and it's logic remains the same.
> >> My proposal was to check directories, desired to be exported, on NFS server
> >> start. And if any of passed exports intersects with any of exports, already
> >> shared by another NFSd - then shutdown NFSd and print error message.
> >> Am I missing the point here?
> >>
> >
> > Sorry I got confused with the discussion. You will need to do
> > something similar to what subtree checking does in order to handle
> > your proposal however.
> >
> 
> Agreed. But this check should be performed only once on NFS server start (not 
> every fh lookup.
> 
> >>>> With this solution, grace period can simple, and no support from exporting file
> >>>> system is required.
> >>>> But the main problem here is that such intersections can be checked only in
> >>>> initial file system environment (containers with it's own roots, gained via
> >>>> chroot, can't handle this situation).
> >>>> So, it means, that there have to be some daemon (kernel or user space), which
> >>>> will handle such requests from different NFS server instances... Which in turn
> >>>> means, that some way of communication between this daemon and NFS servers is
> >>>> required. And unix (any of them) sockets doesn't suits here, which makes this
> >>>> problem more difficult.
> >>>>
> >>>
> >>> This is a truly ugly problem, and unfortunately parts of the nfsd
> >>> codebase are very old and crusty. We've got a lot of cleanup work ahead
> >>> of us no matter what design we settle on.
> >>>
> >>> This is really a lot bigger than the grace period. I think we ought to
> >>> step back a bit and consider this more "holistically" first. Do you
> >>> have a pointer to an overall design document or something?
> >>>
> >>
> >> What exactly you are asking about? Overall design of containerization?
> >>
> >
> > I meant containerization of nfsd in particular.
> >
> 
> If you are asking about some kind of white paper, then I don't have it.
> But here are main visible targets:
> 1) Move all network-related resources to per-net data (caches, grace period, 
> lockd calls, transports, your tracking engine).
> 2) make nfsd filesystem superblock per network namespace.
> 3) service itself will be controlled like Lockd done (one pool for all, per-net 
> resources allocated on service start).
> 
> >>> One thing that puzzles me at the moment. We have two namespaces to deal
> >>> with -- the network and the mount namespace. With nfs client code,
> >>> everything is keyed off of the net namespace. That's not really the
> >>> case here since we have to deal with a local fs tree as well.
> >>>
> >>> When an nfsd running in a container receives an RPC, how does it
> >>> determine what mount namespace it should do its operations in?
> >>>
> >>
> >> We don't use mount namespaces, so that's why I wasn't thinking about it...
> >> But if we have 2 types of namespaces, then we have to tie  mount namesapce to
> >> network. I.e we can get desired mount namespace from per-net NFSd data.
> >>
> >
> > One thing that Bruce mentioned to me privately is that we could plan to
> > use whatever mount namespace mountd is using within a particular net
> > namespace. That makes some sense since mountd is the final arbiter of
> > who gets access to what.
> >
> 
> Could you, please, give some examples? I don't get the idea.
> 

When nfsd gets an RPC call, it needs to decide in what mount namespace
to do the fs operations. How do we decide this?

Bruce's thought was to look at what mount namespace rpc.mountd is using
and use that, but now that I consider it, it's a bit of a chicken and
egg problem really... nfsd talks to mountd via files in /proc/net/rpc/.
In order to talk to the right mountd, might you need to know what mount
namespace it's operating in?

A simpler method might be to take a reference to whatever mount
namespace rpc.nfsd has when it starts knfsd and keep that reference
inside of the nfsd_net struct. When a call comes in to a particular
nfsd "instance" you can just use that mount namespace.

> >> But, please, don't ask me, what will be, if two or more NFS servers shares the
> >> same mount namespace... Looks like this case should be forbidden.
> >>
> >
> > I'm not sure we need to forbid sharing the mount namespace. They might
> > be exporting completely different filesystems after all, in which case
> > we'd be forbidding it for no good reason.
> >
> 
> Actually, if we will make file system responsible for grace period control, then 
> yes, no reason for forbidding of shared mount namespace.
> 
> > Note that it is quite easy to get lost in the weeds with this. I've been
> > struggling to get a working design for a clustered nfsv4 server for the
> > last several months and have had some time to wrestle with these
> > issues. It's anything but trivial.
> >
> > What you may need to do in order to make progress is to start with some
> > valid use-cases for this stuff, and get those working while disallowing
> > or ignoring other use cases. We'll never get anywhere if we try to solve
> > all of these problems at once...
> >
> 
> Agreed.
> So, my current understanding of the situation can be summarized as follows:
> 
> 1) The idea of making grace period (and int internals) per networks namespace 
> stays the same. But it's implementation affect only current "generic grace 
> period" code.
> 

Yes, that's where you should focus your efforts for now. As I said, we
don't have any alternate grace period handling schemes yet, but we will
eventually need one to handle clustered filesystems and possibly the
case of serving the same local fs from multiple namespaces.

> 2) Your idea of making grace period per file system looks reasonable. And maybe 
> this approach (using of filesystem's export operations if available) have to be 
> used by default.
> But I suggest to add new option to exports (say, "no_fs_grace"), which will 
> disable this new functionality. With this option system administrator becomes 
> responsible for any problems with shared file system.
> 

Something like that may be a reasonable hack initially but we need to
ensure that we can deal with this properly later. I think we're going
to end up with "pluggable" grace period handling at some point, so it
may be more future proof to do something like "grace=simple" or
something instead of no_fs_grace. Still...

This is a complex enough problem that I think it behooves us to
consider it very carefully and come up with a clear design before we
code anything. We need to ensure that whatever we do doesn't end up
hamstringing other use cases later...

We have 3 cases that I can see that we're interested in initially.
There is some overlap between them however:

1) simple case of a filesystem being exported from a single namespace.
This covers non-containerized nfsd and containerized nfsd's that are
serving different filesystems.

2) a containerized nfsd that serves the same filesystem from multiple
namespaces.

3) a cluster serving the same filesystem from multiple namespaces. In
this case, the namespaces are also potentially spread across multiple
nodes as well.

There's a lot of overlap between #2 and #3 here.
-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 11:48               ` Jeff Layton
@ 2012-04-11 13:08                 ` Stanislav Kinsbursky
  2012-04-11 17:19                   ` J. Bruce Fields
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 13:08 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

11.04.2012 15:48, Jeff Layton пишет:
>>>>> One thing that puzzles me at the moment. We have two namespaces to deal
>>>>> with -- the network and the mount namespace. With nfs client code,
>>>>> everything is keyed off of the net namespace. That's not really the
>>>>> case here since we have to deal with a local fs tree as well.
>>>>>
>>>>> When an nfsd running in a container receives an RPC, how does it
>>>>> determine what mount namespace it should do its operations in?
>>>>>
>>>>
>>>> We don't use mount namespaces, so that's why I wasn't thinking about it...
>>>> But if we have 2 types of namespaces, then we have to tie  mount namesapce to
>>>> network. I.e we can get desired mount namespace from per-net NFSd data.
>>>>
>>>
>>> One thing that Bruce mentioned to me privately is that we could plan to
>>> use whatever mount namespace mountd is using within a particular net
>>> namespace. That makes some sense since mountd is the final arbiter of
>>> who gets access to what.
>>>
>>
>> Could you, please, give some examples? I don't get the idea.
>>
>
> When nfsd gets an RPC call, it needs to decide in what mount namespace
> to do the fs operations. How do we decide this?
>
> Bruce's thought was to look at what mount namespace rpc.mountd is using
> and use that, but now that I consider it, it's a bit of a chicken and
> egg problem really... nfsd talks to mountd via files in /proc/net/rpc/.
> In order to talk to the right mountd, might you need to know what mount
> namespace it's operating in?
>

Not really... /proc itself depens on pid namespace. /proc/net depends on current 
(!) network namespace. So we can't just lookup for this dentry.

But, in spite of nfsd works in initial (init_net and friends) environment, we 
can get network namespace from RPC request. Having this, we can easily get 
desired proc entry (proc_net_rpc in sunrpc_net). So it looks like we can 
actually don't care about mount namespaces - we have our own back door.
If I'm not mistaken, of course...

> A simpler method might be to take a reference to whatever mount
> namespace rpc.nfsd has when it starts knfsd and keep that reference
> inside of the nfsd_net struct. When a call comes in to a particular
> nfsd "instance" you can just use that mount namespace.
>

This means that we tie mount namespace to network. Even worse - network 
namespace holds mount namespace. Currently, I can't see any problems. But I 
can't even imagine, how many pitfalls can (and, most probably, will) be found in 
future.
I think, we should try to avoid explicit cross-namespaces dependencies...

>>> Note that it is quite easy to get lost in the weeds with this. I've been
>>> struggling to get a working design for a clustered nfsv4 server for the
>>> last several months and have had some time to wrestle with these
>>> issues. It's anything but trivial.
>>>
>>> What you may need to do in order to make progress is to start with some
>>> valid use-cases for this stuff, and get those working while disallowing
>>> or ignoring other use cases. We'll never get anywhere if we try to solve
>>> all of these problems at once...
>>>
>>
>> Agreed.
>> So, my current understanding of the situation can be summarized as follows:
>>
>> 1) The idea of making grace period (and int internals) per networks namespace
>> stays the same. But it's implementation affect only current "generic grace
>> period" code.
>>
>
> Yes, that's where you should focus your efforts for now. As I said, we
> don't have any alternate grace period handling schemes yet, but we will
> eventually need one to handle clustered filesystems and possibly the
> case of serving the same local fs from multiple namespaces.
>

Ok.

>> 2) Your idea of making grace period per file system looks reasonable. And maybe
>> this approach (using of filesystem's export operations if available) have to be
>> used by default.
>> But I suggest to add new option to exports (say, "no_fs_grace"), which will
>> disable this new functionality. With this option system administrator becomes
>> responsible for any problems with shared file system.
>>
>
> Something like that may be a reasonable hack initially but we need to
> ensure that we can deal with this properly later. I think we're going
> to end up with "pluggable" grace period handling at some point, so it
> may be more future proof to do something like "grace=simple" or
> something instead of no_fs_grace. Still...
>
> This is a complex enough problem that I think it behooves us to
> consider it very carefully and come up with a clear design before we
> code anything. We need to ensure that whatever we do doesn't end up
> hamstringing other use cases later...
>
> We have 3 cases that I can see that we're interested in initially.
> There is some overlap between them however:
>
> 1) simple case of a filesystem being exported from a single namespace.
> This covers non-containerized nfsd and containerized nfsd's that are
> serving different filesystems.
>
> 2) a containerized nfsd that serves the same filesystem from multiple
> namespaces.
>
> 3) a cluster serving the same filesystem from multiple namespaces. In
> this case, the namespaces are also potentially spread across multiple
> nodes as well.
>
> There's a lot of overlap between #2 and #3 here.

Yep, sure. I have nothing to add or object here.

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 13:08                 ` Stanislav Kinsbursky
@ 2012-04-11 17:19                   ` J. Bruce Fields
  2012-04-11 17:37                     ` Stanislav Kinsbursky
  0 siblings, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-11 17:19 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Jeff Layton, linux-nfs

On Wed, Apr 11, 2012 at 05:08:46PM +0400, Stanislav Kinsbursky wrote:
> 11.04.2012 15:48, Jeff Layton пишет:
> >>>>>One thing that puzzles me at the moment. We have two namespaces to deal
> >>>>>with -- the network and the mount namespace. With nfs client code,
> >>>>>everything is keyed off of the net namespace. That's not really the
> >>>>>case here since we have to deal with a local fs tree as well.
> >>>>>
> >>>>>When an nfsd running in a container receives an RPC, how does it
> >>>>>determine what mount namespace it should do its operations in?
> >>>>>
> >>>>
> >>>>We don't use mount namespaces, so that's why I wasn't thinking about it...
> >>>>But if we have 2 types of namespaces, then we have to tie  mount namesapce to
> >>>>network. I.e we can get desired mount namespace from per-net NFSd data.
> >>>>
> >>>
> >>>One thing that Bruce mentioned to me privately is that we could plan to
> >>>use whatever mount namespace mountd is using within a particular net
> >>>namespace. That makes some sense since mountd is the final arbiter of
> >>>who gets access to what.
> >>>
> >>
> >>Could you, please, give some examples? I don't get the idea.
> >>
> >
> >When nfsd gets an RPC call, it needs to decide in what mount namespace
> >to do the fs operations. How do we decide this?
> >
> >Bruce's thought was to look at what mount namespace rpc.mountd is using
> >and use that, but now that I consider it, it's a bit of a chicken and
> >egg problem really... nfsd talks to mountd via files in /proc/net/rpc/.
> >In order to talk to the right mountd, might you need to know what mount
> >namespace it's operating in?
> >
> 
> Not really... /proc itself depens on pid namespace. /proc/net
> depends on current (!) network namespace. So we can't just lookup
> for this dentry.
> 
> But, in spite of nfsd works in initial (init_net and friends)
> environment, we can get network namespace from RPC request. Having
> this, we can easily get desired proc entry (proc_net_rpc in
> sunrpc_net). So it looks like we can actually don't care about mount
> namespaces - we have our own back door.

OK, good, that's what I was hoping for.  Then we call up to whatever
mountd is running in our network namespace, and for path lookups it's
whatever fs namespace that mountd is running in that's going to matter.

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 10:34         ` Stanislav Kinsbursky
@ 2012-04-11 17:20           ` J. Bruce Fields
  2012-04-11 17:33             ` Stanislav Kinsbursky
  0 siblings, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-11 17:20 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Jeff Layton, linux-nfs

On Wed, Apr 11, 2012 at 02:34:37PM +0400, Stanislav Kinsbursky wrote:
> 11.04.2012 00:22, J. Bruce Fields пишет:
> >On Tue, Apr 10, 2012 at 04:46:38PM +0400, Stanislav Kinsbursky wrote:
> >>10.04.2012 16:16, Jeff Layton пишет:
> >>>On Tue, 10 Apr 2012 15:44:42 +0400
> >>>
> >>>(sorry about the earlier truncated reply, my MUA has a mind of its own
> >>>this morning)
> >>>
> >>
> >>OK then. Previous letter confused me a bit.
> >>
> >>>
> >>>TBH, I haven't considered that in depth. That is a valid situation, but
> >>>one that's discouraged. It's very difficult (and expensive) to
> >>>sequester off portions of a filesystem for serving.
> >>>
> >>>A filehandle is somewhat analogous to a device/inode combination. When
> >>>the server gets a filehandle, it has to determine "is this within a
> >>>path that's exported to this host"? That process is called subtree
> >>>checking. It's expensive and difficult to handle. It's always better to
> >>>export along filesystem boundaries.
> >>>
> >>>My suggestion would be to simply not deal with those cases in this
> >>>patch. Possibly we could force no_subtree_check when we export an fs
> >>>with a locks_in_grace option defined.
> >>>
> >>
> >>Sorry, but without dealing with those cases your patch looks a bit... Useless.
> >>I.e. it changes nothing, it there will be no support from file
> >>systems, going to be exported.
> >>But how are you going to push developers to implement these calls?
> >>Or, even if you'll try to implement them by yourself, how they will
> >>looks like?
> >>Simple check only for superblock looks bad to me, because any other
> >>start of NFSd will lead to grace period for all other containers
> >>(which uses the same filesystem).
> >
> >That's the correct behavior, and it sounds simple to implement.  Let's
> >just do that.
> >
> >If somebody doesn't like the grace period from another container
> >intruding on their use of the same filesystem, they should either
> >arrange to export different filesystems (not just different subtrees)
> >from their containers, or arrange to start all their containers at the
> >same time so their grace periods overlap.
> >
> 
> Starting all at once is not a very good solution.
> When you start 100 containers simultaneously - then you can't
> predict, when the process as a whole will succeed (it will produce
> heavy load on all subsystems). Moreover, there is also  server
> restart...

So you really are exporting subtrees of the same filesystem from
multiple containers?  Why?

And are you sure you're not vulnerable to filehandle-guessing attacks?

--b.

> Anyway, I agree with the idea of this patch.
> 
> Please, have a look at new export option I mentioned in "Grace period" thread.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 17:20           ` J. Bruce Fields
@ 2012-04-11 17:33             ` Stanislav Kinsbursky
  2012-04-11 17:40               ` Stanislav Kinsbursky
  2012-04-11 18:20               ` J. Bruce Fields
  0 siblings, 2 replies; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 17:33 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

11.04.2012 21:20, J. Bruce Fields пишет:
> On Wed, Apr 11, 2012 at 02:34:37PM +0400, Stanislav Kinsbursky wrote:
>> 11.04.2012 00:22, J. Bruce Fields пишет:
>>> On Tue, Apr 10, 2012 at 04:46:38PM +0400, Stanislav Kinsbursky wrote:
>>>> 10.04.2012 16:16, Jeff Layton пишет:
>>>>> On Tue, 10 Apr 2012 15:44:42 +0400
>>>>>
>>>>> (sorry about the earlier truncated reply, my MUA has a mind of its own
>>>>> this morning)
>>>>>
>>>>
>>>> OK then. Previous letter confused me a bit.
>>>>
>>>>>
>>>>> TBH, I haven't considered that in depth. That is a valid situation, but
>>>>> one that's discouraged. It's very difficult (and expensive) to
>>>>> sequester off portions of a filesystem for serving.
>>>>>
>>>>> A filehandle is somewhat analogous to a device/inode combination. When
>>>>> the server gets a filehandle, it has to determine "is this within a
>>>>> path that's exported to this host"? That process is called subtree
>>>>> checking. It's expensive and difficult to handle. It's always better to
>>>>> export along filesystem boundaries.
>>>>>
>>>>> My suggestion would be to simply not deal with those cases in this
>>>>> patch. Possibly we could force no_subtree_check when we export an fs
>>>>> with a locks_in_grace option defined.
>>>>>
>>>>
>>>> Sorry, but without dealing with those cases your patch looks a bit... Useless.
>>>> I.e. it changes nothing, it there will be no support from file
>>>> systems, going to be exported.
>>>> But how are you going to push developers to implement these calls?
>>>> Or, even if you'll try to implement them by yourself, how they will
>>>> looks like?
>>>> Simple check only for superblock looks bad to me, because any other
>>>> start of NFSd will lead to grace period for all other containers
>>>> (which uses the same filesystem).
>>>
>>> That's the correct behavior, and it sounds simple to implement.  Let's
>>> just do that.
>>>
>>> If somebody doesn't like the grace period from another container
>>> intruding on their use of the same filesystem, they should either
>>> arrange to export different filesystems (not just different subtrees)
>> >from their containers, or arrange to start all their containers at the
>>> same time so their grace periods overlap.
>>>
>>
>> Starting all at once is not a very good solution.
>> When you start 100 containers simultaneously - then you can't
>> predict, when the process as a whole will succeed (it will produce
>> heavy load on all subsystems). Moreover, there is also  server
>> restart...
>
> So you really are exporting subtrees of the same filesystem from
> multiple containers?  Why?
>

Everything is very-very simple and obvious.
We use "chroot jail". This is the most often and simple setup for containers.
And, basicaly, Virtuozzo container file system consist of two parts: one of them 
is it's private modified data, another part is a template, used for all 
containers based on it (rhel6, for example; when it's content is modified my 
some container - then modified file copied to private part of container, which 
modified the file). Anyway, with properly configured environment it could be as 
many containers on the same file system, as possible. And making sure, that no 
data shared between them is root's responsibility.
This approach gives us journal bottleneck. That's why, in future we are going to 
use "ploop" device (a kind of a very smart loop device) per container. And thus 
this problem with grace period for file systems will disappear.

> And are you sure you're not vulnerable to filehandle-guessing attacks?
>

No, I'm not. Could you give me some examples of such attacks?

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 17:19                   ` J. Bruce Fields
@ 2012-04-11 17:37                     ` Stanislav Kinsbursky
  2012-04-11 18:22                       ` J. Bruce Fields
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 17:37 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

11.04.2012 21:19, J. Bruce Fields пишет:
> On Wed, Apr 11, 2012 at 05:08:46PM +0400, Stanislav Kinsbursky wrote:
>> 11.04.2012 15:48, Jeff Layton пишет:
>>>>>>> One thing that puzzles me at the moment. We have two namespaces to deal
>>>>>>> with -- the network and the mount namespace. With nfs client code,
>>>>>>> everything is keyed off of the net namespace. That's not really the
>>>>>>> case here since we have to deal with a local fs tree as well.
>>>>>>>
>>>>>>> When an nfsd running in a container receives an RPC, how does it
>>>>>>> determine what mount namespace it should do its operations in?
>>>>>>>
>>>>>>
>>>>>> We don't use mount namespaces, so that's why I wasn't thinking about it...
>>>>>> But if we have 2 types of namespaces, then we have to tie  mount namesapce to
>>>>>> network. I.e we can get desired mount namespace from per-net NFSd data.
>>>>>>
>>>>>
>>>>> One thing that Bruce mentioned to me privately is that we could plan to
>>>>> use whatever mount namespace mountd is using within a particular net
>>>>> namespace. That makes some sense since mountd is the final arbiter of
>>>>> who gets access to what.
>>>>>
>>>>
>>>> Could you, please, give some examples? I don't get the idea.
>>>>
>>>
>>> When nfsd gets an RPC call, it needs to decide in what mount namespace
>>> to do the fs operations. How do we decide this?
>>>
>>> Bruce's thought was to look at what mount namespace rpc.mountd is using
>>> and use that, but now that I consider it, it's a bit of a chicken and
>>> egg problem really... nfsd talks to mountd via files in /proc/net/rpc/.
>>> In order to talk to the right mountd, might you need to know what mount
>>> namespace it's operating in?
>>>
>>
>> Not really... /proc itself depens on pid namespace. /proc/net
>> depends on current (!) network namespace. So we can't just lookup
>> for this dentry.
>>
>> But, in spite of nfsd works in initial (init_net and friends)
>> environment, we can get network namespace from RPC request. Having
>> this, we can easily get desired proc entry (proc_net_rpc in
>> sunrpc_net). So it looks like we can actually don't care about mount
>> namespaces - we have our own back door.
>
> OK, good, that's what I was hoping for.  Then we call up to whatever
> mountd is running in our network namespace, and for path lookups it's
> whatever fs namespace that mountd is running in that's going to matter.
>

The problem here, is that mountd is running in pid namespace - not net.
What would happen, if we will have situation like below:

	mountd A	mountd B

	pid_ns		pid_ns
	  |		  |
	mnt_ns		mnt_ns
	  |		  |
	  -----	net_ns ----

Is it possible, BTW?
It yes, that is such construction valid?

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 17:33             ` Stanislav Kinsbursky
@ 2012-04-11 17:40               ` Stanislav Kinsbursky
  2012-04-11 18:20               ` J. Bruce Fields
  1 sibling, 0 replies; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 17:40 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

11.04.2012 21:33, Stanislav Kinsbursky пишет:
> 11.04.2012 21:20, J. Bruce Fields пишет:
>> On Wed, Apr 11, 2012 at 02:34:37PM +0400, Stanislav Kinsbursky wrote:
>>> 11.04.2012 00:22, J. Bruce Fields пишет:
>>>> On Tue, Apr 10, 2012 at 04:46:38PM +0400, Stanislav Kinsbursky wrote:
>>>>> 10.04.2012 16:16, Jeff Layton пишет:
>>>>>> On Tue, 10 Apr 2012 15:44:42 +0400
>>>>>>
>>>>>> (sorry about the earlier truncated reply, my MUA has a mind of its own
>>>>>> this morning)
>>>>>>
>>>>>
>>>>> OK then. Previous letter confused me a bit.
>>>>>
>>>>>>
>>>>>> TBH, I haven't considered that in depth. That is a valid situation, but
>>>>>> one that's discouraged. It's very difficult (and expensive) to
>>>>>> sequester off portions of a filesystem for serving.
>>>>>>
>>>>>> A filehandle is somewhat analogous to a device/inode combination. When
>>>>>> the server gets a filehandle, it has to determine "is this within a
>>>>>> path that's exported to this host"? That process is called subtree
>>>>>> checking. It's expensive and difficult to handle. It's always better to
>>>>>> export along filesystem boundaries.
>>>>>>
>>>>>> My suggestion would be to simply not deal with those cases in this
>>>>>> patch. Possibly we could force no_subtree_check when we export an fs
>>>>>> with a locks_in_grace option defined.
>>>>>>
>>>>>
>>>>> Sorry, but without dealing with those cases your patch looks a bit... Useless.
>>>>> I.e. it changes nothing, it there will be no support from file
>>>>> systems, going to be exported.
>>>>> But how are you going to push developers to implement these calls?
>>>>> Or, even if you'll try to implement them by yourself, how they will
>>>>> looks like?
>>>>> Simple check only for superblock looks bad to me, because any other
>>>>> start of NFSd will lead to grace period for all other containers
>>>>> (which uses the same filesystem).
>>>>
>>>> That's the correct behavior, and it sounds simple to implement.  Let's
>>>> just do that.
>>>>
>>>> If somebody doesn't like the grace period from another container
>>>> intruding on their use of the same filesystem, they should either
>>>> arrange to export different filesystems (not just different subtrees)
>>> >from their containers, or arrange to start all their containers at the
>>>> same time so their grace periods overlap.
>>>>
>>>
>>> Starting all at once is not a very good solution.
>>> When you start 100 containers simultaneously - then you can't
>>> predict, when the process as a whole will succeed (it will produce
>>> heavy load on all subsystems). Moreover, there is also  server
>>> restart...
>>
>> So you really are exporting subtrees of the same filesystem from
>> multiple containers?  Why?
>>
>
> Everything is very-very simple and obvious.
> We use "chroot jail". This is the most often and simple setup for containers.
> And, basicaly, Virtuozzo container file system consist of two parts: one of them
> is it's private modified data, another part is a template, used for all
> containers based on it (rhel6, for example; when it's content is modified my
> some container - then modified file copied to private part of container, which
> modified the file). Anyway, with properly configured environment it could be as
> many containers on the same file system, as possible. And making sure, that no
> data shared between them is root's responsibility.
> This approach gives us journal bottleneck. That's why, in future we are going to
> use "ploop" device (a kind of a very smart loop device) per container. And thus
> this problem with grace period for file systems will disappear.
>

One notice: of course, root can configure a partition per container. But it 
looks too much (especially when container is very tiny). And people don't keep 
in mind such non-obvious things like NFSd grace period while configuring the 
environment.

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 17:33             ` Stanislav Kinsbursky
  2012-04-11 17:40               ` Stanislav Kinsbursky
@ 2012-04-11 18:20               ` J. Bruce Fields
  2012-04-11 19:39                 ` Stanislav Kinsbursky
  1 sibling, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-11 18:20 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Jeff Layton, linux-nfs

On Wed, Apr 11, 2012 at 09:33:59PM +0400, Stanislav Kinsbursky wrote:
> 11.04.2012 21:20, J. Bruce Fields пишет:
> >So you really are exporting subtrees of the same filesystem from
> >multiple containers?  Why?
> >
> 
> Everything is very-very simple and obvious.
> We use "chroot jail". This is the most often and simple setup for containers.
> And, basicaly, Virtuozzo container file system consist of two parts:
> one of them is it's private modified data, another part is a
> template, used for all containers based on it (rhel6, for example;
> when it's content is modified my some container - then modified file
> copied to private part of container, which modified the file).
> Anyway, with properly configured environment it could be as many
> containers on the same file system, as possible. And making sure,
> that no data shared between them is root's responsibility.
> This approach gives us journal bottleneck. That's why, in future we
> are going to use "ploop" device (a kind of a very smart loop device)
> per container. And thus this problem with grace period for file
> systems will disappear.
> 
> >And are you sure you're not vulnerable to filehandle-guessing attacks?
> >
> 
> No, I'm not. Could you give me some examples of such attacks?

Suppose you export subtree /export/foo of filesystem /export to a
client, that client can also easily access anything else in /export; all
it hsa to do is guess the filehandle of the thing it wants to access (or
just guess filehandle of /export itself; root filehandles are likely
especially easily to guess), and then work from there.

(There's a workaround: you can set the subtree_check option.  That
causes a number of problems (renaming a file to a different directory
changes its filehandle, for example, so anyone trying to use it while it
gets renamed gets an unexpected ESTALE).  So we don't recommend it.)

So if all the containers are sharing the same filesystem, then anyone
exporting a subdirectory of its own filesystem has essentially granted
access to everyone's filesystem.

For that reason it's really only recommended to export separate
filesystems....

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 17:37                     ` Stanislav Kinsbursky
@ 2012-04-11 18:22                       ` J. Bruce Fields
  2012-04-11 19:24                         ` Stanislav Kinsbursky
  0 siblings, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-11 18:22 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Jeff Layton, linux-nfs

On Wed, Apr 11, 2012 at 09:37:33PM +0400, Stanislav Kinsbursky wrote:
> 11.04.2012 21:19, J. Bruce Fields пишет:
> >On Wed, Apr 11, 2012 at 05:08:46PM +0400, Stanislav Kinsbursky wrote:
> >>11.04.2012 15:48, Jeff Layton пишет:
> >>>>>>>One thing that puzzles me at the moment. We have two namespaces to deal
> >>>>>>>with -- the network and the mount namespace. With nfs client code,
> >>>>>>>everything is keyed off of the net namespace. That's not really the
> >>>>>>>case here since we have to deal with a local fs tree as well.
> >>>>>>>
> >>>>>>>When an nfsd running in a container receives an RPC, how does it
> >>>>>>>determine what mount namespace it should do its operations in?
> >>>>>>>
> >>>>>>
> >>>>>>We don't use mount namespaces, so that's why I wasn't thinking about it...
> >>>>>>But if we have 2 types of namespaces, then we have to tie  mount namesapce to
> >>>>>>network. I.e we can get desired mount namespace from per-net NFSd data.
> >>>>>>
> >>>>>
> >>>>>One thing that Bruce mentioned to me privately is that we could plan to
> >>>>>use whatever mount namespace mountd is using within a particular net
> >>>>>namespace. That makes some sense since mountd is the final arbiter of
> >>>>>who gets access to what.
> >>>>>
> >>>>
> >>>>Could you, please, give some examples? I don't get the idea.
> >>>>
> >>>
> >>>When nfsd gets an RPC call, it needs to decide in what mount namespace
> >>>to do the fs operations. How do we decide this?
> >>>
> >>>Bruce's thought was to look at what mount namespace rpc.mountd is using
> >>>and use that, but now that I consider it, it's a bit of a chicken and
> >>>egg problem really... nfsd talks to mountd via files in /proc/net/rpc/.
> >>>In order to talk to the right mountd, might you need to know what mount
> >>>namespace it's operating in?
> >>>
> >>
> >>Not really... /proc itself depens on pid namespace. /proc/net
> >>depends on current (!) network namespace. So we can't just lookup
> >>for this dentry.
> >>
> >>But, in spite of nfsd works in initial (init_net and friends)
> >>environment, we can get network namespace from RPC request. Having
> >>this, we can easily get desired proc entry (proc_net_rpc in
> >>sunrpc_net). So it looks like we can actually don't care about mount
> >>namespaces - we have our own back door.
> >
> >OK, good, that's what I was hoping for.  Then we call up to whatever
> >mountd is running in our network namespace, and for path lookups it's
> >whatever fs namespace that mountd is running in that's going to matter.
> >
> 
> The problem here, is that mountd is running in pid namespace - not net.

Every process runs in some pid namespace, and in some net namespace, so
I don't understand what you mean by that.

> What would happen, if we will have situation like below:
> 
> 	mountd A	mountd B
> 
> 	pid_ns		pid_ns
> 	  |		  |
> 	mnt_ns		mnt_ns
> 	  |		  |
> 	  -----	net_ns ----
> 
> Is it possible, BTW?
> It yes, that is such construction valid?

Looks like a mess, no.  I'd expect there to be only one rpc.mountd
running per network namespace.

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 18:22                       ` J. Bruce Fields
@ 2012-04-11 19:24                         ` Stanislav Kinsbursky
  2012-04-11 22:17                           ` J. Bruce Fields
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 19:24 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

11.04.2012 22:22, J. Bruce Fields написал:
>> >  What would happen, if we will have situation like below:
>> >  
>> >  	mountd A	mountd B
>> >  
>> >  	pid_ns		pid_ns
>> >  	|		  |
>> >  	mnt_ns		mnt_ns
>> >  	|		  |
>> >  	-----	net_ns ----
>> >  
>> >  Is it possible, BTW?
>> >  It yes, that is such construction valid?
> Looks like a mess, no.  I'd expect there to be only one rpc.mountd
> running per network namespace.

Then we have to prevent such situations somehow.
Or it is done already?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 18:20               ` J. Bruce Fields
@ 2012-04-11 19:39                 ` Stanislav Kinsbursky
  2012-04-11 19:54                   ` J. Bruce Fields
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-11 19:39 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

11.04.2012 22:20, J. Bruce Fields написал:
> Suppose you export subtree /export/foo of filesystem /export to a
> client, that client can also easily access anything else in /export; all
> it hsa to do is guess the filehandle of the thing it wants to access (or
> just guess filehandle of /export itself; root filehandles are likely
> especially easily to guess), and then work from there.

I see.
So, if I undertand you correctly, filesystem to export should be not 
only one per server, but also should not consist or any other files, 
which are not allowed to export.
Currently, in OpenVZ we have kernel threads per container. Thus even 
kernel threads are in "chroot jail".
But I'll check, do we have such vulnerability.
Thank you.

> (There's a workaround: you can set the subtree_check option.  That
> causes a number of problems (renaming a file to a different directory
> changes its filehandle, for example, so anyone trying to use it while it
> gets renamed gets an unexpected ESTALE).  So we don't recommend it.)
>
> So if all the containers are sharing the same filesystem, then anyone
> exporting a subdirectory of its own filesystem has essentially granted
> access to everyone's filesystem.
>
> For that reason it's really only recommended to export separate
> filesystems....

Thanks. Anyway, we are going to get rid of "chroot jails" and replace 
them by separated loop device.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 19:39                 ` Stanislav Kinsbursky
@ 2012-04-11 19:54                   ` J. Bruce Fields
  0 siblings, 0 replies; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-11 19:54 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Jeff Layton, linux-nfs

On Wed, Apr 11, 2012 at 11:39:09PM +0400, Stanislav Kinsbursky wrote:
> 11.04.2012 22:20, J. Bruce Fields написал:
> >Suppose you export subtree /export/foo of filesystem /export to a
> >client, that client can also easily access anything else in /export; all
> >it hsa to do is guess the filehandle of the thing it wants to access (or
> >just guess filehandle of /export itself; root filehandles are likely
> >especially easily to guess), and then work from there.
> 
> I see.
> So, if I undertand you correctly, filesystem to export should be not
> only one per server, but also should not consist or any other files,
> which are not allowed to export.

Yes, exactly, even in the absence of containers, if you're exporting a
subdirectory of your root filesystem (for example) then you may be
granting access to a lot more than you intended.  So we strongly
recommend exporting separate filesystems unless you're very sure you
know what you're doing....

--b.

> Currently, in OpenVZ we have kernel threads per container. Thus even
> kernel threads are in "chroot jail".
> But I'll check, do we have such vulnerability.
> Thank you.
> 
> >(There's a workaround: you can set the subtree_check option.  That
> >causes a number of problems (renaming a file to a different directory
> >changes its filehandle, for example, so anyone trying to use it while it
> >gets renamed gets an unexpected ESTALE).  So we don't recommend it.)
> >
> >So if all the containers are sharing the same filesystem, then anyone
> >exporting a subdirectory of its own filesystem has essentially granted
> >access to everyone's filesystem.
> >
> >For that reason it's really only recommended to export separate
> >filesystems....
> 
> Thanks. Anyway, we are going to get rid of "chroot jails" and
> replace them by separated loop device.
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 19:24                         ` Stanislav Kinsbursky
@ 2012-04-11 22:17                           ` J. Bruce Fields
  2012-04-12  9:05                             ` Stanislav Kinsbursky
  0 siblings, 1 reply; 29+ messages in thread
From: J. Bruce Fields @ 2012-04-11 22:17 UTC (permalink / raw)
  To: Stanislav Kinsbursky; +Cc: Jeff Layton, linux-nfs

On Wed, Apr 11, 2012 at 11:24:52PM +0400, Stanislav Kinsbursky wrote:
> 11.04.2012 22:22, J. Bruce Fields написал:
> >>>  What would happen, if we will have situation like below:
> >>>  >  	mountd A	mountd B
> >>>  >  	pid_ns		pid_ns
> >>>  	|		  |
> >>>  	mnt_ns		mnt_ns
> >>>  	|		  |
> >>>  	-----	net_ns ----
> >>>  >  Is it possible, BTW?
> >>>  It yes, that is such construction valid?
> >Looks like a mess, no.  I'd expect there to be only one rpc.mountd
> >running per network namespace.
> 
> Then we have to prevent such situations somehow.
> Or it is done already?

If there's a way to prevent it, great, otherwise I think we just tell
people not to do that, and make sure any distro scripts (or whatever)
don't do that.

Actually, my statement above isn't quite right: we do allow running
multiple rpc.mountd processes in one network namespace (see the
--num-threads option to rpc.mountd, which is sometimes necessary for
performance).  But they should all be in the same namespaces and be
using the same configuration.

--b.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg
  2012-04-11 22:17                           ` J. Bruce Fields
@ 2012-04-12  9:05                             ` Stanislav Kinsbursky
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Kinsbursky @ 2012-04-12  9:05 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

12.04.2012 02:17, J. Bruce Fields пишет:
> On Wed, Apr 11, 2012 at 11:24:52PM +0400, Stanislav Kinsbursky wrote:
>> 11.04.2012 22:22, J. Bruce Fields написал:
>>>>>   What would happen, if we will have situation like below:
>>>>>   >   	mountd A	mountd B
>>>>>   >   	pid_ns		pid_ns
>>>>>   	|		  |
>>>>>   	mnt_ns		mnt_ns
>>>>>   	|		  |
>>>>>   	-----	net_ns ----
>>>>>   >   Is it possible, BTW?
>>>>>   It yes, that is such construction valid?
>>> Looks like a mess, no.  I'd expect there to be only one rpc.mountd
>>> running per network namespace.
>>
>> Then we have to prevent such situations somehow.
>> Or it is done already?
>
> If there's a way to prevent it, great, otherwise I think we just tell
> people not to do that, and make sure any distro scripts (or whatever)
> don't do that.
>

Sure, there is a way.
For example, mountd will need to ask kernel is he allowed to start or not. 
Kernel will have to compare starting mountd mount namespace with the mount 
namespace of the previous started mountd process. If they matches or previous 
mountd process is dead (or even never started), then new start is allowed.
And disallowed otherwise.

> Actually, my statement above isn't quite right: we do allow running
> multiple rpc.mountd processes in one network namespace (see the
> --num-threads option to rpc.mountd, which is sometimes necessary for
> performance).  But they should all be in the same namespaces and be
> using the same configuration.
>
> --b.


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2012-04-12  9:06 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-03 12:14 [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg Jeff Layton
2012-04-09 23:18 ` J. Bruce Fields
2012-04-10 11:13   ` Jeff Layton
2012-04-10 13:18     ` J. Bruce Fields
2012-04-10 11:44 ` Stanislav Kinsbursky
2012-04-10 12:05   ` Jeff Layton
2012-04-10 12:18     ` Stanislav Kinsbursky
2012-04-10 12:16   ` Jeff Layton
2012-04-10 12:46     ` Stanislav Kinsbursky
2012-04-10 13:39       ` Jeff Layton
2012-04-10 14:52         ` Stanislav Kinsbursky
2012-04-10 18:45           ` Jeff Layton
2012-04-11 10:09             ` Stanislav Kinsbursky
2012-04-11 11:48               ` Jeff Layton
2012-04-11 13:08                 ` Stanislav Kinsbursky
2012-04-11 17:19                   ` J. Bruce Fields
2012-04-11 17:37                     ` Stanislav Kinsbursky
2012-04-11 18:22                       ` J. Bruce Fields
2012-04-11 19:24                         ` Stanislav Kinsbursky
2012-04-11 22:17                           ` J. Bruce Fields
2012-04-12  9:05                             ` Stanislav Kinsbursky
2012-04-10 20:22       ` J. Bruce Fields
2012-04-11 10:34         ` Stanislav Kinsbursky
2012-04-11 17:20           ` J. Bruce Fields
2012-04-11 17:33             ` Stanislav Kinsbursky
2012-04-11 17:40               ` Stanislav Kinsbursky
2012-04-11 18:20               ` J. Bruce Fields
2012-04-11 19:39                 ` Stanislav Kinsbursky
2012-04-11 19:54                   ` J. Bruce Fields

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.