* Re: [RFC PATCH] rados_cluster: add a "design" manpage
       [not found] ` <20180531213733.GB4654@fieldses.org>
@ 2018-06-01 10:42   ` Jeff Layton
  2018-06-01 14:17     ` J. Bruce Fields
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Layton @ 2018-06-01 10:42 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: devel, sostapov, Supriti.Singh, open list:NFS, SUNRPC, AND...

On Thu, 2018-05-31 at 17:37 -0400, J. Bruce Fields wrote:
> On Wed, May 23, 2018 at 08:21:40AM -0400, Jeff Layton wrote:
> > From: Jeff Layton <jlayton@redhat.com>
> > 
> > Bruce asked for better design documentation, so this is my attempt at
> > it. Let me know what you think. I'll probably end up squashing this into
> > one of the code patches but for now I'm sending this separately to see
> > if it helps clarify things.
> > 
> > Suggestions and feedback are welcome.
> > 
> > Change-Id: I53cc77f66b2407c2083638e5760666639ba1fd57
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  src/doc/man/ganesha-rados-cluster.rst | 227 ++++++++++++++++++++++++++
> >  1 file changed, 227 insertions(+)
> >  create mode 100644 src/doc/man/ganesha-rados-cluster.rst
> > 
> > diff --git a/src/doc/man/ganesha-rados-cluster.rst b/src/doc/man/ganesha-rados-cluster.rst
> > new file mode 100644
> > index 000000000000..1ba2d3c29093
> > --- /dev/null
> > +++ b/src/doc/man/ganesha-rados-cluster.rst
> > @@ -0,0 +1,227 @@
> > +==============================================================================
> > +ganesha-rados-cluster-design -- Clustered RADOS Recovery Backend Design
> > +==============================================================================
> > +
> > +.. program:: ganesha-rados-cluster-design
> > +
> > +This document aims to explain the theory and design behind the
> > +rados_cluster recovery backend, which coordinates grace period
> > +enforcement among multiple, independent NFS servers.
> > +
> > +In order to understand the clustered recovery backend, it's first necessary
> > +to understand how recovery works with a single server:
> > +
> > +Singleton Server Recovery
> > +-------------------------
> > +NFSv4 is a lease-based protocol. Clients set up a relationship to the
> > +server and must periodically renew their lease in order to maintain
> > +their ephemeral state (open files, locks, delegations or layouts).
> > +
> > +When a singleton NFS server is restarted, any ephemeral state is lost. When
> > +the server comes back online, NFS clients detect that the server has
> > +been restarted and will reclaim the ephemeral state that they held at the
> > +time of their last contact with the server.
> > +
> > +Singleton Grace Period
> > +----------------------
> > +
> > +In order to ensure that we don't end up with conflicts, clients are
> > +barred from acquiring any new state while in the Recovery phase. Only
> > +reclaim operations are allowed.
> > +
> > +This period of time is called the **grace period**. Most NFS servers
> > +have a grace period that lasts around two lease periods, however
> 
> knfsd's is one lease period, who does two?
> 
> (Still catching up on the rest.  Looks good.)
> 
> --b.

(cc'ing linux-nfs)

Thanks for having a look. Hmm...you're right.

        nn->nfsd4_lease = 90;   /* default lease time */                        
        nn->nfsd4_grace = 90;                                                   

nit: We should probably add a #define'd constant for that at some
point...but, might this be problematic?

In the pessimal case, you might renew your lease just before the server
crashes. It then comes back up quickly and starts the grace period. By
the time the client contacts the server again the grace period is almost
over and you may have very little time to actually do any reclaim.
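
To make that concrete, here is a back-of-the-envelope sketch using the
90s defaults quoted above (purely illustrative; these variables are made
up for the example, not nfsd code):

        #include <stdio.h>

        /* Illustrative worst-case timing only. */
        int main(void)
        {
                int lease = 90; /* knfsd default lease time   */
                int grace = 90; /* knfsd default grace period */

                /*
                 * A client that renewed just before the crash has no
                 * reason to contact the server again until a full lease
                 * period later, so it discovers the restart at roughly
                 * t = lease after boot...
                 */
                int discovery = lease;

                /* ...leaving almost none of the grace period to reclaim in. */
                printf("worst-case reclaim window: ~%d seconds\n",
                       grace - discovery);
                return 0;
        }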

ISTR that when we were working on the server at PD we had determined
that we needed around 2 grace periods + a small fudge factor. I don't
recall the details of how we determined it though.

Even worse: 

        $ cat /proc/sys/fs/lease-break-time 
        45

Maybe we should be basing the v4 lease time on the lease-break-time
value? It seems like we ought to revoke delegations after two lease
periods rather than after half of one.
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [RFC PATCH] rados_cluster: add a "design" manpage
  2018-06-01 10:42   ` [RFC PATCH] rados_cluster: add a "design" manpage Jeff Layton
@ 2018-06-01 14:17     ` J. Bruce Fields
  2018-06-01 14:33       ` Jeff Layton
  0 siblings, 1 reply; 7+ messages in thread
From: J. Bruce Fields @ 2018-06-01 14:17 UTC (permalink / raw)
  To: Jeff Layton; +Cc: devel, sostapov, Supriti.Singh, open list:NFS, SUNRPC, AND...

On Fri, Jun 01, 2018 at 06:42:37AM -0400, Jeff Layton wrote:
> Thanks for having a look. Hmm...you're right.
> 
>         nn->nfsd4_lease = 90;   /* default lease time */                        
>         nn->nfsd4_grace = 90;                                                   
> 
> nit: We should probably add a #define'd constant for that at some
> point...but, might this be problematic?
> 
> In the pessimal case, you might renew your lease just before the server
> crashes. It then comes back up quickly and starts the grace period. By
> the time the client contacts the server again the grace period is almost
> over and you may have very little time to actually do any reclaim.
> 
> ISTR that when we were working on the server at PD we had determined
> that we needed around 2 grace periods + a small fudge factor. I don't
> recall the details of how we determined it though.

OK, I'd be interested in any of that.

If we decide we need a larger grace/lease ratio, we may want to adjust
default lease down at same time to avoid increasing grace period, as
that's always an annoyance.

> Even worse: 
> 
>         $ cat /proc/sys/fs/lease-break-time 
>         45
> 
> Maybe we should be basing the v4 lease time on the lease-break-time
> value? It seems like we ought to revoke delegations after two lease
> periods rather than after half of one.

No, we ignore lease-break-time on delegations.  See the comment in
fs/nfsd/nfs4state.c:nfsd_break_deleg_cb():

	/*
	 * We don't want the locks code to timeout the lease for us;
	 * we'll remove it ourself if a delegation isn't returned
	 * in time:
	 */
	fl->fl_break_time = 0;


--b.

* Re: [RFC PATCH] rados_cluster: add a "design" manpage
  2018-06-01 14:17     ` J. Bruce Fields
@ 2018-06-01 14:33       ` Jeff Layton
  2018-06-01 14:46         ` J. Bruce Fields
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Layton @ 2018-06-01 14:33 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: devel, sostapov, Supriti.Singh, open list:NFS, SUNRPC, AND...

On Fri, 2018-06-01 at 10:17 -0400, J. Bruce Fields wrote:
> On Fri, Jun 01, 2018 at 06:42:37AM -0400, Jeff Layton wrote:
> > Thanks for having a look. Hmm...you're right.
> > 
> >         nn->nfsd4_lease = 90;   /* default lease time */                        
> >         nn->nfsd4_grace = 90;                                                   
> > 
> > nit: We should probably add a #define'd constant for that at some
> > point...but, might this be problematic?
> > 
> > In the pessimal case, you might renew your lease just before the server
> > crashes. It then comes back up quickly and starts the grace period. By
> > the time the client contacts the server again the grace period is almost
> > over and you may have very little time to actually do any reclaim.
> > 
> > ISTR that when we were working on the server at PD we had determined
> > that we needed around 2 grace periods + a small fudge factor. I don't
> > recall the details of how we determined it though.
> 
> OK, I'd be interested in any of that.
> 
> If we decide we need a larger grace/lease ratio, we may want to adjust
> default lease down at same time to avoid increasing grace period, as
> that's always an annoyance.
> 

I think we just figured that we ought to give the client a full lease
period to reclaim everything, in case it holds a lot of state. You also
need to give it a full lease period to figure out that the server has
rebooted, and then I think we added a few seconds to account for "other
factors" (transport delays and whatnot).

> > Even worse: 
> > 
> >         $ cat /proc/sys/fs/lease-break-time 
> >         45
> > 
> > Maybe we should be basing the v4 lease time on the lease-break-time
> > value? It seems like we ought to revoke delegations after two lease
> > periods rather than after half of one.
> 
> No, we ignore lease-break-time on delegations.  See the comment in
> fs/nfsd/nfs4state.c:nfsd_break_deleg_cb():
> 
> 	/*
> 	 * We don't want the locks code to timeout the lease for us;
> 	 * we'll remove it ourself if a delegation isn't returned
> 	 * in time:
> 	 */
> 	fl->fl_break_time = 0;
> 

Ahh thanks, I had forgotten that. Whew!

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [RFC PATCH] rados_cluster: add a "design" manpage
  2018-06-01 14:33       ` Jeff Layton
@ 2018-06-01 14:46         ` J. Bruce Fields
  2018-06-08 16:33           ` J. Bruce Fields
  0 siblings, 1 reply; 7+ messages in thread
From: J. Bruce Fields @ 2018-06-01 14:46 UTC (permalink / raw)
  To: Jeff Layton; +Cc: devel, sostapov, Supriti.Singh, open list:NFS, SUNRPC, AND...

On Fri, Jun 01, 2018 at 10:33:14AM -0400, Jeff Layton wrote:
> On Fri, 2018-06-01 at 10:17 -0400, J. Bruce Fields wrote:
> > On Fri, Jun 01, 2018 at 06:42:37AM -0400, Jeff Layton wrote:
> > > Thanks for having a look. Hmm...you're right.
> > > 
> > >         nn->nfsd4_lease = 90;   /* default lease time */                        
> > >         nn->nfsd4_grace = 90;                                                   
> > > 
> > > nit: We should probably add a #define'd constant for that at some
> > > point...but, might this be problematic?
> > > 
> > > In the pessimal case, you might renew your lease just before the server
> > > crashes. It then comes back up quickly and starts the grace period. By
> > > the time the client contacts the server again the grace period is almost
> > > over and you may have very little time to actually do any reclaim.
> > > 
> > > ISTR that when we were working on the server at PD we had determined
> > > that we needed around 2 grace periods + a small fudge factor. I don't
> > > recall the details of how we determined it though.
> > 
> > OK, I'd be interested in any of that.
> > 
> > If we decide we need a larger grace/lease ratio, we may want to adjust
> > default lease down at same time to avoid increasing grace period, as
> > that's always an annoyance.
> > 
> 
> I think we just figured that we ought to give the client a full lease
> period to reclaim everything, in case it holds a lot of state. You also
> need to give it a full lease period to figure out that the server has
> rebooted, and then I think we added a few seconds to account for "other
> factors" (transport delays and whatnot).

I think it would also be easy to extend it on demand.

So, for example: end the grace period when:

	(one lease period has passed *and* we've received no reclaim
	request in the last second) *or*
	two lease periods have passed

Maybe think about the exact choice of numbers a little.
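
A minimal kernel-style sketch of that condition (hypothetical helper and
flag names; the actual patch later in this thread implements it in
inverted form, as clients_still_reclaiming() clearing a
somebody_reclaimed flag from the laundromat):

	/*
	 * Hypothetical: end the grace period once one lease period has
	 * passed with no reclaim seen in the last second, or after two
	 * lease periods no matter what.
	 */
	static bool grace_should_end(struct nfsd_net *nn, bool reclaim_seen)
	{
		unsigned long now = get_seconds();

		if (time_after(now, nn->boot_time + 2 * nn->nfsd4_lease))
			return true;
		return time_after(now, nn->boot_time + nn->nfsd4_lease) &&
		       !reclaim_seen;
	}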

--b.

* Re: [RFC PATCH] rados_cluster: add a "design" manpage
  2018-06-01 14:46         ` J. Bruce Fields
@ 2018-06-08 16:33           ` J. Bruce Fields
  2018-06-14 10:01             ` Jeff Layton
  0 siblings, 1 reply; 7+ messages in thread
From: J. Bruce Fields @ 2018-06-08 16:33 UTC (permalink / raw)
  To: Jeff Layton; +Cc: devel, sostapov, Supriti.Singh, open list:NFS, SUNRPC, AND...

On Fri, Jun 01, 2018 at 10:46:55AM -0400, J. Bruce Fields wrote:
> I think it would also be easy to extend it on demand.
> 
> So, for example: end the grace period when:
> 
> 	(one lease period has passed *and* we've received no reclaim
> 	request in the last second) *or*
> 	two lease periods have passed
> 
> Maybe think about the exact choice of numbers a little.

Something like this.  (Totally untested.)

I also wonder if 90 seconds is overkill as our default lease time.  What
does anyone else do?  Maybe I'll just halve it to 45s at the same time.

--b.

commit 90c471ab0150
Author: J. Bruce Fields <bfields@redhat.com>
Date:   Fri Jun 8 12:28:47 2018 -0400

    nfsd4: extend reclaim period for reclaiming clients
    
    If the client is only renewing state a little sooner than once a lease
    period, then it might not discover the server has restarted till close
    to the end of the grace period, and might run out of time to do the
    actual reclaim.
    
    Extend the grace period by a second each time we notice there are
    clients still trying to reclaim, up to a limit of another whole lease
    period.
    
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index 36358d435cb0..426f55005697 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -102,6 +102,7 @@ struct nfsd_net {
 
 	time_t nfsd4_lease;
 	time_t nfsd4_grace;
+	bool somebody_reclaimed;
 
 	bool nfsd_net_up;
 	bool lockd_up;
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 5d99e8810b85..1929f85b8269 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -354,6 +354,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	struct svc_fh *resfh = NULL;
 	struct net *net = SVC_NET(rqstp);
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+	bool reclaim = false;
 
 	dprintk("NFSD: nfsd4_open filename %.*s op_openowner %p\n",
 		(int)open->op_fname.len, open->op_fname.data,
@@ -424,6 +425,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 			if (status)
 				goto out;
 			open->op_openowner->oo_flags |= NFS4_OO_CONFIRMED;
+			reclaim = true;
 		case NFS4_OPEN_CLAIM_FH:
 		case NFS4_OPEN_CLAIM_DELEG_CUR_FH:
 			status = do_open_fhandle(rqstp, cstate, open);
@@ -452,6 +454,8 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	WARN(status && open->op_created,
 	     "nfsd4_process_open2 failed to open newly-created file! status=%u\n",
 	     be32_to_cpu(status));
+	if (reclaim && !status)
+		nn->somebody_reclaimed = true;
 out:
 	if (resfh && resfh != &cstate->current_fh) {
 		fh_dup2(&cstate->current_fh, resfh);
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 59ae65d3eec3..ffe816fe6e89 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4689,6 +4689,28 @@ nfsd4_end_grace(struct nfsd_net *nn)
 	 */
 }
 
+/*
+ * If we've waited a lease period but there are still clients trying to
+ * reclaim, wait a little longer to give them a chance to finish.
+ */
+static bool clients_still_reclaiming(struct nfsd_net *nn)
+{
+	unsigned long now = get_seconds();
+	unsigned long double_grace_period_end = nn->boot_time +
+						2 * nn->nfsd4_lease;
+
+	if (!nn->somebody_reclaimed)
+		return false;
+	nn->somebody_reclaimed = false;
+	/*
+	 * If we've given them *two* lease times to reclaim, and they're
+	 * still not done, give up:
+	 */
+	if (time_after(now, double_grace_period_end))
+		return false;
+	return true;
+}
+
 static time_t
 nfs4_laundromat(struct nfsd_net *nn)
 {
@@ -4702,6 +4724,11 @@ nfs4_laundromat(struct nfsd_net *nn)
 	time_t t, new_timeo = nn->nfsd4_lease;
 
 	dprintk("NFSD: laundromat service - starting\n");
+
+	if (clients_still_reclaiming(nn)) {
+		new_timeo = 0;
+		goto out;
+	}
 	nfsd4_end_grace(nn);
 	INIT_LIST_HEAD(&reaplist);
 	spin_lock(&nn->client_lock);
@@ -4799,7 +4826,7 @@ nfs4_laundromat(struct nfsd_net *nn)
 		posix_unblock_lock(&nbl->nbl_lock);
 		free_blocked_lock(nbl);
 	}
-
+out:
 	new_timeo = max_t(time_t, new_timeo, NFSD_LAUNDROMAT_MINTIMEOUT);
 	return new_timeo;
 }
@@ -6049,6 +6076,8 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 	case 0: /* success! */
 		nfs4_inc_and_copy_stateid(&lock->lk_resp_stateid, &lock_stp->st_stid);
 		status = 0;
+		if (lock->lk_reclaim)
+			nn->somebody_reclaimed = true;
 		break;
 	case FILE_LOCK_DEFERRED:
 		nbl = NULL;
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index d107b4426f7e..5f22476cf371 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1239,6 +1239,7 @@ static __net_init int nfsd_init_net(struct net *net)
 		goto out_idmap_error;
 	nn->nfsd4_lease = 90;	/* default lease time */
 	nn->nfsd4_grace = 90;
+	nn->somebody_reclaimed = false;
 	nn->clverifier_counter = prandom_u32();
 	nn->clientid_counter = prandom_u32();
 

* Re: [RFC PATCH] rados_cluster: add a "design" manpage
  2018-06-08 16:33           ` J. Bruce Fields
@ 2018-06-14 10:01             ` Jeff Layton
  2018-06-14 16:32               ` J. Bruce Fields
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Layton @ 2018-06-14 10:01 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: devel, sostapov, Supriti.Singh, open list:NFS, SUNRPC, AND...

On Fri, 2018-06-08 at 12:33 -0400, J. Bruce Fields wrote:
> On Fri, Jun 01, 2018 at 10:46:55AM -0400, J. Bruce Fields wrote:
> > I think it would also be easy to extend it on demand.
> > 
> > So, for example: end the grace period when:
> > 
> > 	(one lease period has passed *and* we've received no reclaim
> > 	request in the last second) *or*
> > 	two lease periods have passed
> > 
> > Maybe think about the exact choice of numbers a little.
> 
> Something like this.  (Totally untested.)
> 
> I also wonder if 90 seconds is overkill as our default lease time.  What
> does anyone else do?  Maybe I'll just halve it to 45s at the same time.
> 

Yeah, that might not be a bad idea. Worth experimenting with anyway.

> --b.
> 
> commit 90c471ab0150
> Author: J. Bruce Fields <bfields@redhat.com>
> Date:   Fri Jun 8 12:28:47 2018 -0400
> 
>     nfsd4: extend reclaim period for reclaiming clients
>     
>     If the client is only renewing state a little sooner than once a lease
>     period, then it might not discover the server has restarted till close
>     to the end of the grace period, and might run out of time to do the
>     actual reclaim.
>     
>     Extend the grace period by a second each time we notice there are
>     clients still trying to reclaim, up to a limit of another whole lease
>     period.
>     
>     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 

Seems like a reasonable thing to do. Ack from my standpoint.

> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index 36358d435cb0..426f55005697 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -102,6 +102,7 @@ struct nfsd_net {
>  
>  	time_t nfsd4_lease;
>  	time_t nfsd4_grace;
> +	bool somebody_reclaimed;
> 
>  
>  	bool nfsd_net_up;
>  	bool lockd_up;
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 5d99e8810b85..1929f85b8269 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -354,6 +354,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	struct svc_fh *resfh = NULL;
>  	struct net *net = SVC_NET(rqstp);
>  	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> +	bool reclaim = false;
> 

I know this is a "best effort" sort of thing, but should this be done
with atomic loads and stores (READ_ONCE/WRITE_ONCE)?

>  
>  	dprintk("NFSD: nfsd4_open filename %.*s op_openowner %p\n",
>  		(int)open->op_fname.len, open->op_fname.data,
> @@ -424,6 +425,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  			if (status)
>  				goto out;
>  			open->op_openowner->oo_flags |= NFS4_OO_CONFIRMED;
> +			reclaim = true;
>  		case NFS4_OPEN_CLAIM_FH:
>  		case NFS4_OPEN_CLAIM_DELEG_CUR_FH:
>  			status = do_open_fhandle(rqstp, cstate, open);
> @@ -452,6 +454,8 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	WARN(status && open->op_created,
>  	     "nfsd4_process_open2 failed to open newly-created file! status=%u\n",
>  	     be32_to_cpu(status));
> +	if (reclaim && !status)
> +		nn->somebody_reclaimed = true;
>  out:
>  	if (resfh && resfh != &cstate->current_fh) {
>  		fh_dup2(&cstate->current_fh, resfh);
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 59ae65d3eec3..ffe816fe6e89 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4689,6 +4689,28 @@ nfsd4_end_grace(struct nfsd_net *nn)
>  	 */
>  }
>  
> +/*
> + * If we've waited a lease period but there are still clients trying to
> + * reclaim, wait a little longer to give them a chance to finish.
> + */
> +static bool clients_still_reclaiming(struct nfsd_net *nn)
> +{
> +	unsigned long now = get_seconds();
> +	unsigned long double_grace_period_end = nn->boot_time +
> +						2 * nn->nfsd4_lease;
> +
> +	if (!nn->somebody_reclaimed)
> +		return false;
> +	nn->somebody_reclaimed = false;
> +	/*
> +	 * If we've given them *two* lease times to reclaim, and they're
> +	 * still not done, give up:
> +	 */
> +	if (time_after(now, double_grace_period_end))
> +		return false;
> +	return true;
> +}
> +
>  static time_t
>  nfs4_laundromat(struct nfsd_net *nn)
>  {
> @@ -4702,6 +4724,11 @@ nfs4_laundromat(struct nfsd_net *nn)
>  	time_t t, new_timeo = nn->nfsd4_lease;
>  
>  	dprintk("NFSD: laundromat service - starting\n");
> +
> +	if (clients_still_reclaiming(nn)) {
> +		new_timeo = 0;
> +		goto out;
> +	}
>  	nfsd4_end_grace(nn);
>  	INIT_LIST_HEAD(&reaplist);
>  	spin_lock(&nn->client_lock);
> @@ -4799,7 +4826,7 @@ nfs4_laundromat(struct nfsd_net *nn)
>  		posix_unblock_lock(&nbl->nbl_lock);
>  		free_blocked_lock(nbl);
>  	}
> -
> +out:
>  	new_timeo = max_t(time_t, new_timeo, NFSD_LAUNDROMAT_MINTIMEOUT);
>  	return new_timeo;
>  }
> @@ -6049,6 +6076,8 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	case 0: /* success! */
>  		nfs4_inc_and_copy_stateid(&lock->lk_resp_stateid, &lock_stp->st_stid);
>  		status = 0;
> +		if (lock->lk_reclaim)
> +			nn->somebody_reclaimed = true;
>  		break;
>  	case FILE_LOCK_DEFERRED:
>  		nbl = NULL;
> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index d107b4426f7e..5f22476cf371 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -1239,6 +1239,7 @@ static __net_init int nfsd_init_net(struct net *net)
>  		goto out_idmap_error;
>  	nn->nfsd4_lease = 90;	/* default lease time */
>  	nn->nfsd4_grace = 90;
> +	nn->somebody_reclaimed = false;
>  	nn->clverifier_counter = prandom_u32();
>  	nn->clientid_counter = prandom_u32();
>  

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [RFC PATCH] rados_cluster: add a "design" manpage
  2018-06-14 10:01             ` Jeff Layton
@ 2018-06-14 16:32               ` J. Bruce Fields
  0 siblings, 0 replies; 7+ messages in thread
From: J. Bruce Fields @ 2018-06-14 16:32 UTC (permalink / raw)
  To: Jeff Layton; +Cc: devel, sostapov, Supriti.Singh, open list:NFS, SUNRPC, AND...

On Thu, Jun 14, 2018 at 06:01:27AM -0400, Jeff Layton wrote:
> I know this is a "best effort" sort of thing, but should this be done
> with atomic loads and stores (READ_ONCE/WRITE_ONCE)?

We're just setting it to 1 when a reclaim comes in, and then once a
second checking the result and clearing it.  So yes, that's a slightly
sloppy way of checking, once a second, whether a reclaim has happened.

I *think* the worst that could happen is something like:

	read 0 from somebody_reclaimed
					reclaim writes 1 to somebody_reclaimed
	write 0 to somebody_reclaimed

and then a client that's reclaiming only very close to the 1-second mark
could get overlooked.

I'm assuming in any reasonable situation that a client can manage at
least 10-100 reclaims a second, and that the race window is probably
several orders of magnitude less than a second (even after taking into
account weird memory ordering).

Would READ/WRITE_ONCE do it right?  I'm not sure it's worth it.
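
For reference, the READ_ONCE/WRITE_ONCE version Jeff is asking about
would look roughly like this (hypothetical, not part of the posted
patch; it only makes the accesses tear-free and doesn't close the
lost-update window described above):

	/* in the reclaim paths (nfsd4_open / nfsd4_lock): */
	WRITE_ONCE(nn->somebody_reclaimed, true);

	/* in clients_still_reclaiming(): */
	if (!READ_ONCE(nn->somebody_reclaimed))
		return false;
	WRITE_ONCE(nn->somebody_reclaimed, false);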

--b.


> 
> >  
> >  	dprintk("NFSD: nfsd4_open filename %.*s op_openowner %p\n",
> >  		(int)open->op_fname.len, open->op_fname.data,
> > @@ -424,6 +425,7 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  			if (status)
> >  				goto out;
> >  			open->op_openowner->oo_flags |= NFS4_OO_CONFIRMED;
> > +			reclaim = true;
> >  		case NFS4_OPEN_CLAIM_FH:
> >  		case NFS4_OPEN_CLAIM_DELEG_CUR_FH:
> >  			status = do_open_fhandle(rqstp, cstate, open);
> > @@ -452,6 +454,8 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  	WARN(status && open->op_created,
> >  	     "nfsd4_process_open2 failed to open newly-created file! status=%u\n",
> >  	     be32_to_cpu(status));
> > +	if (reclaim && !status)
> > +		nn->somebody_reclaimed = true;
> >  out:
> >  	if (resfh && resfh != &cstate->current_fh) {
> >  		fh_dup2(&cstate->current_fh, resfh);
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 59ae65d3eec3..ffe816fe6e89 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -4689,6 +4689,28 @@ nfsd4_end_grace(struct nfsd_net *nn)
> >  	 */
> >  }
> >  
> > +/*
> > + * If we've waited a lease period but there are still clients trying to
> > + * reclaim, wait a little longer to give them a chance to finish.
> > + */
> > +static bool clients_still_reclaiming(struct nfsd_net *nn)
> > +{
> > +	unsigned long now = get_seconds();
> > +	unsigned long double_grace_period_end = nn->boot_time +
> > +						2 * nn->nfsd4_lease;
> > +
> > +	if (!nn->somebody_reclaimed)
> > +		return false;
> > +	nn->somebody_reclaimed = false;
> > +	/*
> > +	 * If we've given them *two* lease times to reclaim, and they're
> > +	 * still not done, give up:
> > +	 */
> > +	if (time_after(now, double_grace_period_end))
> > +		return false;
> > +	return true;
> > +}
> > +
> >  static time_t
> >  nfs4_laundromat(struct nfsd_net *nn)
> >  {
> > @@ -4702,6 +4724,11 @@ nfs4_laundromat(struct nfsd_net *nn)
> >  	time_t t, new_timeo = nn->nfsd4_lease;
> >  
> >  	dprintk("NFSD: laundromat service - starting\n");
> > +
> > +	if (clients_still_reclaiming(nn)) {
> > +		new_timeo = 0;
> > +		goto out;
> > +	}
> >  	nfsd4_end_grace(nn);
> >  	INIT_LIST_HEAD(&reaplist);
> >  	spin_lock(&nn->client_lock);
> > @@ -4799,7 +4826,7 @@ nfs4_laundromat(struct nfsd_net *nn)
> >  		posix_unblock_lock(&nbl->nbl_lock);
> >  		free_blocked_lock(nbl);
> >  	}
> > -
> > +out:
> >  	new_timeo = max_t(time_t, new_timeo, NFSD_LAUNDROMAT_MINTIMEOUT);
> >  	return new_timeo;
> >  }
> > @@ -6049,6 +6076,8 @@ nfsd4_lock(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  	case 0: /* success! */
> >  		nfs4_inc_and_copy_stateid(&lock->lk_resp_stateid, &lock_stp->st_stid);
> >  		status = 0;
> > +		if (lock->lk_reclaim)
> > +			nn->somebody_reclaimed = true;
> >  		break;
> >  	case FILE_LOCK_DEFERRED:
> >  		nbl = NULL;
> > diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> > index d107b4426f7e..5f22476cf371 100644
> > --- a/fs/nfsd/nfsctl.c
> > +++ b/fs/nfsd/nfsctl.c
> > @@ -1239,6 +1239,7 @@ static __net_init int nfsd_init_net(struct net *net)
> >  		goto out_idmap_error;
> >  	nn->nfsd4_lease = 90;	/* default lease time */
> >  	nn->nfsd4_grace = 90;
> > +	nn->somebody_reclaimed = false;
> >  	nn->clverifier_counter = prandom_u32();
> >  	nn->clientid_counter = prandom_u32();
> >  
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
