On Tue, Oct 15 2019, Trond Myklebust wrote: > On Wed, 2019-10-16 at 10:23 +1100, NeilBrown wrote: >> On Tue, Oct 15 2019, Trond Myklebust wrote: >> >> > Hi Neil, >> > >> > On Tue, 2019-10-15 at 10:36 +1100, NeilBrown wrote: >> > > The backchannel RPC requests - that are queued waiting >> > > for the reply to be sent by the "NFSv4 callback" thread - >> > > have a pointer to the xprt, but it is not reference counted. >> > > It is possible for the xprt to be freed while there are >> > > still queued requests. >> > > >> > > I think this has been a problem since >> > > Commit fb7a0b9addbd ("nfs41: New backchannel helper routines") >> > > when the code was introduced, but I suspect it became more of >> > > a problem after >> > > Commit 80b14d5e61ca ("SUNRPC: Add a structure to track multiple >> > > transports") >> > > (or there abouts). >> > > Before this second patch, the nfs client would hold a reference >> > > to >> > > the xprt to keep it alive. After multipath was introduced, >> > > a client holds a reference to a swtich, and the switch can have >> > > multiple >> > > xprts which can be added and removed. >> > > >> > > I'm not sure of all the causal issues, but this patch has >> > > fixed a customer problem were an NFSv4.1 client would run out >> > > of memory with tens of thousands of backchannel rpc requests >> > > queued for an xprt that had been freed. This was a 64K-page >> > > machine so each rpc_rqst consumed more than 128K of memory. >> > > >> > > Fixes: 80b14d5e61ca ("SUNRPC: Add a structure to track multiple >> > > transports") >> > > cc: stable@vger.kernel.org (v4.6) >> > > Signed-off-by: NeilBrown >> > > --- >> > > net/sunrpc/backchannel_rqst.c | 3 ++- >> > > 1 file changed, 2 insertions(+), 1 deletion(-) >> > > >> > > diff --git a/net/sunrpc/backchannel_rqst.c >> > > b/net/sunrpc/backchannel_rqst.c >> > > index 339e8c077c2d..c95ca39688b6 100644 >> > > --- a/net/sunrpc/backchannel_rqst.c >> > > +++ b/net/sunrpc/backchannel_rqst.c >> > > @@ -61,6 +61,7 @@ static void xprt_free_allocation(struct >> > > rpc_rqst >> > > *req) >> > > free_page((unsigned long)xbufp->head[0].iov_base); >> > > xbufp = &req->rq_snd_buf; >> > > free_page((unsigned long)xbufp->head[0].iov_base); >> > > + xprt_put(req->rq_xprt); >> > > kfree(req); >> > > } >> > >> > Would it perhaps make better sense to move the xprt_get() to >> > xprt_lookup_bc_request() and to release it in xprt_free_bc_rqst()? >> >> ... maybe ... >> >> > Otherwise as far as I can tell, we will have freed slots on the >> > xprt- >> > > bc_pa_list that hold a reference to the transport itself, meaning >> > > that >> > the latter never gets released. >> >> Apart from cleanup-on-error paths, xprt_free_allocation() is called: >> - in xprt_destroy_bc() - at most 'max_reqs' times. >> - in xprt_free_bc_rqst(), if the bc_alloc_count >= bc_alloc_max >> >> So when xprt_destroy_bc() is called (from nfs4_destroy_session, via >> xprt_destroy_backchannel()), bc_alloc_max() is reduced, and possibly >> the requests are freed and bc_alloc_count is reduced accordingly. >> If the requests were busy, they won't have been freed then, but will >> then be freed when xprt_free_bc_reqst is called - because >> bc_alloc_max >> has been reduced. >> >> Once nfs4_destroy_session() has been called on all session, and >> xprt_free_bc_rqst() has been called on all active requests, I think >> they should be no requests left to be holding a reference to >> the xprt. >> >> And if there were requests left, and if we change the refcount code >> (like you suggest) so that they weren't holding a reference, then >> they >> would never be freed. - freeing an xprt doesn't clean out the >> bc_pa_list. >> >> Though ... the connection from a session to an xprt isn't permanent >> (I >> guess, based on the rcu_read_lock in nfs4_destroy_session... which >> itself seems a bit odd as it doesn't inc a refcount while holding the >> lock). >> >> So maybe the counts can get out of balance. >> >> Conclusion: I'm not sure the ref counts are entirely correct, but I >> cannot see a benefit from moving the xprt_get/put requests like >> you >> suggest. >> > > I don't buy that conclusion. > > Nothing stops me from changing the value of NFS41_BC_MIN_CALLBACKS to > zero. Why should that affect the existence or not of the transport by > changing the number of references held? That parameter is supposed to > determine the number of pre-allocated requests and nothing else. If you reduce NFS41_BC_MIN_CALLBACKS to zero, bc_alloc_max won't be increased and will remain at zero. So xprt_get_bc_request() will always need to allocate a new request (bc_pa_list will be empty), and xprt_free_bc_rqst() will always free the request - never putting it on the bc_pa_list. Maybe it would make sense to do exactly that, get rid of bc_pa_list, and possibly use a mempool if there is any concern about delays when allocating. I agree that it is a little odd for requests on the pa list to hold a reference to the xprt, and I don't object at all to changing that. My only point is that I think my patch as it stands is correct, that it doesn't introduce new problems (and so might be the simplest thing for -stable). Thanks, NeilBrown > > I do agree with your assessment that destroying the xprt does not > currrently destroy the contents of xprt->bc_pa_list if they are non- > zero, but that would be a bug, and an easy one to fix. > > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@hammerspace.com