From: "NeilBrown" <neilb@suse.de>
To: "Trond Myklebust" <trondmy@hammerspace.com>
Cc: "david@fromorbit.com" <david@fromorbit.com>,
	"bfields@fieldses.org" <bfields@fieldses.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>
Subject: Re: sporadic hangs on generic/186
Date: Thu, 07 Apr 2022 14:11:28 +1000
Message-ID: <164930468885.10985.9905950866720150663@noble.neil.brown.name>
In-Reply-To: <b282c5b98c4518952f62662ea3ba1d4e6ef85f26.camel@hammerspace.com>

On Thu, 07 Apr 2022, Trond Myklebust wrote:
> On Thu, 2022-04-07 at 11:19 +1000, NeilBrown wrote:
> > On Thu, 07 Apr 2022, NeilBrown wrote:
> > > On Thu, 07 Apr 2022, Dave Chinner wrote:
> > > > On Wed, Apr 06, 2022 at 03:54:24PM -0400, J. Bruce Fields wrote:
> > > > > In the last couple days I've started getting hangs on xfstests
> > > > > generic/186 on upstream.  I also notice the test completes after
> > > > > 10+ hours (usually it takes about 5 minutes).  Sometimes this is
> > > > > accompanied by "nfs: RPC call returned error 12" on the client.
> > > > 
> > > > #define ENOMEM          12      /* Out of memory */
> > > > 
> > > > So either the client or the server is running out of memory
> > > > somewhere?
> > > 
> > > Probably the client.  There are a bunch of changes recently which
> > > add __GFP_NORETRY to memory allocations from PF_WQ_WORKERs, because
> > > those allocations can result in deadlocks when swapping over NFS.
> > > This means that kmalloc requests that previously never failed
> > > (because GFP_KERNEL never fails for kernel threads, I think) can
> > > now fail.  This has tickled one bug that I know of.  There are
> > > likely to be more.
> > > 
> > > The RPC code should simply retry these allocations after a short
> > > delay.  HZ/4 is the number that is used in a couple of places.
> > > Possibly there are more places that need to handle -ENOMEM with
> > > rpc_delay().
> > 
> > I had a look through the various places where alloc can now fail.
> > 
> > I think xdr_alloc_bvec() in xprt_send_pagedata() is the most likely
> > cause of a problem here.  I don't think an -ENOMEM from there is
> > caught, so it could likely filter up to NFS and result in the message
> > you got.
> > 
> > I don't think we can easily handle failure there.  We need to stay
> > with GFP_KERNEL and rely on PF_MEMALLOC to make forward progress for
> > swap-over-NFS.
> > 
> > Bruce: can you change that one line back to GFP_KERNEL and see if the
> > problem goes away?
> > 
> 
> Can we please just move the call to xdr_alloc_bvec() out of
> xprt_send_pagedata()? Move the client side allocation into
> xs_stream_prepare_request() and xs_udp_send_request(), then move the
> server side allocation into svc_udp_sendto().
> 
> That makes it possible to handle errors.

Like the below I guess.  Seems sensible, but I don't know the code well
enough to really review it.
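
(For reference, the failable mask in play here -- spelled out explicitly
in the second hunk below, and wrapped up as rpc_task_gfp_mask() in the
line the first hunk removes -- is, as I read it, roughly the following.
This is a sketch of my understanding, not a quote of the actual helper:
only allocations made from a workqueue worker such as rpciod give up
instead of retrying forever.)

	static gfp_t rpc_task_gfp_mask(void)
	{
		/* rpciod runs as a PF_WQ_WORKER, so its allocations must
		 * be allowed to fail rather than block memory reclaim */
		if (current->flags & PF_WQ_WORKER)
			return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
		return GFP_KERNEL;
	}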

> 
> > The other problem I found is that rpc_alloc_task() can now fail, but
> > rpc_new_task() assumes that it never will.  If it does, then we get
> > a NULL deref.
> > 
> > I don't think rpc_new_task() can ever be called from the rpciod work
> > queue, so it is safe to just use a mempool with GFP_KERNEL like we
> > did before.
> > 
> No. We shouldn't ever use mempools with GFP_KERNEL.

Why not?  mempools with GFP_KERNEL make perfect sense outside of the
rpciod and nfsiod threads.
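
To illustrate what I mean (a sketch, not a patch): allocating the task
from a mempool with a sleepable mask means mempool_alloc() waits for an
element to be returned to the pool rather than returning NULL, e.g.

	static struct rpc_task *rpc_alloc_task(void)
	{
		/* may sleep until a task is freed back to the pool,
		 * but never fails */
		return mempool_alloc(rpc_task_mempool, GFP_KERNEL);
	}

That only deadlocks if the waiter is itself the thread that is expected
to free tasks back into the pool -- i.e. rpciod or nfsiod -- which is
why it is only those threads that need a mask that can fail.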

> 
> Most, if not all of the callers of rpc_run_task() are still capable of
> dealing with errors, and ditto for the callers of rpc_run_bc_task().

Yes, they can deal with errors.  But in many cases they do so by passing
the error up the call stack, so we could start getting ENOMEM from
system calls like stat().  I don't think that is a good idea.
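
Where the allocation happens deep inside the state machine, I would much
rather see the failure absorbed and retried, along the lines of (a rough
sketch only -- the exact state to restart from depends on the call site):

	switch (task->tk_status) {
	case -ENOMEM:
		rpc_delay(task, HZ >> 2);	/* back off for HZ/4 */
		task->tk_status = 0;
		task->tk_action = call_encode;	/* retry the failed step */
		return;
	}

rather than handing -ENOMEM back to userspace.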

Thanks,
NeilBrown


> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 
> 

diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
index 05b38bf68316..71ba4cf513bc 100644
--- a/net/sunrpc/socklib.c
+++ b/net/sunrpc/socklib.c
@@ -221,12 +221,6 @@ static int xprt_send_kvec(struct socket *sock, struct msghdr *msg,
 static int xprt_send_pagedata(struct socket *sock, struct msghdr *msg,
 			      struct xdr_buf *xdr, size_t base)
 {
-	int err;
-
-	err = xdr_alloc_bvec(xdr, rpc_task_gfp_mask());
-	if (err < 0)
-		return err;
-
 	iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec, xdr_buf_pagecount(xdr),
 		      xdr->page_len + xdr->page_base);
 	return xprt_sendmsg(sock, msg, base + xdr->page_base);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 78af7518f263..2661828f7a85 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -828,6 +828,9 @@ xs_stream_prepare_request(struct rpc_rqst *req)
 	xdr_free_bvec(&req->rq_rcv_buf);
 	req->rq_task->tk_status = xdr_alloc_bvec(
 		&req->rq_rcv_buf, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+	if (req->rq_task->tk_status == 0)
+		req->rq_task->tk_status = xdr_alloc_bvec(
+			&req->rq_snd_buf, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
 }
 
 /*
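
(Not part of the patch above -- just a rough sketch of what I imagine
the corresponding change in the UDP paths Trond mentioned would look
like: do the bvec allocation at the top of the send routine and fail the
request cleanly if it fails.  Exact placement and error handling would
need checking against the real code.)

	/* in xs_udp_send_request(), before any data hits the wire: */
	status = xdr_alloc_bvec(&req->rq_snd_buf, rpc_task_gfp_mask());
	if (status < 0)
		return status;

	/* and in svc_udp_sendto(), where plain GFP_KERNEL is fine because
	 * nfsd threads are not part of the memory-reclaim path: */
	err = xdr_alloc_bvec(&rqstp->rq_res, GFP_KERNEL);
	if (err < 0)
		return err;	/* after undoing any locking taken above */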
