From: "NeilBrown" <neilb@suse.de>
To: "Trond Myklebust" <trondmy@hammerspace.com>
Cc: "bfields@fieldses.org" <bfields@fieldses.org>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>
Subject: Re: sporadic hangs on generic/186
Date: Fri, 08 Apr 2022 12:46:08 +1000
Message-ID: <164938596863.10985.998515507989861871@noble.neil.brown.name>
In-Reply-To: <43aace26d3a09f868f732b2ad94ca2dbf90f50bd.camel@hammerspace.com>

On Thu, 07 Apr 2022, Trond Myklebust wrote:
> On Thu, 2022-04-07 at 14:11 +1000, NeilBrown wrote:
> > On Thu, 07 Apr 2022, Trond Myklebust wrote:
> > > On Thu, 2022-04-07 at 11:19 +1000, NeilBrown wrote:
> > > > On Thu, 07 Apr 2022, NeilBrown wrote:
> > > > > On Thu, 07 Apr 2022, Dave Chinner wrote:
> > > > > > On Wed, Apr 06, 2022 at 03:54:24PM -0400, J. Bruce Fields wrote:
> > > > > > > In the last couple days I've started getting hangs on xfstests
> > > > > > > generic/186 on upstream.  I also notice the test completes
> > > > > > > after 10+ hours (usually it takes about 5 minutes).  Sometimes
> > > > > > > this is accompanied by "nfs: RPC call returned error 12" on
> > > > > > > the client.
> > > > > > 
> > > > > > #define ENOMEM          12      /* Out of memory */
> > > > > > 
> > > > > > So either the client or the server is running out of memory
> > > > > > somewhere?
> > > > > 
> > > > > Probably the client.  There are a bunch of changes recently which
> > > > > add __GFP_NORETRY to memory allocations from PF_WQ_WORKERs
> > > > > because that can result in deadlocks when swapping over NFS.
> > > > > This means that kmalloc requests that previously never failed
> > > > > (because GFP_KERNEL never fails for kernel threads, I think) can
> > > > > now fail.  This has tickled one bug that I know of.  There are
> > > > > likely to be more.
> > > > > 
> > > > > The RPC code should simply retry these allocations after a short
> > > > > delay.  HZ/4 is the number that is used in a couple of places.
> > > > > Possibly there are more places that need to handle -ENOMEM with
> > > > > rpc_delay().
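
(For concreteness, that retry idiom is roughly the sketch below.
rpc_delay(), rpc_exit() and ->tk_action are the existing sunrpc
interfaces; the enclosing function and the retry state are hypothetical
stand-ins, not an actual call site.)

#include <linux/sunrpc/sched.h>

/* hypothetical stand-in for whichever state function hit the -ENOMEM */
static void example_retry_state(struct rpc_task *task);

static void example_handle_status(struct rpc_task *task, int status)
{
	if (status == -ENOMEM) {
		rpc_delay(task, HZ / 4);		/* back off ~250ms */
		task->tk_action = example_retry_state;	/* re-run the step */
		return;
	}
	rpc_exit(task, status);		/* propagate any other result */
}
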
> > > > 
> > > > I had a look through the various places where alloc can now fail.
> > > > 
> > > > I think xdr_alloc_bvec() in xprt_send_pagedata() is the most likely
> > > > cause of a problem here.  I don't think an -ENOMEM from there is
> > > > caught, so it could likely filter up to NFS and result in the
> > > > message you got.
> > > > 
> > > > I don't think we can easily handle failure there.  We need to stay
> > > > with GFP_KERNEL and rely on PF_MEMALLOC to make forward progress
> > > > for swap-over-NFS.
> > > > 
> > > > Bruce: can you change that one line back to GFP_KERNEL and see if
> > > > the problem goes away?
> > > > 
> > > 
> > > Can we please just move the call to xdr_alloc_bvec() out of
> > > xprt_send_pagedata()?  Move the client side allocation into
> > > xs_stream_prepare_request() and xs_udp_send_request(), then move the
> > > server side allocation into svc_udp_sendto().
> > > 
> > > That makes it possible to handle errors.
> > 
> > Like the below I guess.  Seems sensible, but I don't know the code well
> > enough to really review it.
> > 
> > > 
> > > > The other problem I found is that rpc_alloc_task() can now fail, but
> > > > rpc_new_task assumes that it never will.  If it does, then we get a
> > > > NULL deref.
> > > > 
> > > > I don't think rpc_new_task() can ever be called from the rpciod work
> > > > queue, so it is safe to just use a mempool with GFP_KERNEL like we
> > > > did before.
> > > > 
> > > No. We shouldn't ever use mempools with GFP_KERNEL.
> > 
> > Why not?  mempools with GFP_KERNEL make perfect sense outside of the
> > rpciod and nfsiod threads.
> 
> If you can afford to make it an infinite wait, there is __GFP_NOFAIL,
> so why waste the resources of an emergency pool? In my opinion,
> however, an infinite uninterruptible sleep is bad policy for almost all
> cases because someone will want to break out at some point.

"infinite" isn't a useful description.  The important question is "what
will allow the allocation to complete?".

For __GFP_NOFAIL there is no clear answer beyond "reduction of memory
pressure", and sometimes that is enough.
For a mempool we have a much more specific answer.  Memory becomes
available as previous requests complete.  rpc_task_mempool has a size of
8 so there can always be 8 requests in flight.  Waiting on the mempool
will wait at most until there are seven requests in flight, and then
will return a task.  This is a much better guarantee than for
__GFP_NOFAIL.
If you ever need an rpc task to relieve memory pressure, then
__GFP_NOFAIL could deadlock, while using the mempool won't.

If you are never going to block waiting on a mempool and would rather
fail instead, then there seems little point in having the mempool.

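To make that guarantee concrete, here is a minimal sketch using the
generic mempool API with hypothetical names (it is not the sunrpc code):
the pool reserves 8 objects up front, and mempool_alloc(..., GFP_KERNEL)
from a sleepable context waits for one of those reserved objects to be
freed rather than returning NULL.

#include <linux/errno.h>
#include <linux/mempool.h>
#include <linux/slab.h>

struct example_task {			/* stand-in for struct rpc_task */
	int id;
};

static struct kmem_cache *example_cachep;
static mempool_t *example_pool;

static int example_setup(void)
{
	example_cachep = KMEM_CACHE(example_task, 0);
	if (!example_cachep)
		return -ENOMEM;
	/* Reserve 8 objects so 8 requests can always be in flight. */
	example_pool = mempool_create_slab_pool(8, example_cachep);
	if (!example_pool) {
		kmem_cache_destroy(example_cachep);
		return -ENOMEM;
	}
	return 0;
}

static struct example_task *example_alloc_task(void)
{
	/* May sleep, but cannot fail: at worst it waits for a previous
	 * request to return its object via mempool_free(). */
	return mempool_alloc(example_pool, GFP_KERNEL);
}

static void example_free_task(struct example_task *t)
{
	mempool_free(t, example_pool);
}

As noted above, that blocking wait only makes sense outside the
rpciod/nfsiod threads themselves.
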
> 
> > 
> > > 
> > > Most, if not all, of the callers of rpc_run_task() are still capable
> > > of dealing with errors, and ditto for the callers of
> > > rpc_run_bc_task().
> > 
> > Yes, they can deal with errors.  But in many cases they do so by
> > passing the error up the call stack, so we could start getting ENOMEM
> > for system calls like stat().  I don't think that is a good idea.
> > 
> 
> stat() has always been capable of returning ENOMEM if, for instance,
> inode allocation fails. There are almost no calls in NFS (or most other
> filesystems for that matter) that can't fail somehow when memory starts
> to get really scarce.

Fair enough.  It is writes that are really important.

> 
> The bottom line is that we use ordinary GFP_KERNEL memory allocations
> where we can. The new code follows that rule, breaking it only in cases
> where the specific rules of rpciod/xprtiod/nfsiod make it impossible to
> wait forever in the memory manager.

It is not safe to use GFP_KERNEL for an allocation that is needed in
order to free memory - and that includes any allocation needed to write
out data from the page cache.

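A minimal sketch of that constraint, with hypothetical names: on a path
that the VM depends on to clean pages, the recent changes trade "wait
forever in the allocator" for "fail fast and let the caller defer",
which is what adding __GFP_NORETRY achieves - and why these -ENOMEMs now
need to be handled.

#include <linux/errno.h>
#include <linux/slab.h>

struct example_wb_req {			/* stand-in for a write-back request */
	void *payload;
	size_t len;
};

static int example_prepare_writeback(struct example_wb_req *req, size_t len)
{
	/* Plain GFP_KERNEL here could wait in reclaim, and reclaim may be
	 * waiting on this very write-out.  Fail fast instead and let the
	 * caller retry later (e.g. after rpc_delay()). */
	req->payload = kmalloc(len, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
	if (!req->payload)
		return -ENOMEM;
	req->len = len;
	return 0;
}
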
> 
> I am preparing a set of patches to address the issues that you've
> identified, plus a case in call_transmit_status/call_bc_transmit_status
> where we're not handling ENOMEM. There are also patches to fix up two
> cases in the NFS code itself where we're not currently handling errors
> from rpc_run_task.

I look forward to reviewing them.

Thanks,
NeilBrown
