From: "NeilBrown" <neilb@suse.de>
To: "Trond Myklebust" <trondmy@hammerspace.com>
Cc: "bfields@fieldses.org" <bfields@fieldses.org>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>
Subject: Re: sporadic hangs on generic/186
Date: Fri, 08 Apr 2022 12:46:08 +1000
Message-ID: <164938596863.10985.998515507989861871@noble.neil.brown.name>
In-Reply-To: <43aace26d3a09f868f732b2ad94ca2dbf90f50bd.camel@hammerspace.com>

On Thu, 07 Apr 2022, Trond Myklebust wrote:
> On Thu, 2022-04-07 at 14:11 +1000, NeilBrown wrote:
> > On Thu, 07 Apr 2022, Trond Myklebust wrote:
> > > On Thu, 2022-04-07 at 11:19 +1000, NeilBrown wrote:
> > > > On Thu, 07 Apr 2022, NeilBrown wrote:
> > > > > On Thu, 07 Apr 2022, Dave Chinner wrote:
> > > > > > On Wed, Apr 06, 2022 at 03:54:24PM -0400, J. Bruce Fields wrote:
> > > > > > > In the last couple days I've started getting hangs on xfstests
> > > > > > > generic/186 on upstream.  I also notice the test completes
> > > > > > > after 10+ hours (usually it takes about 5 minutes).  Sometimes
> > > > > > > this is accompanied by "nfs: RPC call returned error 12" on
> > > > > > > the client.
> > > > > > 
> > > > > > #define ENOMEM          12      /* Out of memory */
> > > > > > 
> > > > > > So either the client or the server is running out of memory
> > > > > > somewhere?
> > > > > 
> > > > > Probably the client.  There are a bunch of changes recently which
> > > > > add __GFP_NORETRY to memory allocations from PF_WQ_WORKERs
> > > > > because that can result in deadlocks when swapping over NFS.
> > > > > This means that kmalloc requests that previously never failed
> > > > > (because GFP_KERNEL never fails for kernel threads, I think) can
> > > > > now fail.  This has tickled one bug that I know of.  There are
> > > > > likely to be more.
> > > > > 
> > > > > The RPC code should simply retry these allocations after a short
> > > > > delay.  HZ/4 is the number that is used in a couple of places.
> > > > > Possibly there are more places that need to handle -ENOMEM with
> > > > > rpc_delay().
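
(For concreteness, that retry idiom is roughly the sketch below.
rpc_delay(), rpc_exit() and ->tk_action are the existing sunrpc
interfaces; the enclosing function and the retry state are hypothetical
stand-ins, not an actual call site.)

#include <linux/sunrpc/sched.h>

/* hypothetical stand-in for whichever state function hit the -ENOMEM */
static void example_retry_state(struct rpc_task *task);

static void example_handle_status(struct rpc_task *task, int status)
{
	if (status == -ENOMEM) {
		rpc_delay(task, HZ / 4);		/* back off ~250ms */
		task->tk_action = example_retry_state;	/* re-run the step */
		return;
	}
	rpc_exit(task, status);		/* propagate any other result */
}
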
> > > > 
> > > > I had a look through the various places where alloc can now fail.
> > > > 
> > > > I think xdr_alloc_bvec() in xprt_send_pagedata() is the most likely
> > > > cause of a problem here.  I don't think an -ENOMEM from there is
> > > > caught, so it could likely filter up to NFS and result in the
> > > > message you got.
> > > > 
> > > > I don't think we can easily handle failure there.  We need to stay
> > > > with GFP_KERNEL and rely on PF_MEMALLOC to make forward progress
> > > > for swap-over-NFS.
> > > > 
> > > > Bruce: can you change that one line back to GFP_KERNEL and see if
> > > > the problem goes away?
> > > > 
> > > 
> > > Can we please just move the call to xdr_alloc_bvec() out of
> > > xprt_send_pagedata()?  Move the client side allocation into
> > > xs_stream_prepare_request() and xs_udp_send_request(), then move the
> > > server side allocation into svc_udp_sendto().
> > > 
> > > That makes it possible to handle errors.
> > 
> > Like the below I guess.  Seems sensible, but I don't know the code well
> > enough to really review it.
> > 
> > > 
> > > > The other problem I found is that rpc_alloc_task() can now fail, but
> > > > rpc_new_task assumes that it never will.  If it does, then we get a
> > > > NULL deref.
> > > > 
> > > > I don't think rpc_new_task() can ever be called from the rpciod work
> > > > queue, so it is safe to just use a mempool with GFP_KERNEL like we
> > > > did before.
> > > > 
> > > No. We shouldn't ever use mempools with GFP_KERNEL.
> > 
> > Why not?  mempools with GFP_KERNEL make perfect sense outside of the
> > rpciod and nfsiod threads.
> 
> If you can afford to make it an infinite wait, there is __GFP_NOFAIL,
> so why waste the resources of an emergency pool? In my opinion,
> however, an infinite uninterruptible sleep is bad policy for almost all
> cases because someone will want to break out at some point.

"infinite" isn't a useful description.  The important question is "what
will allow the allocation to complete?".

For __GFP_NOFAIL there is no clear answer beyond "reduction of memory
pressure", and sometimes that is enough.
For a mempool we have a much more specific answer.  Memory becomes
available as previous requests complete.  rpc_task_mempool has a size of
8 so there can always be 8 requests in flight.  Waiting on the mempool
will wait at most until there are seven requests in flight, and then
will return a task.  This is a much better guarantee than for
__GFP_NOFAIL.
If you ever need an rpc task to relieve memory pressure, then
__GFP_NOFAIL could deadlock, while using the mempool won't.

If you are never going to block waiting on a mempool and would rather
fail instead, then there seems little point in having the mempool.

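To make that guarantee concrete, here is a minimal sketch using the
generic mempool API with hypothetical names (it is not the sunrpc code):
the pool reserves 8 objects up front, and mempool_alloc(..., GFP_KERNEL)
from a sleepable context waits for one of those reserved objects to be
freed rather than returning NULL.

#include <linux/errno.h>
#include <linux/mempool.h>
#include <linux/slab.h>

struct example_task {			/* stand-in for struct rpc_task */
	int id;
};

static struct kmem_cache *example_cachep;
static mempool_t *example_pool;

static int example_setup(void)
{
	example_cachep = KMEM_CACHE(example_task, 0);
	if (!example_cachep)
		return -ENOMEM;
	/* Reserve 8 objects so 8 requests can always be in flight. */
	example_pool = mempool_create_slab_pool(8, example_cachep);
	if (!example_pool) {
		kmem_cache_destroy(example_cachep);
		return -ENOMEM;
	}
	return 0;
}

static struct example_task *example_alloc_task(void)
{
	/* May sleep, but cannot fail: at worst it waits for a previous
	 * request to return its object via mempool_free(). */
	return mempool_alloc(example_pool, GFP_KERNEL);
}

static void example_free_task(struct example_task *t)
{
	mempool_free(t, example_pool);
}

As noted above, that blocking wait only makes sense outside the
rpciod/nfsiod threads themselves.
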
> 
> > 
> > > 
> > > Most, if not all, of the callers of rpc_run_task() are still capable
> > > of dealing with errors, and ditto for the callers of
> > > rpc_run_bc_task().
> > 
> > Yes, they can deal with errors.  But in many cases they do so by
> > passing the error up the call stack, so we could start getting ENOMEM
> > for system calls like stat().  I don't think that is a good idea.
> > 
> 
> stat() has always been capable of returning ENOMEM if, for instance,
> inode allocation fails. There are almost no calls in NFS (or most other
> filesystems for that matter) that can't fail somehow when memory starts
> to get really scarce.

Fair enough.  It is writes that are really important.

> 
> The bottom line is that we use ordinary GFP_KERNEL memory allocations
> where we can. The new code follows that rule, breaking it only in cases
> where the specific rules of rpciod/xprtiod/nfsiod make it impossible to
> wait forever in the memory manager.

It is not safe to use GFP_KERNEL for an allocation that is needed in
order to free memory - and that includes any allocation needed to write
out data from the page cache.

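A minimal sketch of that constraint, with hypothetical names: on a path
that the VM depends on to clean pages, the recent changes trade "wait
forever in the allocator" for "fail fast and let the caller defer",
which is what adding __GFP_NORETRY achieves - and why these -ENOMEMs now
need to be handled.

#include <linux/errno.h>
#include <linux/slab.h>

struct example_wb_req {			/* stand-in for a write-back request */
	void *payload;
	size_t len;
};

static int example_prepare_writeback(struct example_wb_req *req, size_t len)
{
	/* Plain GFP_KERNEL here could wait in reclaim, and reclaim may be
	 * waiting on this very write-out.  Fail fast instead and let the
	 * caller retry later (e.g. after rpc_delay()). */
	req->payload = kmalloc(len, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
	if (!req->payload)
		return -ENOMEM;
	req->len = len;
	return 0;
}
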
> 
> I am preparing a set of patches to address the issues that you've
> identified, plus a case in call_transmit_status/call_bc_transmit_status
> where we're not handling ENOMEM. There are also patches to fix up two
> cases in the NFS code itself where we're not currently handling errors
> from rpc_run_task.

I look forward to reviewing them.

Thanks,
NeilBrown
