From: NeilBrown <neilb@suse.de>
To: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.com>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Junxiao Bi <junxiao.bi@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Devel FS Linux <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v2 1/2] SUNRPC: Fix memory reclaim deadlocks in rpciod
Date: Thu, 11 Sep 2014 20:53:05 +1000	[thread overview]
Message-ID: <20140911205305.578bc017@notabene.brown> (raw)
In-Reply-To: <20140911085046.GC22042@dhcp22.suse.cz>


On Thu, 11 Sep 2014 10:50:47 +0200 Michal Hocko <mhocko@suse.cz> wrote:

> On Thu 11-09-14 09:57:43, Neil Brown wrote:
> > On Wed, 10 Sep 2014 15:48:43 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Tue 09-09-14 12:33:46, Neil Brown wrote:
> > > > On Thu, 4 Sep 2014 15:54:27 +0200 Michal Hocko <mhocko@suse.cz> wrote:
> > > > 
> > > > > [Sorry for jumping in so late - I've been busy last days]
> > > > > 
> > > > > On Wed 27-08-14 16:36:44, Mel Gorman wrote:
> > > > > > On Tue, Aug 26, 2014 at 08:00:20PM -0400, Trond Myklebust wrote:
> > > > > > > On Tue, Aug 26, 2014 at 7:51 PM, Trond Myklebust
> > > > > > > <trond.myklebust@primarydata.com> wrote:
> > > > > > > > On Tue, Aug 26, 2014 at 7:19 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > [...]
> > > > > > > >> wait_on_page_writeback() is a hammer, and we need to be better about
> > > > > > > >> this once we have per-memcg dirty writeback and throttling, but I
> > > > > > > >> think that really misses the point.  Even if memcg writeback waiting
> > > > > > > >> were smarter, any length of time spent waiting for yourself to make
> > > > > > > >> progress is absurd.  We just shouldn't be solving deadlock scenarios
> > > > > > > >> through arbitrary timeouts on one side.  If you can't wait for IO to
> > > > > > > >> finish, you shouldn't be passing __GFP_IO.
> > > > > 
> > > > > Exactly!
> > > > 
> > > > This is overly simplistic.
> > > > The code that cannot wait may be further up the call chain and not in a
> > > > position to avoid passing __GFP_IO.
> > > > In many cases it isn't that "you can't wait for IO" in general, but that you
> > > > cannot wait for one specific IO request.
> > > 
> > > Could you be more specific, please? Why would a particular IO make any
> > > difference to general IO from the same path? My understanding was that
> > > once the page is marked PG_writeback then it is about to be written to
> > > its destination and if there is any need for memory allocation it should
> > > better not allow IO from reclaim.
> > 
> > The more complex the filesystem, the harder it is to "not allow IO from
> > reclaim".
> > For NFS (which started this thread) there might be a need to open a new
> > connection - so allocating in the networking code would all need to be
> > careful.
> 
> memalloc_noio_{save,restore} might help in that regard.

It might.  It is a bit of a heavy stick though.
Especially as "nofs" is what is really wanted (I think).

> 
> > And it isn't impossible that a 'gss' credential needs to be re-negotiated,
> > and that might even need user-space interaction (not sure of details).
> 
> OK, so if I understand you correctly all those allocations might happen
> _after_ the page has been marked PG_writeback. This would be bad indeed
> if such a path could appear in the memcg limit reclaim. The outcome of
> the previous discussion was that this doesn't happen in practice for
> nfs code, though, because the real flushing doesn't happen from a user
> context. The issue was reported for an old kernel where the flushing
> happened from the user context. It would be a huge problem to have a
> flusher within a restricted environment not only because of this path.
> 
> > What you say certainly used to be the case, and very often still is.  But it
> > doesn't really scale with complexity of filesystems.
> > 
> > I don't think there is (yet) any need to optimise for allocations that don't
> > disallow IO happening in the writeout path.  But I do think waiting
> > indefinitely for a particular IO is unjustifiable.
> 
> Well, as Johannes already pointed out, the right way to fix memcg
> reclaim is to implement proper memcg-aware dirty page throttling and
> flushing. This is a song of the distant future, I am afraid. Until then we
> have to live with workarounds. I would be happy to make this one more
> robust but timeout based solutions just sound too fragile and triggering
> OOM is a big risk.
> 
> Maybe we can disable waiting if current->flags & PF_LESS_THROTTLE. I
> would even be tempted to WARN_ON(current->flags & PF_LESS_THROTTLE) in
> that path to catch a potential misconfiguration when the flusher is
> part of a restricted environment. The only real user of the flag is nfsd,
> though, and it runs from a kernel thread, so this wouldn't help much to
> catch potentially buggy code. So I am not really sure how much of an
> improvement this would be.
> 

I think it would be inappropriate to use PF_LESS_THROTTLE.  That is really
about throttling the dirtying of pages, not their writeback.

As has been said, there isn't really a bug that needs fixing at present, so
delving too deeply into designing a solution is probably pointless.

Using global flags is sometimes suitable, but it doesn't help when you are
waiting for memory allocation to happen in another process.
Using timeouts is sometimes suitable, but only if the backup plan isn't too
drastic.

My feeling is that the "ideal" would be to wait until:
  - this thread can make forward progress, or
  - no thread (in this memcg?) can make forward progress
In the first case we succeed.  In the second we take the most gentle backup
solution (e.g. use the last dregs of memory, or trigger OOM).
Detecting when no other thread can make forward progress is probably not
trivial, but it doesn't need to be cheap.

Hopefully when a real issue arises we'll be able to figure something out.

Thanks,
NeilBrown



  parent reply	other threads:[~2014-09-11 10:53 UTC|newest]

Thread overview: 48+ messages
2014-08-22  7:55 rpciod deadlock issue Junxiao Bi
2014-08-22 22:49   ` [PATCH v2 1/2] SUNRPC: Fix memory reclaim deadlocks in rpciod Trond Myklebust
2014-08-22 22:49     ` [PATCH v2 2/2] NFS: Ensure that rpciod does not trigger reclaim writebacks Trond Myklebust
2014-08-25  5:34       ` [PATCH v2 1/2] SUNRPC: Fix memory reclaim deadlocks in rpciod Junxiao Bi
2014-08-25  6:48       ` NeilBrown
2014-08-26  5:43           ` Junxiao Bi
2014-08-26  6:21             ` NeilBrown
2014-08-26  6:49               ` Junxiao Bi
2014-08-26  7:04                 ` NeilBrown
2014-08-26  7:23                     ` Junxiao Bi
2014-08-26 10:53           ` Mel Gorman
2014-08-26 12:58             ` Trond Myklebust
2014-08-26 13:26               ` Mel Gorman
2014-08-26 23:19                   ` Johannes Weiner
2014-08-26 23:51                       ` Trond Myklebust
2014-08-27  0:00                           ` Trond Myklebust
2014-08-27 15:36                             ` Mel Gorman
2014-08-27 16:15                                 ` Trond Myklebust
2014-08-28  8:30                                   ` Mel Gorman
2014-08-28  8:49                                       ` Junxiao Bi
2014-08-28  9:25                                         ` Mel Gorman
2014-09-04 13:54                                 ` Michal Hocko
2014-09-09  2:33                                   ` NeilBrown
2014-09-10 13:48                                     ` Michal Hocko
2014-09-10 23:57                                         ` NeilBrown
2014-09-11  8:50                                             ` Michal Hocko
2014-09-11 10:53                                                 ` NeilBrown [this message]
2014-08-27  1:43                       ` NeilBrown
2014-08-25  6:05   ` rpciod deadlock issue NeilBrown
2014-08-25  6:15       ` NeilBrown
