Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()

From: Dave Chinner <david@fromorbit.com>
To: NeilBrown <neilb@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Theodore Ts'o <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.com>,
	linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
Date: Tue, 14 Sep 2021 16:05:19 +1000	[thread overview]
Message-ID: <20210914060519.GJ2361455@dread.disaster.area> (raw)
In-Reply-To: <163159005180.3992.2350725240228509854@noble.neil.brown.name>

On Tue, Sep 14, 2021 at 01:27:31PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Documentation commment in gfp.h discourages indefinite retry loops on
> > > ENOMEM and says of __GFP_NOFAIL that it
> > > 
> > >     is definitely preferable to use the flag rather than opencode
> > >     endless loop around allocator.
> > > 
> > > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> > > not given.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/xfs/kmem.c |   16 ++++------------
> > >  1 file changed, 4 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index 6f49bf39183c..f545f3633f88 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> > >  {
> > >  	int	retries = 0;
> > >  	gfp_t	lflags = kmem_flags_convert(flags);
> > > -	void	*ptr;
> > >  
> > >  	trace_kmem_alloc(size, flags, _RET_IP_);
> > >  
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & KM_MAYFAIL))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> > > -				current->comm, current->pid,
> > > -				(unsigned int)size, __func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & KM_MAYFAIL))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Which means we no longer get warnings about memory allocation
> > failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
> > in this loop. Hence we'll now get silent deadlocks through this code
> > instead of getting warnings that memory allocation is failing
> > repeatedly.
> 
> Yes, that is a problem.  Could we just clear __GFP_NOWARN when setting
> __GFP_NOFAIL?

Probably.

> Or is the 1-in-100 important? I think default warning is 1 every 10
> seconds.

1-in-100 is an arbitrary number to prevent spamming of logs unless
there is a real likelihood of there being a memory allocation
deadlock. We've typically only ever seen this when trying to do
high-order allocations (e.g. 64kB for xattr buffers) and failing
repeatedly in extreme memory pressure events. It's a canary that we
leave in the logs so that when a user reports problems we know that
they've been running under extended extreme low memory conditions
and can adjust the triage process accordingly.

So, we could remove __GFP_NOWARN, as long as the core allocator code
has sufficient rate limiting that it won't spam the logs due to
extended failure looping...

> > I also wonder about changing the backoff behaviour here (it's a 20ms
> > wait right now because there are not early wakeups) will affect the
> > behaviour, as __GFP_NOFAIL won't wait for that extra time between
> > allocation attempts....
> 
> The internal backoff is 100ms if there is much pending writeout, and
> there are 16 internal retries.  If there is not much pending writeout, I
> think it just loops with cond_resched().
> So adding 20ms can only be at all interesting when the only way to
> reclaim memory is something other than writeout.  I don't know how to
> think about that.

Any cache that uses a shrinker to reclaim (e.g. dentry, inodes, fs
metadata, etc due to recursive directory traversals) can cause
reclaim looping and priority escalation without there being any page
cache writeback or reclaim possible. Especially when you have
GFP_NOFS allocation context and all your memory is in VFS level
caches. At that point, direct reclaim cannot (and will not) make
forwards progress, so we still have to wait for some other
GFP_KERNEL context reclaim (e.g. kswapd) to make progress reclaiming
memory while we wait.

Fundamentally, the memory reclaim backoff code doesn't play well
with shrinkers. Patches from an old patchset which pushed lack of
shrinker progress back up into the vmscan level backoff algorithms
was something I was experimenting with a few years ago. e.g.

https://lore.kernel.org/linux-xfs/20191031234618.15403-16-david@fromorbit.com/
https://lore.kernel.org/linux-xfs/20191031234618.15403-17-david@fromorbit.com/

We didn't end up going this way to solve the XFS inode reclaim
problems - I ended up solving that entirely by pinning XFS buffer
cache memory and modifying the XFS inode shrinker - but it was this
patchset that first exposed the fact that congestion_wait() was no
longer functioning as intended. See the last few paragraphs of the
(long) cover letter for v1 of that patchset here:

https://lore.kernel.org/linux-xfs/20190801021752.4986-1-david@fromorbit.com/

So, yeah, I know full well that congestion_wait() is mostly just
an unconditional timeout these days...

> > And, of course, how did you test this? Sometimes we see
> > unpredicted behaviours as a result of "simple" changes like this
> > under low memory conditions...
> 
> I suspect this is close to untestable.  While I accept that there might
> be a scenario where the change might cause some macro effect, it would
> most likely be some interplay with some other subsystem struggling with
> memory.  Testing XFS by itself would be unlikely to find it.

Filesystem traversal workloads (e.g. chown -R) are the ones that
hammer memory allocation from GFP_NOFS context which creates memory
pressure that cannot be balanced by direct reclaim as direct reclaim
cannot reclaim filesystem caches in this situation. This is where I
would expect extra backoff on failing GFP_NOFS allocations to have
some effect...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com