linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: NeilBrown <neilb@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Theodore Ts'o <tytso@mit.edu>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.com>,
	linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc()
Date: Tue, 14 Sep 2021 16:05:19 +1000	[thread overview]
Message-ID: <20210914060519.GJ2361455@dread.disaster.area> (raw)
In-Reply-To: <163159005180.3992.2350725240228509854@noble.neil.brown.name>

On Tue, Sep 14, 2021 at 01:27:31PM +1000, NeilBrown wrote:
> On Tue, 14 Sep 2021, Dave Chinner wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Documentation commment in gfp.h discourages indefinite retry loops on
> > > ENOMEM and says of __GFP_NOFAIL that it
> > > 
> > >     is definitely preferable to use the flag rather than opencode
> > >     endless loop around allocator.
> > > 
> > > So remove the loop, instead specifying __GFP_NOFAIL if KM_MAYFAIL was
> > > not given.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/xfs/kmem.c |   16 ++++------------
> > >  1 file changed, 4 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > > index 6f49bf39183c..f545f3633f88 100644
> > > --- a/fs/xfs/kmem.c
> > > +++ b/fs/xfs/kmem.c
> > > @@ -13,19 +13,11 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> > >  {
> > >  	int	retries = 0;
> > >  	gfp_t	lflags = kmem_flags_convert(flags);
> > > -	void	*ptr;
> > >  
> > >  	trace_kmem_alloc(size, flags, _RET_IP_);
> > >  
> > > -	do {
> > > -		ptr = kmalloc(size, lflags);
> > > -		if (ptr || (flags & KM_MAYFAIL))
> > > -			return ptr;
> > > -		if (!(++retries % 100))
> > > -			xfs_err(NULL,
> > > -	"%s(%u) possible memory allocation deadlock size %u in %s (mode:0x%x)",
> > > -				current->comm, current->pid,
> > > -				(unsigned int)size, __func__, lflags);
> > > -		congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > -	} while (1);
> > > +	if (!(flags & KM_MAYFAIL))
> > > +		lflags |= __GFP_NOFAIL;
> > > +
> > > +	return kmalloc(size, lflags);
> > >  }
> > 
> > Which means we no longer get warnings about memory allocation
> > failing - kmem_flags_convert() sets __GFP_NOWARN for all allocations
> > in this loop. Hence we'll now get silent deadlocks through this code
> > instead of getting warnings that memory allocation is failing
> > repeatedly.
> 
> Yes, that is a problem.  Could we just clear __GFP_NOWARN when setting
> __GFP_NOFAIL?

Probably.

> Or is the 1-in-100 important? I think default warning is 1 every 10
> seconds.

1-in-100 is an arbitrary number to prevent spamming of logs unless
there is a real likelihood of there being a memory allocation
deadlock. We've typically only ever seen this when trying to do
high-order allocations (e.g. 64kB for xattr buffers) and failing
repeatedly in extreme memory pressure events. It's a canary that we
leave in the logs so that when a user reports problems we know that
they've been running under extended extreme low memory conditions
and can adjust the triage process accordingly.

So, we could remove __GFP_NOWARN, as long as the core allocator code
has sufficient rate limiting that it won't spam the logs due to
extended failure looping...

> > I also wonder about changing the backoff behaviour here (it's a 20ms
> > wait right now because there are not early wakeups) will affect the
> > behaviour, as __GFP_NOFAIL won't wait for that extra time between
> > allocation attempts....
> 
> The internal backoff is 100ms if there is much pending writeout, and
> there are 16 internal retries.  If there is not much pending writeout, I
> think it just loops with cond_resched().
> So adding 20ms can only be at all interesting when the only way to
> reclaim memory is something other than writeout.  I don't know how to
> think about that.

Any cache that uses a shrinker to reclaim (e.g. dentry, inodes, fs
metadata, etc due to recursive directory traversals) can cause
reclaim looping and priority escalation without there being any page
cache writeback or reclaim possible. Especially when you have
GFP_NOFS allocation context and all your memory is in VFS level
caches. At that point, direct reclaim cannot (and will not) make
forwards progress, so we still have to wait for some other
GFP_KERNEL context reclaim (e.g. kswapd) to make progress reclaiming
memory while we wait.

Fundamentally, the memory reclaim backoff code doesn't play well
with shrinkers. Patches from an old patchset which pushed lack of
shrinker progress back up into the vmscan level backoff algorithms
was something I was experimenting with a few years ago. e.g.

https://lore.kernel.org/linux-xfs/20191031234618.15403-16-david@fromorbit.com/
https://lore.kernel.org/linux-xfs/20191031234618.15403-17-david@fromorbit.com/

We didn't end up going this way to solve the XFS inode reclaim
problems - I ended up solving that entirely by pinning XFS buffer
cache memory and modifying the XFS inode shrinker - but it was this
patchset that first exposed the fact that congestion_wait() was no
longer functioning as intended. See the last few paragraphs of the
(long) cover letter for v1 of that patchset here:

https://lore.kernel.org/linux-xfs/20190801021752.4986-1-david@fromorbit.com/

So, yeah, I know full well that congestion_wait() is mostly just
an unconditional timeout these days...

> > And, of course, how did you test this? Sometimes we see
> > unpredicted behaviours as a result of "simple" changes like this
> > under low memory conditions...
> 
> I suspect this is close to untestable.  While I accept that there might
> be a scenario where the change might cause some macro effect, it would
> most likely be some interplay with some other subsystem struggling with
> memory.  Testing XFS by itself would be unlikely to find it.

Filesystem traversal workloads (e.g. chown -R) are the ones that
hammer memory allocation from GFP_NOFS context which creates memory
pressure that cannot be balanced by direct reclaim as direct reclaim
cannot reclaim filesystem caches in this situation. This is where I
would expect extra backoff on failing GFP_NOFS allocations to have
some effect...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2021-09-14  6:05 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-14  0:13 [PATCH 0/6] congestion_wait() and GFP_NOFAIL NeilBrown
2021-09-14  0:13 ` [PATCH 2/6] MM: annotate congestion_wait() and wait_iff_congested() as ineffective NeilBrown
2021-09-15 11:56   ` Michal Hocko
2021-09-16 22:13     ` NeilBrown
2021-09-14  0:13 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
2021-09-14  1:31   ` Dave Chinner
2021-09-14  3:27     ` NeilBrown
2021-09-14  6:05       ` Dave Chinner [this message]
2021-09-14  0:13 ` [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops NeilBrown
2021-09-14 16:34   ` Mel Gorman
2021-09-14 21:48     ` NeilBrown
2021-09-15 12:06       ` Michal Hocko
2021-09-15 22:35         ` NeilBrown
2021-09-16  0:37           ` Dave Chinner
2021-09-16  6:52           ` Michal Hocko
2021-09-14 23:55     ` Dave Chinner
2021-09-15  8:59       ` Mel Gorman
2021-09-15 12:20         ` Michal Hocko
2021-09-15 14:35         ` Mel Gorman
2021-09-15 22:38           ` Dave Chinner
2021-09-16  9:00             ` Mel Gorman
2021-09-15  0:28   ` Theodore Ts'o
2021-09-15  5:25     ` NeilBrown
2021-09-15 17:02       ` Theodore Ts'o
2021-09-14  0:13 ` [PATCH 1/6] MM: improve documentation for __GFP_NOFAIL NeilBrown
2021-09-15 11:51   ` Michal Hocko
2021-09-14  0:13 ` [PATCH 6/6] XFS: remove congestion_wait() loop from xfs_buf_alloc_pages() NeilBrown
2021-09-14  2:08   ` Dave Chinner
2021-09-14  2:35     ` NeilBrown
2021-09-14  5:33       ` Dave Chinner
2021-09-14 16:45       ` Mel Gorman
2021-09-14 21:13         ` NeilBrown
2021-09-14  0:13 ` [PATCH 4/6] EXT4: remove congestion_wait from ext4_bio_write_page, and simplify NeilBrown
2021-09-17  2:56 [PATCH 0/6 v2] congestion_wait() and GFP_NOFAIL NeilBrown
2021-09-17  2:56 ` [PATCH 5/6] XFS: remove congestion_wait() loop from kmem_alloc() NeilBrown
2021-09-17 21:45   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210914060519.GJ2361455@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=akpm@linux-foundation.org \
    --cc=djwong@kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mgorman@suse.com \
    --cc=neilb@suse.de \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).