Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

From: Dave Chinner <david@fromorbit.com>
To: NeilBrown <neilb@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Michal Hocko <mhocko@kernel.org>, Christoph Hellwig <hch@lst.de>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	LKML <linux-kernel@vger.kernel.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Jeff Layton <jlayton@kernel.org>, Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL
Date: Thu, 25 Nov 2021 10:45:11 +1100	[thread overview]
Message-ID: <20211124234511.GN418105@dread.disaster.area> (raw)
In-Reply-To: <163772381628.1891.9102201563412921921@noble.neil.brown.name>

On Wed, Nov 24, 2021 at 02:16:56PM +1100, NeilBrown wrote:
> On Wed, 24 Nov 2021, Andrew Morton wrote:
> > 
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops.  I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for".  But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing.  In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever.  Sigh.
> 
> One of the costs of not having GFP_NOFAIL (or similar) is lots of
> untested failure-path code.
> 
> When does an allocation that is allowed to retry and reclaim ever fail
> anyway? I think the answer is "only when it has been killed by the oom
> killer".  That of course cannot happen to kernel threads, so maybe
> kernel threads should never need GFP_NOFAIL??
> 
> I'm not sure the above is 100%, but I do think that is the sort of
> semantic that we want.  We want to know what kmalloc failure *means*.
> We also need well defined and documented strategies to handle it.
> mempools are one such strategy, but not always suitable.

mempools are not suitable for anything that uses demand paging to
hold an unbounded data set in memory before it can free anything.
This is basically the definition of memory demand in an XFS
transaction, and most transaction based filesystems have similar
behaviour.

> preallocating can also be useful but can be clumsy to implement.  Maybe
> we should support a process preallocating a bunch of pages which can
> only be used by the process - and are auto-freed when the process
> returns to user-space.  That might allow the "error paths" to be simple
> and early, and subsequent allocations that were GFP_USEPREALLOC would be
> safe.

We talked about this years ago at LSFMM (2013 or 2014, IIRC). The
problem is "how much do you preallocate when the worst case
requirement to guarantee forwards progress is at least tens of
megabytes". Considering that there might be thousands of these
contexts running concurrent at any given time and we might be
running through several million preallocation contexts a second,
suddenly preallocation is a great big CPU and memory pit.

Hence preallocation simply doesn't work when the scope to guarantee
forwards progress is in the order of megabytes (even tens of
megabytes) per "no fail" context scope.  In situations like this we
need -memory reservations-, not preallocation.

Have the mm guarantee that there is a certain amount of memory
available (even if it requires reclaim to make available) before we
start the operation that cannot tolerate memory allocation failure,
track the memory usage as it goes, warn if it overruns, and release
the unused part of the reservation when context completes.

[ This is what we already do in XFS for guaranteeing forwards
progress for writing modifications into the strictly bound journal
space. We reserve space up front and use tracking tickets to account
for space used across each transaction context. This avoids
overcommit and all the deadlock and corruption problems that come
from running out of physical log space to write all the changes
we've made into the log. We could simply add memory reservations and
tracking structures to the transaction context and we've pretty much
got everything we need on the XFS side covered... ]

> i.e. we need a plan for how to rework all those no-fail call-sites.

Even if we do make them all the filesystems handle ENOMEM errors
gracefully and pass that back up to userspace, how are applications
going to react to random ENOMEM errors when doing data writes or
file creation or any other operation that accesses filesystems?

Given the way applications handle transient errors (i.e. they
don't!) propagating ENOMEM back up to userspace will result in
applications randomly failing under memory pressure.  That's a much
worse situation than having to manage the _extremely rare_ issues
that occur because of __GFP_NOFAIL usage in the kernel.

Let's keep that in mind here - __GFP_NOFAIL usage is not causing
system failures in the real world, whilst propagating ENOMEM back
out to userspace is potentially very harmful to users....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com