Re: [PATCH 1/2 v2] mm: Allow small allocations to fail

From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	akpm@linux-foundation.org, david@fromorbit.com, mgorman@suse.de,
	riel@redhat.com, fengguang.wu@intel.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
Date: Mon, 16 Mar 2015 17:11:46 -0400	[thread overview]
Message-ID: <20150316211146.GA15456@phnom.home.cmpxchg.org> (raw)
In-Reply-To: <20150316074607.GA24885@dhcp22.suse.cz>

On Mon, Mar 16, 2015 at 08:46:07AM +0100, Michal Hocko wrote:
> @@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
>  
>  ==============================================================
>  
> +retry_allocation_attempts
> +
> +Page allocator tries hard to not fail small allocations requests.
> +Currently it retries indefinitely for small allocations requests (<= 32kB).
> +This works mostly fine but under an extreme low memory conditions system
> +might end up in deadlock situations because the looping allocation
> +request might block further progress for OOM killer victims.
> +
> +Even though this hasn't turned out to be a huge problem for many years the
> +long term plan is to move away from this default behavior but as this is
> +a long established behavior we cannot change it immediately.
> +
> +This knob should help in the transition. It tells how many times should
> +allocator retry when the system is OOM before the allocation fails.
> +The default value (ULONG_MAX) preserves the old behavior. This is a safe
> +default for production systems which cannot afford any unexpected
> +downtimes. More experimental systems might set it to a small number
> +(>=1), the higher the value the less probable would be allocation
> +failures when OOM is transient and could be resolved without the
> +particular allocation to fail.

This is a negotiation between the page allocator and the various
requirements of its in-kernel users.  If *we* can't make an educated
guess with the entire codebase available, how the heck can we expect
userspace to?

And just assuming for a second that they actually do a better job than
us, are they going to send us a report of their workload and machine
specs and the value that worked for them?  Of course not, why would
you think they'd suddenly send anything but regression reports?

And we wouldn't get regression reports without changing the default,
because really, what is the incentive to mess with that knob?  Making
a lockup you probably never encountered less likely to trigger, while
adding failures of unknown quantity or quality into the system?

This is truly insane.  You're taking one magic factor out of a complex
kernel mechanism and dump it on userspace, which has neither reason
nor context to meaningfully change the default.  We'd never leave that
state of transition.  Only when machines do lock up in the wild, at
least we can tell them they should have set this knob to "like, 50?"

If we want to address this problem, we are the ones that have to make
the call.  Pick a value based on a reasonable model, make it the
default, then deal with the fallout and update our assumptions.

Once that is done, whether we want to provide a boolean failsafe to
revert this in the field is another question.

A sysctl certainly doesn't sound appropriate to me because this is not
a tunable that we expect people to set according to their usecase.  We
expect our model to work for *everybody*.  A boot flag would be
marginally better but it still reeks too much of tunable.

Maybe CONFIG_FAILABLE_SMALL_ALLOCS.  Maybe something more euphemistic.
But I honestly can't think of anything that wouldn't scream "horrible
leak of implementation details."  The user just shouldn't ever care.

Given that there are usually several stages of various testing between
when a commit gets merged upstream and when it finally makes it into a
critical production system, maybe we don't need to provide userspace
control over this at all?

So what value do we choose?

Once we kick the OOM killer we should give the victim some time to
exit and then try the allocation again.  Looping just ONCE after that
means we scan all the LRU pages in the system a second time and invoke
the shrinkers another twelve times, with ratios approaching 1.  If the
OOM killer doesn't yield an allocatable page after this, I see very
little point in going on.  After all, we expect all our callers to
handle errors.

So why not just pass an "oomed" bool to should_alloc_retry() and bail
on small allocations at that point?  Put it upstream and deal with the
fallout long before this hits critical infrastructure?  By presumably
fixing up caller error handling and GFP flags?