All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	akpm@linux-foundation.org, david@fromorbit.com, mgorman@suse.de,
	riel@redhat.com, fengguang.wu@intel.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
Date: Mon, 16 Mar 2015 17:11:46 -0400	[thread overview]
Message-ID: <20150316211146.GA15456@phnom.home.cmpxchg.org> (raw)
In-Reply-To: <20150316074607.GA24885@dhcp22.suse.cz>

On Mon, Mar 16, 2015 at 08:46:07AM +0100, Michal Hocko wrote:
> @@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
>  
>  ==============================================================
>  
> +retry_allocation_attempts
> +
> +Page allocator tries hard to not fail small allocations requests.
> +Currently it retries indefinitely for small allocations requests (<= 32kB).
> +This works mostly fine but under an extreme low memory conditions system
> +might end up in deadlock situations because the looping allocation
> +request might block further progress for OOM killer victims.
> +
> +Even though this hasn't turned out to be a huge problem for many years the
> +long term plan is to move away from this default behavior but as this is
> +a long established behavior we cannot change it immediately.
> +
> +This knob should help in the transition. It tells how many times should
> +allocator retry when the system is OOM before the allocation fails.
> +The default value (ULONG_MAX) preserves the old behavior. This is a safe
> +default for production systems which cannot afford any unexpected
> +downtimes. More experimental systems might set it to a small number
> +(>=1), the higher the value the less probable would be allocation
> +failures when OOM is transient and could be resolved without the
> +particular allocation to fail.

This is a negotiation between the page allocator and the various
requirements of its in-kernel users.  If *we* can't make an educated
guess with the entire codebase available, how the heck can we expect
userspace to?

And just assuming for a second that they actually do a better job than
us, are they going to send us a report of their workload and machine
specs and the value that worked for them?  Of course not, why would
you think they'd suddenly send anything but regression reports?

And we wouldn't get regression reports without changing the default,
because really, what is the incentive to mess with that knob?  Making
a lockup you probably never encountered less likely to trigger, while
adding failures of unknown quantity or quality into the system?

This is truly insane.  You're taking one magic factor out of a complex
kernel mechanism and dump it on userspace, which has neither reason
nor context to meaningfully change the default.  We'd never leave that
state of transition.  Only when machines do lock up in the wild, at
least we can tell them they should have set this knob to "like, 50?"

If we want to address this problem, we are the ones that have to make
the call.  Pick a value based on a reasonable model, make it the
default, then deal with the fallout and update our assumptions.

Once that is done, whether we want to provide a boolean failsafe to
revert this in the field is another question.

A sysctl certainly doesn't sound appropriate to me because this is not
a tunable that we expect people to set according to their usecase.  We
expect our model to work for *everybody*.  A boot flag would be
marginally better but it still reeks too much of tunable.

Maybe CONFIG_FAILABLE_SMALL_ALLOCS.  Maybe something more euphemistic.
But I honestly can't think of anything that wouldn't scream "horrible
leak of implementation details."  The user just shouldn't ever care.

Given that there are usually several stages of various testing between
when a commit gets merged upstream and when it finally makes it into a
critical production system, maybe we don't need to provide userspace
control over this at all?

So what value do we choose?

Once we kick the OOM killer we should give the victim some time to
exit and then try the allocation again.  Looping just ONCE after that
means we scan all the LRU pages in the system a second time and invoke
the shrinkers another twelve times, with ratios approaching 1.  If the
OOM killer doesn't yield an allocatable page after this, I see very
little point in going on.  After all, we expect all our callers to
handle errors.

So why not just pass an "oomed" bool to should_alloc_retry() and bail
on small allocations at that point?  Put it upstream and deal with the
fallout long before this hits critical infrastructure?  By presumably
fixing up caller error handling and GFP flags?

WARNING: multiple messages have this Message-ID (diff)
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	akpm@linux-foundation.org, david@fromorbit.com, mgorman@suse.de,
	riel@redhat.com, fengguang.wu@intel.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2 v2] mm: Allow small allocations to fail
Date: Mon, 16 Mar 2015 17:11:46 -0400	[thread overview]
Message-ID: <20150316211146.GA15456@phnom.home.cmpxchg.org> (raw)
In-Reply-To: <20150316074607.GA24885@dhcp22.suse.cz>

On Mon, Mar 16, 2015 at 08:46:07AM +0100, Michal Hocko wrote:
> @@ -707,6 +708,29 @@ sysctl, it will revert to this default behavior.
>  
>  ==============================================================
>  
> +retry_allocation_attempts
> +
> +Page allocator tries hard to not fail small allocations requests.
> +Currently it retries indefinitely for small allocations requests (<= 32kB).
> +This works mostly fine but under an extreme low memory conditions system
> +might end up in deadlock situations because the looping allocation
> +request might block further progress for OOM killer victims.
> +
> +Even though this hasn't turned out to be a huge problem for many years the
> +long term plan is to move away from this default behavior but as this is
> +a long established behavior we cannot change it immediately.
> +
> +This knob should help in the transition. It tells how many times should
> +allocator retry when the system is OOM before the allocation fails.
> +The default value (ULONG_MAX) preserves the old behavior. This is a safe
> +default for production systems which cannot afford any unexpected
> +downtimes. More experimental systems might set it to a small number
> +(>=1), the higher the value the less probable would be allocation
> +failures when OOM is transient and could be resolved without the
> +particular allocation to fail.

This is a negotiation between the page allocator and the various
requirements of its in-kernel users.  If *we* can't make an educated
guess with the entire codebase available, how the heck can we expect
userspace to?

And just assuming for a second that they actually do a better job than
us, are they going to send us a report of their workload and machine
specs and the value that worked for them?  Of course not, why would
you think they'd suddenly send anything but regression reports?

And we wouldn't get regression reports without changing the default,
because really, what is the incentive to mess with that knob?  Making
a lockup you probably never encountered less likely to trigger, while
adding failures of unknown quantity or quality into the system?

This is truly insane.  You're taking one magic factor out of a complex
kernel mechanism and dump it on userspace, which has neither reason
nor context to meaningfully change the default.  We'd never leave that
state of transition.  Only when machines do lock up in the wild, at
least we can tell them they should have set this knob to "like, 50?"

If we want to address this problem, we are the ones that have to make
the call.  Pick a value based on a reasonable model, make it the
default, then deal with the fallout and update our assumptions.

Once that is done, whether we want to provide a boolean failsafe to
revert this in the field is another question.

A sysctl certainly doesn't sound appropriate to me because this is not
a tunable that we expect people to set according to their usecase.  We
expect our model to work for *everybody*.  A boot flag would be
marginally better but it still reeks too much of tunable.

Maybe CONFIG_FAILABLE_SMALL_ALLOCS.  Maybe something more euphemistic.
But I honestly can't think of anything that wouldn't scream "horrible
leak of implementation details."  The user just shouldn't ever care.

Given that there are usually several stages of various testing between
when a commit gets merged upstream and when it finally makes it into a
critical production system, maybe we don't need to provide userspace
control over this at all?

So what value do we choose?

Once we kick the OOM killer we should give the victim some time to
exit and then try the allocation again.  Looping just ONCE after that
means we scan all the LRU pages in the system a second time and invoke
the shrinkers another twelve times, with ratios approaching 1.  If the
OOM killer doesn't yield an allocatable page after this, I see very
little point in going on.  After all, we expect all our callers to
handle errors.

So why not just pass an "oomed" bool to should_alloc_retry() and bail
on small allocations at that point?  Put it upstream and deal with the
fallout long before this hits critical infrastructure?  By presumably
fixing up caller error handling and GFP flags?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2015-03-16 21:12 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-11 20:54 [PATCH 0/2] Move away from non-failing small allocations Michal Hocko
2015-03-11 20:54 ` Michal Hocko
2015-03-11 20:54 ` Michal Hocko
2015-03-11 20:54 ` [PATCH 1/2] mm: Allow small allocations to fail Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-12 12:54   ` Tetsuo Handa
2015-03-12 12:54     ` Tetsuo Handa
2015-03-12 13:12     ` Michal Hocko
2015-03-12 13:12       ` Michal Hocko
2015-03-15  5:43   ` Tetsuo Handa
2015-03-15  5:43     ` Tetsuo Handa
2015-03-15 12:13     ` Michal Hocko
2015-03-15 12:13       ` Michal Hocko
2015-03-15 13:06       ` Tetsuo Handa
2015-03-15 13:06         ` Tetsuo Handa
2015-03-16  7:46         ` [PATCH 1/2 v2] " Michal Hocko
2015-03-16  7:46           ` Michal Hocko
2015-03-16 21:11           ` Johannes Weiner [this message]
2015-03-16 21:11             ` Johannes Weiner
2015-03-17 10:25             ` Michal Hocko
2015-03-17 10:25               ` Michal Hocko
2015-03-17 13:29               ` Johannes Weiner
2015-03-17 13:29                 ` Johannes Weiner
2015-03-17 14:17                 ` Michal Hocko
2015-03-17 14:17                   ` Michal Hocko
2015-03-17 17:26                   ` Johannes Weiner
2015-03-17 17:26                     ` Johannes Weiner
2015-03-17 19:41                     ` Michal Hocko
2015-03-17 19:41                       ` Michal Hocko
2015-03-18  9:10                       ` Vlastimil Babka
2015-03-18  9:10                         ` Vlastimil Babka
2015-03-18 12:04                         ` Michal Hocko
2015-03-18 12:04                           ` Michal Hocko
2015-03-18 12:36                         ` Tetsuo Handa
2015-03-18 12:36                           ` Tetsuo Handa
2015-03-18 11:35                       ` Tetsuo Handa
2015-03-18 11:35                         ` Tetsuo Handa
2015-03-17 11:13           ` Tetsuo Handa
2015-03-17 11:13             ` Tetsuo Handa
2015-03-17 13:15             ` Michal Hocko
2015-03-17 13:15               ` Michal Hocko
2015-03-18 11:33               ` Tetsuo Handa
2015-03-18 11:33                 ` Tetsuo Handa
2015-03-18 12:23                 ` Michal Hocko
2015-03-18 12:23                   ` Michal Hocko
2015-03-19 11:03                   ` Tetsuo Handa
2015-03-19 11:03                     ` Tetsuo Handa
2015-03-11 20:54 ` [PATCH 2/2] mmotm: Enable small allocation " Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-11 20:54   ` Michal Hocko
2015-03-11 22:36 ` [PATCH 0/2] Move away from non-failing small allocations Sasha Levin
2015-03-11 22:36   ` Sasha Levin
2015-03-11 22:36   ` Sasha Levin
2015-03-16 22:38 ` Andrew Morton
2015-03-16 22:38   ` Andrew Morton
2015-03-16 22:38   ` Andrew Morton
2015-03-17  9:07   ` Michal Hocko
2015-03-17  9:07     ` Michal Hocko
2015-03-17 14:06     ` Tetsuo Handa
2015-03-17 14:06       ` Tetsuo Handa
2015-04-02 11:53       ` Tetsuo Handa
2015-04-02 11:53         ` Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150316211146.GA15456@phnom.home.cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.