linux-kernel.vger.kernel.org archive mirror
From: Johannes Weiner <hannes@cmpxchg.org>
To: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	"Artem S. Tashkinov" <aros@gmx.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
Date: Tue, 6 Aug 2019 18:01:50 -0400
Message-ID: <20190806220150.GA22516@cmpxchg.org>
In-Reply-To: <CAJuCfpFmOzj-gU1NwoQFmS_pbDKKd2XN=CS1vUV4gKhYCJOUtw@mail.gmail.com>

On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote:
> On Tue, Aug 6, 2019 at 7:36 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 06-08-19 10:27:28, Johannes Weiner wrote:
> > > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> > > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> > > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> > > > >>         return 0;
> > > > >>  }
> > > > >>  module_init(psi_proc_init);
> > > > >> +
> > > > >> +#define OOM_PRESSURE_LEVEL     80
> > > > >> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)
> > > > >
> > > > > 80% of the last 10 seconds spent in full stall would definitely be a
> > > > > problem. If the system was already low on memory (which it probably
> > > > > is, or we would not be reclaiming so hard and registering such a big
> > > > > > stall) then the oom-killer would probably kill something before 8
> > > > > > seconds have passed.
> > > >
> > > > If the OOM killer can act faster, then great! On small embedded systems you probably
> > > > don't enable PSI anyway?
> 
> We use PSI triggers with 1 sec tracking window. PSI averages are less
> useful on such systems because in 10 secs (which is the shortest PSI
> averaging window) memory conditions can change drastically.
> 
> > > > > If my line of thinking is correct, then do we really
> > > > > benefit from such additional protection mechanism? I might be wrong
> > > > > here because my experience is limited to embedded systems with
> > > > > relatively small amounts of memory.
> > > >
> > > > Well, Artem in his original mail describes a minutes-long stall. Things are
> > > > really different on a fast desktop/laptop with SSD. I have experienced this as
> > > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> > > > 8GB in the laptop). IMHO the default limit should be set so that the user
> > > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> > > > seconds should be fine.
> > >
> > > That's exactly what I have experienced in the past, and this was also
> > > the consistent story in the bug reports we have had.
> > >
> > > I suspect it requires a certain combination of RAM size, CPU speed,
> > > and IO capacity: the OOM killer kicks in when reclaim fails, which
> > > happens when all scanned LRU pages were locked and under IO. So IO
> > > needs to be slow enough, or RAM small enough, that the CPU can scan
> > > all LRU pages while they are temporarily unreclaimable (page lock).
> > >
> > > It may well be that on phones the RAM is small enough relative to CPU
> > > size.
> > >
> > > But on desktops/servers, we frequently see that there is a wider
> > > window of memory consumption in which reclaim efficiency doesn't drop
> > > low enough for the OOM killer to kick in. In the time it takes the CPU
> > > to scan through RAM, enough pages will have *just* finished reading
> > > for reclaim to free them again and continue to make "progress".
> > >
> > > We do know that the OOM killer might not kick in for at least 20-25
> > > minutes while the system is entirely unresponsive. People usually
> > > don't wait this long before forcibly rebooting. In a managed fleet,
> > > ssh heartbeat tests eventually fail and force a reboot.
> 
> Got it. Thanks for the explanation.
> 
> > > I'm not sure 10s is the perfect value here, but I do think the kernel
> > > should try to get out of such a state, where interacting with the
> > > system is impossible, within a reasonable amount of time.
> > >
> > > It could be a little too short for non-interactive number-crunching
> > > systems...
> >
> > Would it be possible to have a module with tuning knobs as parameters
> > and hook into the PSI infrastructure? People can play with the settings
> > to suit their needs, we wouldn't really have to think about the
> > user-visible API for the tuning, and this could be easily adopted as an
> > opt-in mechanism without a risk of regressions.

It's relatively easy to trigger a livelock that disables the entire
system for good, as a regular user. It's a little weird to make the
bug fix for that an opt-in with an extensive configuration interface.

This isn't like the hung task watchdog, where it's likely some kind
of kernel issue, right? This can happen on any current kernel.

What I would like to have is a way of self-recovery from a livelock. I
don't mind making it opt-out in case we make mistakes, but the kernel
should provide minimal self-protection out of the box, IMO.

> PSI averages stalls over 10, 60 and 300 seconds, so implementing 3
> corresponding thresholds would be easy. The patch Johannes posted can
> be extended to support 3 thresholds instead of 1. I can take a stab at
> it if Johannes is busy.
> If we want more flexibility we could use PSI triggers with
> configurable tracking window but that's more complex and probably not
> worth it.

This goes into quality-of-service for workloads territory again. I'm
not quite convinced yet we want to go there.
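
For illustration only, the three-threshold idea quoted above (one limit
per PSI averaging window of 10, 60 and 300 seconds) could look roughly
like the userspace sketch below. It is not the posted patch and not
kernel code. It assumes the "full" line format of /proc/pressure/memory
("full avg10=... avg60=... avg300=... total=..."). The struct names,
function names and limit values are hypothetical, apart from 80, which
mirrors OOM_PRESSURE_LEVEL from the quoted diff.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stall percentages for the three PSI averaging windows. */
struct psi_avgs {
	double avg10;
	double avg60;
	double avg300;
};

/* Hypothetical per-window limits, in percent of wall time stalled. */
struct psi_limits {
	double avg10;
	double avg60;
	double avg300;
};

/*
 * Parse one line of a PSI pressure file, e.g.
 *   "full avg10=85.00 avg60=12.50 avg300=3.10 total=123456"
 * Returns true only if the line parses and its first word equals
 * `kind` ("some" or "full").
 */
static bool psi_parse_line(const char *line, const char *kind,
			   struct psi_avgs *out)
{
	char name[8];

	if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf",
		   name, &out->avg10, &out->avg60, &out->avg300) != 4)
		return false;

	return strcmp(name, kind) == 0;
}

/* True if any of the three averages has reached its limit. */
static bool psi_over_limit(const struct psi_avgs *avgs,
			   const struct psi_limits *limits)
{
	return avgs->avg10 >= limits->avg10 ||
	       avgs->avg60 >= limits->avg60 ||
	       avgs->avg300 >= limits->avg300;
}
```

A caller would read the "full" line from /proc/pressure/memory, pass it
to psi_parse_line(), and compare the result against limits such as
{ 80.0, 50.0, 30.0 }. Only the first of those values comes from the
quoted diff; the other two are placeholders.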


Thread overview: 48+ messages
2019-08-04  9:23 Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure Artem S. Tashkinov
2019-08-05 12:13 ` Vlastimil Babka
2019-08-05 13:31   ` Michal Hocko
2019-08-05 16:47     ` Suren Baghdasaryan
2019-08-05 18:55     ` Johannes Weiner
2019-08-06  9:29       ` Michal Hocko
2019-08-05 19:31   ` Johannes Weiner
2019-08-06  1:08     ` Suren Baghdasaryan
2019-08-06  9:36       ` Vlastimil Babka
2019-08-06 14:27         ` Johannes Weiner
2019-08-06 14:36           ` Michal Hocko
2019-08-06 16:27             ` Suren Baghdasaryan
2019-08-06 22:01               ` Johannes Weiner [this message]
2019-08-07  7:59                 ` Michal Hocko
2019-08-07 20:51                   ` Johannes Weiner
2019-08-07 21:01                     ` Andrew Morton
2019-08-07 21:34                       ` Johannes Weiner
2019-08-07 21:12                     ` Johannes Weiner
2019-08-08 11:48                     ` Michal Hocko
2019-08-08 15:10                       ` ndrw.xf
2019-08-08 16:32                         ` Michal Hocko
2019-08-08 17:57                           ` ndrw.xf
2019-08-08 18:59                             ` Michal Hocko
2019-08-08 21:59                               ` ndrw
2019-08-09  8:57                                 ` Michal Hocko
2019-08-09 10:09                                   ` ndrw
2019-08-09 10:50                                     ` Michal Hocko
2019-08-09 14:18                                       ` Pintu Agarwal
2019-08-10 12:34                                       ` ndrw
2019-08-12  8:24                                         ` Michal Hocko
2019-08-10 21:07                                   ` ndrw
2021-07-24 17:32                         ` Alexey Avramov
2019-08-08 14:47                     ` Vlastimil Babka
2019-08-08 17:27                       ` Johannes Weiner
2019-08-09 14:56                         ` Vlastimil Babka
2019-08-09 17:31                           ` Johannes Weiner
2019-08-13 13:47                             ` Vlastimil Babka
2019-08-06 21:43       ` James Courtier-Dutton
2019-08-06 19:00 ` Florian Weimer
2019-08-20  6:46 ` Daniel Drake
2019-08-21 21:42   ` James Courtier-Dutton
2019-08-29 12:29     ` Michal Hocko
2019-09-02 20:15     ` Pavel Machek
2019-08-23  1:54   ` ndrw
2019-08-23  2:14     ` Daniel Drake
     [not found] <20190805090514.5992-1-hdanton@sina.com>
2019-08-05 12:01 ` Artem S. Tashkinov
2019-08-06  8:57 Johannes Buchner
2019-08-06 19:43 Remi Gauvin
