From: Johannes Weiner <hannes@cmpxchg.org>
To: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	"Artem S. Tashkinov" <aros@gmx.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
Date: Tue, 6 Aug 2019 18:01:50 -0400
Message-ID: <20190806220150.GA22516@cmpxchg.org>
In-Reply-To: <CAJuCfpFmOzj-gU1NwoQFmS_pbDKKd2XN=CS1vUV4gKhYCJOUtw@mail.gmail.com>

On Tue, Aug 06, 2019 at 09:27:05AM -0700, Suren Baghdasaryan wrote:
> On Tue, Aug 6, 2019 at 7:36 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 06-08-19 10:27:28, Johannes Weiner wrote:
> > > On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> > > > On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> > > > >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> > > > >>         return 0;
> > > > >>  }
> > > > >>  module_init(psi_proc_init);
> > > > >> +
> > > > >> +#define OOM_PRESSURE_LEVEL     80
> > > > >> +#define OOM_PRESSURE_PERIOD    (10 * NSEC_PER_SEC)
> > > > >
> > > > > 80% of the last 10 seconds spent in full stall would definitely be a
> > > > > problem. If the system was already low on memory (which it probably
> > > > > is, or we would not be reclaiming so hard and registering such a big
> > > > > > stall) then the OOM killer would probably kill something before 8 seconds
> > > > > > have passed.
> > > >
> > > > > If the OOM killer can act faster, then great! On small embedded systems you probably
> > > > don't enable PSI anyway?
> 
> We use PSI triggers with 1 sec tracking window. PSI averages are less
> useful on such systems because in 10 secs (which is the shortest PSI
> averaging window) memory conditions can change drastically.
> 
> > > > > If my line of thinking is correct, then do we really
> > > > > benefit from such additional protection mechanism? I might be wrong
> > > > > here because my experience is limited to embedded systems with
> > > > > relatively small amounts of memory.
> > > >
> > > > > Well, Artem in his original mail describes a minutes-long stall. Things are
> > > > really different on a fast desktop/laptop with SSD. I have experienced this as
> > > > well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> > > > 8GB in the laptop). IMHO the default limit should be set so that the user
> > > > doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> > > > seconds should be fine.
> > >
> > > That's exactly what I have experienced in the past, and this was also
> > > the consistent story in the bug reports we have had.
> > >
> > > I suspect it requires a certain combination of RAM size, CPU speed,
> > > and IO capacity: the OOM killer kicks in when reclaim fails, which
> > > happens when all scanned LRU pages were locked and under IO. So IO
> > > needs to be slow enough, or RAM small enough, that the CPU can scan
> > > all LRU pages while they are temporarily unreclaimable (page lock).
> > >
> > > It may well be that on phones the RAM is small enough relative to CPU
> > > size.
> > >
> > > But on desktops/servers, we frequently see that there is a wider
> > > window of memory consumption in which reclaim efficiency doesn't drop
> > > low enough for the OOM killer to kick in. In the time it takes the CPU
> > > to scan through RAM, enough pages will have *just* finished reading
> > > for reclaim to free them again and continue to make "progress".
> > >
> > > We do know that the OOM killer might not kick in for at least 20-25
> > > minutes while the system is entirely unresponsive. People usually
> > > don't wait this long before forcibly rebooting. In a managed fleet,
> > > ssh heartbeat tests eventually fail and force a reboot.
> 
> Got it. Thanks for the explanation.
> 
> > > I'm not sure 10s is the perfect value here, but I do think the kernel
> > > should try to get out of such a state, where interacting with the
> > > system is impossible, within a reasonable amount of time.
> > >
> > > It could be a little too short for non-interactive number-crunching
> > > systems...
> >
> > Would it be possible to have a module with tuning knobs as parameters
> > and hook into the PSI infrastructure? People can play with the settings
> > to suit their needs, we wouldn't really have to think about the
> > user-visible API for the tuning, and this could be easily adopted as an
> > opt-in mechanism without a risk of regressions.

It's relatively easy to trigger a livelock that disables the entire
system for good, as a regular user. It's a little weird to make the
bug fix for that an opt-in with an extensive configuration interface.

This isn't like the hung task watchdog, where it's likely some kind
of kernel issue, right? This can happen on any current kernel.

What I would like to have is a way of self-recovery from a livelock. I
don't mind making it opt-out in case we make mistakes, but the kernel
should provide minimal self-protection out of the box, IMO.

> PSI averages stalls over 10, 60 and 300 seconds, so implementing 3
> corresponding thresholds would be easy. The patch Johannes posted can
> be extended to support 3 thresholds instead of 1. I can take a stab at
> it if Johannes is busy.
> If we want more flexibility we could use PSI triggers with
> configurable tracking window but that's more complex and probably not
> worth it.

This goes into quality-of-service for workloads territory again. I'm
not quite convinced yet we want to go there.
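
For reference, the three-threshold idea quoted above can be sketched in
userspace against the "full" line of /proc/pressure/memory. This is only
an illustration under stated assumptions, not the posted kernel patch:
struct psi_avgs, psi_parse_line() and psi_over_threshold() are
hypothetical names, and only the 80% limit for the 10s window comes from
the patch. Limits for the 60s and 300s windows would have to be chosen by
the caller.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three PSI running averages (percent). */
struct psi_avgs {
	double avg10;
	double avg60;
	double avg300;
};

/*
 * Parse one line in the /proc/pressure/memory format, e.g.
 *   "full avg10=0.00 avg60=0.00 avg300=0.00 total=0"
 * Returns 0 if the line parses and its first word equals @kind
 * ("some" or "full"), -1 otherwise.
 */
static int psi_parse_line(const char *line, const char *kind,
			  struct psi_avgs *out)
{
	char name[8];
	unsigned long long total;

	if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   name, &out->avg10, &out->avg60, &out->avg300, &total) != 5)
		return -1;
	return strcmp(name, kind) == 0 ? 0 : -1;
}

/* Return 1 if any window is at or above its limit (percent), else 0. */
static int psi_over_threshold(const struct psi_avgs *a,
			      const struct psi_avgs *limit)
{
	return a->avg10 >= limit->avg10 ||
	       a->avg60 >= limit->avg60 ||
	       a->avg300 >= limit->avg300;
}
```

A caller would read the "full" line periodically, pass it to
psi_parse_line(), and act when psi_over_threshold() returns 1. Since the
averages are percentages, setting a limit above 100 disables that window.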

