From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	"Artem S. Tashkinov" <aros@gmx.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
Date: Wed, 7 Aug 2019 17:34:43 -0400
Message-ID: <20190807213443.GA11227@cmpxchg.org>
In-Reply-To: <20190807140130.7418e783654a9c53e6b6cd1b@linux-foundation.org>

On Wed, Aug 07, 2019 at 02:01:30PM -0700, Andrew Morton wrote:
> On Wed, 7 Aug 2019 16:51:38 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > However, eb414681d5a0 ("psi: pressure stall information for CPU,
> > memory, and IO") introduced a memory pressure metric that
> > quantifies the share of wallclock time in which userspace waits on
> > reclaim, refaults, and swapins. By using absolute time, it encodes
> > all the above-mentioned variables of hardware capacity and
> > workload behavior. When memory pressure is 40%, it means that 40%
> > of the time the workload is stalled on memory, period. This is the
> > actual measure for the lack of forward progress that users
> > experience. It's also something they expect the kernel to manage
> > and remedy when forward progress becomes non-existent.
> > 
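
For reference, that metric is exported through /proc/pressure/memory.
A minimal sketch of reading it back (not part of the patch, just to
show the interface):

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f) {
		perror("/proc/pressure/memory");
		return 1;
	}
	/*
	 * Two lines come back: "some" is the share of time in which
	 * at least one task was stalled on memory, "full" the share
	 * in which all non-idle tasks were stalled simultaneously.
	 */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}
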
> > To accomplish this, the patch implements a thrashing cutoff for
> > the OOM killer. If the kernel determines a sustained high level of
> > memory pressure, and thus a lack of forward progress in userspace,
> > it will trigger the OOM killer to reduce memory contention.
> > 
> > By default, the OOM killer will engage after 15 seconds of at
> > least 80% memory pressure. These values are tunable via the
> > sysctls vm.thrashing_oom_period and vm.thrashing_oom_level.
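
Assuming the patch were applied, those knobs would show up under
/proc/sys/vm/ like any other vm.* sysctl. A hypothetical sketch of
loosening the cutoff, say to 30 seconds at 90% pressure:

#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Hypothetical knobs: they only exist with the proposed
	 * patch applied, not in any mainline kernel. */
	if (write_sysctl("/proc/sys/vm/thrashing_oom_period", "30") ||
	    write_sysctl("/proc/sys/vm/thrashing_oom_level", "90"))
		perror("sysctl write");
	return 0;
}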
> 
> Could be implemented in userspace?
> </troll>

We do in fact do this with oomd.

But it requires a comprehensive cgroup setup, with complete memory and
IO isolation, to protect that daemon from the memory pressure and
excessive paging of the rest of the system (mlock doesn't really cut
it, because you potentially need to allocate quite a few proc dentries
and inodes just to walk the process tree and determine a kill target).
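
A hypothetical minimal version of that carve-out, giving the daemon a
cgroup2 memory floor so reclaim leaves it alone (the path and the 64M
value are made up for illustration):

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	FILE *f;

	/* Hypothetical slice; assumes cgroup2 is mounted here. */
	mkdir("/sys/fs/cgroup/oomd.slice", 0755);

	/*
	 * memory.min guarantees the daemon 64M that global reclaim
	 * won't touch, keeping it responsive while everything else
	 * is thrashing.
	 */
	f = fopen("/sys/fs/cgroup/oomd.slice/memory.min", "w");
	if (!f)
		return 1;
	fprintf(f, "%d", 64 << 20);
	fclose(f);

	/* Move ourselves into the protected group. */
	f = fopen("/sys/fs/cgroup/oomd.slice/cgroup.procs", "w");
	if (!f)
		return 1;
	fprintf(f, "%d", (int)getpid());
	fclose(f);
	return 0;
}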

In a fleet, that works fine, since we need to maintain that cgroup
infra anyway. But for other users, that's a lot of stack for basic
"don't hang forever if I allocate too much memory" functionality.


