From: David Rientjes <rientjes@google.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>,
Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
Date: Fri, 17 Jul 2020 12:26:07 -0700 (PDT)
Message-ID: <alpine.DEB.2.23.453.2007171212210.3398972@chino.kir.corp.google.com>
In-Reply-To: <CALOAHbA5J23Fo3AmdANbPa_dDbjXJzLGb3PaZF8emfNENfcaJA@mail.gmail.com>
On Fri, 17 Jul 2020, Yafang Shao wrote:
> > > Actually the kernel is doing it now, see below,
> > >
> > > dump_header() <<<< dump lots of information
> > > __oom_kill_process
> > > p = find_lock_task_mm(victim);
> > > if (!p)
> > > return; <<<< without killing any process.
> > >
> >
> > Ah, this is catching an instance where the chosen process has already done
> > exit_mm(), good catch -- I can find examples of this by scraping kernel
> > logs from our fleet.
> >
> > So it appears there is precedent for dumping all the oom info without
> > actually performing any action, and I made the earlier point that
> > diagnostic information in the kernel log is still useful here. I think
> > it is still preferable that the kernel at least tell us why it didn't
> > do anything, but as you mention that already happens today.
> >
> > Would you like to send a patch that checks for mem_cgroup_margin() here as
> > well? A second patch could make the possible inaction more visible,
> > something like "Process ${pid} (${comm}) is already exiting" for the above
> > check or "Memcg ${memcg} is no longer out of memory".
> >
> > Another thing that these messages indicate, beyond telling us why the oom
> > killer didn't actually SIGKILL anything, is that we can expect some skew
> > in the memory stats that shows an availability of memory.
> >
>
> Agreed, these messages would be helpful.
> I will send a patch for it.
>
Thanks Yafang. We should also continue talking, in a separate thread, about
the challenges you encounter with the oom killer, whether at the system
level or for memcg limit ooms. It's clear that you are running into several
of the issues that we have previously seen ourselves.
I could do a full audit of all our oom killer changes that may be
interesting to you, but off the top of my head:
- A means of triggering a memcg oom through the kernel: think of sysrq+f
but scoped to processes attached to a memcg hierarchy. This allows
userspace to reliably oom kill processes on overcommitted systems
(SIGKILL can be insufficient if we depend on oom reaping, for example,
to make forward progress)
- Storing the state of a memcg's memory at the time reclaim has failed
and we must oom kill: when the memcg oom killer is disabled so that
userspace can handle it, if it triggers an oom kill through the kernel
because it prefers an oom kill on an overcommitted system, we need to
dump the state of the memory at oom rather than with the stack of the
explicit trigger
- Supplement memcg oom notification with an additional notification event
on kernel oom kill: allows users to register for an event that triggers
when the kernel oom killer kills something (and keeps a count of these
events available for read)
- Add a notion of an oom delay: on overcommitted systems, userspace may
  become unreliable or unresponsive despite our best efforts. This
  supplements the ability to disable the oom killer for a memcg hierarchy
  with the ability to disable it only for a set period of time, after
  which the oom killer intervenes and kills something (last ditch effort).
I'd be happy to discuss any of these topics if you are interested.