From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753449AbbFHTwG (ORCPT <rfc822;w@1wt.eu>);
	Mon, 8 Jun 2015 15:52:06 -0400
Received: from mail-ig0-f180.google.com ([209.85.213.180]:34975 "EHLO
	mail-ig0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753215AbbFHTvz (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 8 Jun 2015 15:51:55 -0400
Date: Mon, 8 Jun 2015 12:51:53 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
X-X-Sender: rientjes@chino.kir.corp.google.com
To: Michal Hocko <mhocko@suse.cz>
cc: Andrew Morton <akpm@linux-foundation.org>,
        Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>, linux-mm@kvack.org,
        LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is
 configured
In-Reply-To: <20150605111302.GB26113@dhcp22.suse.cz>
Message-ID: <alpine.DEB.2.10.1506081242250.13272@chino.kir.corp.google.com>
References: <1433159948-9912-1-git-send-email-mhocko@suse.cz> <alpine.DEB.2.10.1506041607020.16555@chino.kir.corp.google.com> <20150605111302.GB26113@dhcp22.suse.cz>
User-Agent: Alpine 2.10 (DEB 1266 2009-07-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 5 Jun 2015, Michal Hocko wrote:

> > Nack, this is not the appropriate response to exit path livelocks.  By 
> > doing this, you are going to start unnecessarily panicking machines that 
> > have panic_on_oom set when it would not have triggered before.  If there 
> > is no reclaimable memory and a process that has already been signaled to 
> > die and is in the process of exiting has to allocate memory, it is 
> > perfectly acceptable to give them access to memory reserves so they can 
> > allocate and exit.  Under normal circumstances, that allows the process to 
> > naturally exit.  With your patch, it will cause the machine to panic.
> 
> Isn't that what the administrator of the system wants? The system
> is _clearly_ out of memory at this point. A coincidental exiting task
> doesn't change a lot in that regard. Moreover it increases a risk of
> unnecessarily unresponsive system which is what panic_on_oom tries to
> prevent from. So from my POV this is a clear violation of the user
> policy.
> 

We rely on the functionality that this patch short-circuits because we 
rely on userspace to trigger oom kills.  For system oom conditions, we 
must then rely on the kernel oom killer to set TIF_MEMDIE, since userspace 
cannot grant it to itself.  (I think the memcg case is very similar in that 
this patch short-circuits it too, but I'm more concerned about the system 
oom case because it's a show stopper for us.)

We want to send the SIGKILL, which will interrupt things like 
get_user_pages(), which we find is the culprit most of the time.  When the 
process enters the exit path, it must allocate other memory (slab, 
coredumping and the very problematic proc_exit_connector()) in order to 
free memory.  This patch would cause the machine to panic rather than 
utilizing memory reserves so that the process can exit, and it would do so 
not as a result of a kernel oom kill but as a result of a userspace kill.

panic_on_oom exists to suppress the kernel oom killer.  It's not a sysctl 
that triggers whenever watermarks are hit, and it doesn't suppress memory 
reserves from being used for things like GFP_ATOMIC.  Setting TIF_MEMDIE 
for an exiting process is another type of memory reserve, and it is 
imperative that we have it to make forward progress.  panic_on_oom should 
only trigger when the kernel can't make forward progress without killing 
something (not true in this case).  I believe that's how the documentation 
has always been interpreted and how the tunable has been used in the wild.
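
To make the ordering concrete, here is a rough userspace model of the 
decision I'm describing.  It is only a sketch, not kernel code: the enum, 
the struct and oom_decide() are names made up for illustration, and the 
real checks in the oom killer are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names are kernel symbols. */
enum oom_action {
	OOM_GRANT_RESERVES,	/* set TIF_MEMDIE on the caller so it can exit */
	OOM_PANIC,		/* panic_on_oom fires */
	OOM_KILL_VICTIM		/* kernel oom killer has to pick a victim */
};

struct oom_ctx {
	bool caller_has_fatal_signal;	/* SIGKILL is already pending */
	bool caller_is_exiting;		/* caller is already in the exit path */
	bool panic_on_oom;		/* vm.panic_on_oom is non-zero */
};

/*
 * Ordering argued for above: a dying or exiting caller gets access to
 * memory reserves first.  panic_on_oom is only consulted once forward
 * progress would require the kernel to kill something.
 */
enum oom_action oom_decide(const struct oom_ctx *ctx)
{
	assert(ctx != NULL);

	if (ctx->caller_has_fatal_signal || ctx->caller_is_exiting)
		return OOM_GRANT_RESERVES;
	if (ctx->panic_on_oom)
		return OOM_PANIC;
	return OOM_KILL_VICTIM;
}
```

In this model, moving the panic_on_oom test above the first check is the 
behavior I'm objecting to: a task killed by userspace would then panic the 
machine instead of being allowed to exit.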

It would be interesting to consider your other patch that refactors the 
sysrq+f tunable.  I think we should make that never trigger panic_on_oom 
(the sysadmin can use other sysrqs for that) and allow userspace to use 
sysrq+f as a trigger when it is responsive enough to handle oom conditions.
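
For reference, a userspace handler could issue the same request as sysrq+f 
by writing 'f' to /proc/sysrq-trigger.  The helper below is a minimal 
sketch under that assumption: write_sysrq_key() is a made-up name, the path 
is passed in so that it can be exercised against an ordinary file, and 
writing to the real file requires root.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a single sysrq command key to trigger_path.
 * Passing "/proc/sysrq-trigger" and 'f' asks the kernel to run the oom
 * killer, the same as the sysrq+f key combination.
 * Returns 0 on success and -1 on failure.
 */
int write_sysrq_key(const char *trigger_path, char key)
{
	FILE *fp;
	int failed;

	assert(trigger_path != NULL);

	fp = fopen(trigger_path, "w");
	if (fp == NULL)
		return -1;

	failed = (fputc(key, fp) == EOF);
	/* The write is buffered, so errors may only show up at fclose(). */
	if (fclose(fp) != 0)
		failed = 1;

	return failed ? -1 : 0;
}
```

A daemon handling oom conditions would call 
write_sysrq_key("/proc/sysrq-trigger", 'f') once it decides a kill is 
needed.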

But this patch itself can't possibly be merged.