From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751877AbdLJKRO (ORCPT ); Sun, 10 Dec 2017 05:17:14 -0500 Received: from mx2.suse.de ([195.135.220.15]:47496 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751452AbdLJKRL (ORCPT ); Sun, 10 Dec 2017 05:17:11 -0500 Date: Sun, 10 Dec 2017 11:17:09 +0100 From: Michal Hocko To: Tetsuo Handa Cc: surenb@google.com, akpm@linux-foundation.org, hannes@cmpxchg.org, hillf.zj@alibaba-inc.com, minchan@kernel.org, mgorman@techsingularity.net, ying.huang@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, timmurray@google.com, tkjos@google.com Subject: Re: [PATCH v2] mm: terminate shrink_slab loop if signal is pending Message-ID: <20171210101709.GB20234@dhcp22.suse.cz> References: <20171208082220.GQ20234@dhcp22.suse.cz> <20171208114806.GU20234@dhcp22.suse.cz> <201712082303.DDG90166.FOLSHOOFVQJMtF@I-love.SAKURA.ne.jp> <201712091708.GHG60458.MHFOVSFOQtOFLJ@I-love.SAKURA.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201712091708.GHG60458.MHFOVSFOQtOFLJ@I-love.SAKURA.ne.jp> User-Agent: Mutt/1.9.1 (2017-09-22) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat 09-12-17 17:08:42, Tetsuo Handa wrote: > Suren Baghdasaryan wrote: > > On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa > > wrote: > > >> > >> This change checks for pending > > >> > >> fatal signals inside shrink_slab loop and if one is detected > > >> > >> terminates this loop early. > > >> > > > > >> > > This changelog doesn't really address my previous review feedback, I am > > >> > > afraid. You should mention more details about problems you are seeing > > >> > > and what causes them. > > > > The problem I'm facing is that a SIGKILL sent from user space to kill > > the least important process is delayed enough for OOM-killer to get a > > chance to kill something else, possibly a more important process. Here > > "important" is from user's point of view. So the delay in SIGKILL > > delivery effectively causes extra kills. Traces indicate that this > > delay happens when process being killed is in direct reclaim and > > shrinkers (before I fixed them) were the biggest cause for the delay. > > Sending SIGKILL from userspace is not releasing memory fast enough to prevent > the OOM killer from invoking? Yes, under memory pressure, even an attempt to > send SIGKILL from userspace could be delayed due to e.g. page fault. > > Unless it is memcg OOM, you could try OOM notifier callback for checking > whether there are SIGKILL pending processes and wait for timeout if any. Hell no! You surely do not want all the OOM livelocks you were pushing so hard to get fixed, do you? The whole problem here is that there are two implementations of the OOM handling and they do not use any synchronization. You cannot be really surprise they step on each others toes. That is one of the reasons why I really hated the LMK in the kernel btw. Stalling shrinkers is a real problem and it should be addressed but let's not screw an already nasty/fragile code all around that. -- Michal Hocko SUSE Labs