Date: Tue, 28 Jul 2020 14:04:13 -0400
From: Johannes Weiner
To: David Rientjes
Cc: Michal Hocko, Yafang Shao, Tetsuo Handa, Andrew Morton, Linux MM
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
Message-ID: <20200728180237.GA387006@cmpxchg.org>
References: <1594735034-19190-1-git-send-email-laoar.shao@gmail.com>
 <20200716060814.GA31089@dhcp22.suse.cz>

On Wed, Jul 15, 2020 at 11:56:11PM -0700, David Rientjes wrote:
> On Thu, 16 Jul 2020, Michal Hocko wrote:
> 
> > > I don't think moving the mem_cgroup_margin() check to out_of_memory()
> > > right before printing the oom info and killing the process is a very
> > > invasive patch. Any strong preference against doing it that way? I think
> > > moving the check as late as possible to save a process from being killed
> > > when racing with an exiter or killed process (including perhaps current)
> > > has a pretty clear motivation.
> > 
> > We have been through this discussion several times in the past, IIRC.
> > The conclusion has been that the allocator (the charging path for the
> > memcg) is the one to define the OOM situation. This is an inherently
> > racy situation as long as we are not synchronizing oom with the world,
> > which, I believe we agree, we do not want to do. There are a few
> > exceptions to bail out early from the oom under certain situations, and
> > the trend was to remove some of the existing ones rather than add new
> > ones, because they had subtle side effects and were prone to lockups.
> > 
> > As much as it might sound attractive to move mem_cgroup_margin(), resp.
> > the last allocation attempt, closer to the actual oom killing, I haven't
> > seen any convincing data showing that such a change would make a big
> > difference. select_bad_process() is not a free operation, as it scales
> > with the number of tasks in the oom domain, but it shouldn't be super
> > expensive. The oom reporting is by far the most expensive part of the
> > operation.
> > 
> > That being said, really convincing data should be presented in order
> > to do such a change. I do not think we want to do that just in case.
> 
> It's not possible to present data because we've had such a check for years
> in our fleet, so I can't say that it has prevented X unnecessary oom kills
> compared to doing the check prior to calling out_of_memory(). I'm hoping
> that can be understood.

That makes sense.
I would just be curious whether you can remember what data points you
looked at at the time you made the change. Were you inspecting
individual OOM kills and seeing in the memory dump after the fact that
the kill would have been unnecessary? Or were you looking at aggregate
OOM kill rates in your fleet and seeing a measurable reduction?

> Since Yafang is facing the same issue, and there is no significant
> downside to doing the mem_cgroup_margin() check prior to
> oom_kill_process() (or checking task_will_free_mem(current)), and it's
> acknowledged that it *can* prevent unnecessary oom killing, which is a
> very good thing, I'd like to understand why there is such resistance
> to it.

As someone who has been fairly vocal against this in the past, I have
to admit I don't feel too strongly about it anymore.

My argument in the past has been that if you're going for a
probabilistic reduction of OOM kills, injecting sleeps between reclaim
and the last allocation attempt could also increase the chance of
coincidental parallel memory frees. And that just always seemed
somewhat arbitrary to me. I was also worried we'd be opening a can of
worms by allowing this type of tweaking of the OOM killer, when OOM
kills are supposed to be a once-in-a-blue-moon deadlock breaker rather
than a common resource enforcement strategy.

However, I also have to admit that while artificial sleeps would
indeed be pretty arbitrary and likely controversial, the time it takes
to select a victim is unavoidable. It's something we need to do in any
case. If checking the margin one last time after that helps you bring
down the kill rate somewhat, I'd say go for it. It's a natural line,
and I would agree that the change isn't very invasive.

> Killing a user process is a serious matter. I would fully agree if the
> margin were only one page: then it's still better to kill something
> off. But when a process has uncharged memory by means induced by a
> process waiting on oom notification, such as a userspace kill or
> dropping of caches from your malloc implementation, that uncharge can
> be quite substantial and oom killing is then unnecessary.
> 
> I can refresh the patch and send it formally.

No objection from me.
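
For the archives, here is roughly where I understand the check would
go: an untested sketch against out_of_memory() in mm/oom_kill.c. It
assumes mem_cgroup_margin() is made visible outside memcontrol.c
(today it is static there); the exact helper and placement are of
course up to the actual patch:

	bool out_of_memory(struct oom_control *oc)
	{
		...
		select_bad_process(oc);

		/*
		 * Sketch only: recheck the margin after victim
		 * selection, right before committing to the kill. A
		 * parallel exit or kill may have uncharged enough
		 * memory in the meantime to satisfy the allocation
		 * that triggered this OOM.
		 */
		if (oc->chosen && oc->chosen != (void *)-1UL &&
		    is_memcg_oom(oc) &&
		    mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
			/* Drop the reference select_bad_process() took. */
			put_task_struct(oc->chosen);
			oc->chosen = NULL;
			/* Report progress so the charge path retries. */
			return true;
		}

		if (oc->chosen && oc->chosen != (void *)-1UL)
			oom_kill_process(oc, ...);
		...
	}

That keeps the expensive parts (victim selection, and the even more
expensive oom report) on the slow path and only adds one cheap page
counter read before the point of no return.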