From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751547AbdHaUB7 (ORCPT ); Thu, 31 Aug 2017 16:01:59 -0400
Received: from mail-pg0-f52.google.com ([74.125.83.52]:33934 "EHLO
	mail-pg0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751241AbdHaUB4 (ORCPT ); Thu, 31 Aug 2017 16:01:56 -0400
X-Google-Smtp-Source: ADKCNb7LINIvq8JNUdB9NAUIWL/MKF3ME4p2onS/vB00v7K/vqvglIWrMmpm+3z08JtzN/73SJcVog==
Date: Thu, 31 Aug 2017 13:01:54 -0700 (PDT)
From: David Rientjes
X-X-Sender: rientjes@chino.kir.corp.google.com
To: Roman Gushchin
cc: Michal Hocko, linux-mm@kvack.org, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v6 2/4] mm, oom: cgroup-aware OOM killer
In-Reply-To: <20170831133423.GA30125@castle.DHCP.thefacebook.com>
Message-ID:
References: <20170823165201.24086-3-guro@fb.com>
	<20170824114706.GG5943@dhcp22.suse.cz>
	<20170824122846.GA15916@castle.DHCP.thefacebook.com>
	<20170824125811.GK5943@dhcp22.suse.cz>
	<20170824135842.GA21167@castle.DHCP.thefacebook.com>
	<20170824141336.GP5943@dhcp22.suse.cz>
	<20170824145801.GA23457@castle.DHCP.thefacebook.com>
	<20170825081402.GG25498@dhcp22.suse.cz>
	<20170830112240.GA4751@castle.dhcp.TheFacebook.com>
	<20170831133423.GA30125@castle.DHCP.thefacebook.com>
User-Agent: Alpine 2.10 (DEB 1266 2009-07-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 31 Aug 2017, Roman Gushchin wrote:

> So, it looks to me that we're close to an acceptable version,
> and the only remaining question is the default behavior
> (when oom_group is not set).
>

Nit: without knowledge of the implementation, I still don't think I would
know what an "out of memory group" is. Out of memory doesn't necessarily
imply a kill.
I suggest oom_kill_all or something that includes the verb.

> Michal suggests to ignore non-oom_group memcgs, and compare tasks with
> memcgs with oom_group set. This makes the whole thing completely opt-in,
> but then we probably need another knob (or value) to select between
> "select memcg, kill biggest task" and "select memcg, kill all tasks".

It seems like that would either bias toward or bias against cgroups that
opt-in. I suggest comparing memory cgroups at each level in the hierarchy
based on your new badness heuristic, regardless of any tunables it has
enabled. Then kill either the largest process or all the processes
attached depending on oom_group or oom_kill_all.

> Also, as the whole thing is based on comparison between processes and
> memcgs, we probably need oom_priority for processes.

I think with the constraints of cgroup v2 that a victim memcg must first
be chosen, and then a victim process attached to that memcg must be chosen
or all eligible processes attached to it be killed, depending on the
tunable.

The simplest and most clear way to define this, in my opinion, is to
implement a heuristic that compares sibling memcgs based on usage, as you
have done. This can be overridden by a memory.oom_priority that userspace
defines, and is enough support for them to change victim selection (no
mount option needed, just set memory.oom_priority). Then kill the largest
process or all eligible processes attached.

We only use per-process priority to override process selection compared to
sibling memcgs, but with cgroup v2 process constraints it doesn't seem like
that is within the scope of your patchset.
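
For illustration only, the selection order described above could be modeled
roughly as follows. This is a userspace Python toy, not kernel code and not
anything from the patchset. The Memcg class, its field names, and both
functions are hypothetical. It also assumes two things the thread does not
specify: that a larger memory.oom_priority makes a memcg the preferred
victim among its siblings, with usage only breaking ties, and that a
parent's usage is the sum of its own processes and its descendants.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Memcg:
    """Toy model of a memory cgroup (hypothetical, not a kernel structure)."""
    name: str
    oom_priority: int = 0        # stands in for memory.oom_priority (assumed semantics)
    oom_kill_all: bool = False   # stands in for oom_group / oom_kill_all
    children: List["Memcg"] = field(default_factory=list)
    procs: Dict[int, int] = field(default_factory=dict)  # pid -> memory in bytes

    def usage(self) -> int:
        """Hierarchical usage: own processes plus all descendants."""
        return sum(self.procs.values()) + sum(c.usage() for c in self.children)


def select_victim_memcg(root: Memcg) -> Memcg:
    """Walk down from root, comparing only sibling memcgs at each level.

    Siblings are ranked by (oom_priority, usage), so the priority overrides
    the usage heuristic and usage decides between equal priorities. The
    kill-all tunable plays no part in the comparison.
    """
    node = root
    while node.children:
        node = max(node.children, key=lambda c: (c.oom_priority, c.usage()))
    return node


def select_victim_pids(root: Memcg) -> List[int]:
    """Pick the victim memcg first, then the process(es) inside it."""
    victim = select_victim_memcg(root)
    if not victim.procs:
        return []
    if victim.oom_kill_all:
        return sorted(victim.procs)                    # kill every attached process
    return [max(victim.procs, key=victim.procs.get)]   # kill only the largest


if __name__ == "__main__":
    a = Memcg("A", children=[Memcg("A/1", procs={1: 300}),
                             Memcg("A/2", procs={2: 100})])
    b = Memcg("B", oom_kill_all=True, procs={3: 200, 4: 150})
    root = Memcg("root", children=[a, b])
    print(select_victim_pids(root))   # A has the larger usage, so A/1 is chosen
    b.oom_priority = 1
    print(select_victim_pids(root))   # priority now overrides usage, so B is chosen
```

In this model the kill-all setting only changes what happens after a memcg
has been selected, so opting in neither attracts nor deflects selection.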