Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

From: "程垲涛 Chengkaitao Cheng" <chengkaitao@didiglobal.com>
To: Michal Hocko <mhocko@suse.com>
Cc: "roman.gushchin@linux.dev" <roman.gushchin@linux.dev>,
	Tao pilgrim <pilgrimtao@gmail.com>,
	"tj@kernel.org" <tj@kernel.org>,
	"lizefan.x@bytedance.com" <lizefan.x@bytedance.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"shakeelb@google.com" <shakeelb@google.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"songmuchun@bytedance.com" <songmuchun@bytedance.com>,
	"cgel.zte@gmail.com" <cgel.zte@gmail.com>,
	"ran.xiaokai@zte.com.cn" <ran.xiaokai@zte.com.cn>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"zhengqi.arch@bytedance.com" <zhengqi.arch@bytedance.com>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>,
	"Liam.Howlett@oracle.com" <Liam.Howlett@oracle.com>,
	"chengzhihao1@huawei.com" <chengzhihao1@huawei.com>,
	"haolee.swjtu@gmail.com" <haolee.swjtu@gmail.com>,
	"yuzhao@google.com" <yuzhao@google.com>,
	"willy@infradead.org" <willy@infradead.org>,
	"vasily.averin@linux.dev" <vasily.averin@linux.dev>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"surenb@google.com" <surenb@google.com>,
	"sfr@canb.auug.org.au" <sfr@canb.auug.org.au>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"sujiaxun@uniontech.com" <sujiaxun@uniontech.com>,
	"feng.tang@intel.com" <feng.tang@intel.com>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"Bagas Sanjaya" <bagasdotme@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed
Date: Thu, 1 Dec 2022 13:05:31 +0000
Message-ID: <EF1DC035-442F-4BAE-B86F-6C6B10B4A094@didiglobal.com>
In-Reply-To: <Y4htjRAX1v7ZzC/z@dhcp22.suse.cz>

At 2022-12-01 17:02:05, "Michal Hocko" <mhocko@suse.com> wrote:
>On Thu 01-12-22 07:49:04, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-01 07:29:11, "Roman Gushchin" <roman.gushchin@linux.dev> wrote:
>[...]
>> >The problem is that the decision which process(es) to kill or preserve
>> >is individual to a specific workload (and can be even time-dependent
>> >for a given workload). 
>> 
>> It is correct to kill a process with high workload, but it may not be the 
>> most appropriate choice. I think the specific process to kill needs to be 
>> decided by the user. I think that was the original intention of the 
>> score_adj design.
>
>I guess what Roman tries to say here is that there is no obviously _correct_
>oom victim candidate. Well, except for a very narrow situation when
>there is a memory leak that consumes most of the memory over time. But
>that is really hard to identify by the oom selection algorithm in
>general.
> 
>> >So it's really hard to come up with an in-kernel
>> >mechanism which is at the same time flexible enough to work for the majority
>> >of users and reliable enough to serve as the last oom resort measure (which
>> >is the basic goal of the kernel oom killer).
>> >
>> Our goal is to find a method that is less intrusive to the existing 
>> mechanisms of the kernel, and find a more reasonable supplement 
>> or alternative to the limitations of score_adj.
>> 
>> >Previously the consensus was to keep the in-kernel oom killer dumb and reliable
>> >and implement complex policies in userspace (e.g. systemd-oomd etc).
>> >
>> >Is there a reason why such approach can't work in your case?
>> 
>> I think that as kernel developers, we should try our best to provide 
>> users with simpler and more powerful interfaces. It is clear that the 
>> current oom score mechanism has many limitations. Users need to 
>> do a lot of timed loop detection in order to complete work similar 
>> to the oom score mechanism, or develop a new mechanism just to 
>> bypass the imperfect oom score mechanism. This is inefficient, and 
>> it is forced on users.
>
>You are right that it makes sense to address typical usecases in the
>kernel if that is possible. But oom victim selection is really hard
>without a deeper understanding of the actual workload. The more clever
>we try to be the more corner cases we can produce. Please note that this
>has proven to be the case in the long oom development history. We used
>to sacrifice child processes over a large process to preserve work or
>prefer younger processes. Both those strategies led to problems.
>
>Memcg protection based mechanism sounds like an interesting idea because
>it mimics a reclaim protection scheme but I am a bit sceptical it will
>be practically useful. Mostly for 2 reasons. a) memory reclaim protection
>can be dynamically tuned based on reclaim/refault/psi metrics. oom
>events are rare and mostly a failure situation. This limits any feedback
>based approach IMHO. b) Hierarchical nature of the protection will make
>it quite hard to configure properly with predictable outcome.
>
More and more users want to save costs as much as possible by setting 
memory.max to a very small value. This results in a small number of oom 
events, which users can tolerate, but they still want to minimize the 
impact of those events. In such scenarios, oom events are no longer 
abnormal and unpredictable, so we need to provide convenient oom policies 
for users to choose from.
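
For illustration only (this is not part of the patch): a minimal Python 
sketch of the kind of timed loop detection that userspace needs today in 
order to notice oom kills in a cgroup. It assumes cgroup v2 mounted at 
/sys/fs/cgroup; the cgroup name "workload" and the helper names are 
hypothetical.

```python
import time
from pathlib import Path

# Hypothetical cgroup path; adjust to the local cgroup v2 hierarchy.
CGROUP = Path("/sys/fs/cgroup/workload")


def parse_memory_events(text):
    """Parse the flat-keyed 'memory.events' format into a dict of counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def new_oom_kills(previous, current):
    """Return how many oom_kill events happened between two snapshots."""
    return current.get("oom_kill", 0) - previous.get("oom_kill", 0)


def watch(cgroup=CGROUP, interval=1.0):
    """Poll memory.events forever and report each batch of new oom kills."""
    previous = parse_memory_events((cgroup / "memory.events").read_text())
    while True:
        time.sleep(interval)
        current = parse_memory_events((cgroup / "memory.events").read_text())
        killed = new_oom_kills(previous, current)
        if killed > 0:
            print(f"{cgroup}: {killed} new oom kill(s)")
        previous = current
```

A loop like this only observes kills after they have happened. It cannot 
influence which process the kernel selects, which is the gap discussed 
here.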

Users have a greater say in oom victim selection, but they cannot see 
other users, so they cannot accurately formulate their own oom policies. 
This is a contradiction. Therefore, we hope that each user's customized 
policy can be independent and not interfere with the policies of other 
users.

>-- 
>Michal Hocko
>SUSE Labs