From: "程垲涛 Chengkaitao Cheng" <chengkaitao@didiglobal.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Tao pilgrim <pilgrimtao@gmail.com>,
	"tj@kernel.org" <tj@kernel.org>,
	"lizefan.x@bytedance.com" <lizefan.x@bytedance.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"roman.gushchin@linux.dev" <roman.gushchin@linux.dev>,
	"shakeelb@google.com" <shakeelb@google.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"songmuchun@bytedance.com" <songmuchun@bytedance.com>,
	"cgel.zte@gmail.com" <cgel.zte@gmail.com>,
	"ran.xiaokai@zte.com.cn" <ran.xiaokai@zte.com.cn>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"zhengqi.arch@bytedance.com" <zhengqi.arch@bytedance.com>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>,
	"Liam.Howlett@oracle.com" <Liam.Howlett@oracle.com>,
	"chengzhihao1@huawei.com" <chengzhihao1@huawei.com>,
	"haolee.swjtu@gmail.com" <haolee.swjtu@gmail.com>,
	"yuzhao@google.com" <yuzhao@google.com>,
	"willy@infradead.org" <willy@infradead.org>,
	"vasily.averin@linux.dev" <vasily.averin@linux.dev>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"surenb@google.com" <surenb@google.com>,
	"sfr@canb.auug.org.au" <sfr@canb.auug.org.au>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"sujiaxun@uniontech.com" <sujiaxun@uniontech.com>,
	"feng.tang@intel.com" <feng.tang@intel.com>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"Bagas Sanjaya" <bagasdotme@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed
Date: Thu, 1 Dec 2022 04:52:27 +0000
Message-ID: <E5A5BCC3-460E-4E81-8DD3-88B4A2868285@didiglobal.com>
In-Reply-To: <Y4eEiqwMMkHv9ELM@dhcp22.suse.cz>

At 2022-12-01 00:27:54, "Michal Hocko" <mhocko@suse.com> wrote:
>On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
>> On 2022-11-30 21:15:06, "Michal Hocko" <mhocko@suse.com> wrote:
>> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
>> > > From: chengkaitao <pilgrimtao@gmail.com>
>> > >
>> > > We created a new interface <memory.oom.protect> for memory. If the
>> > > OOM killer is invoked under a parent memory cgroup, and the memory
>> > > usage of a child cgroup is within its effective oom.protect boundary,
>> > > the cgroup's tasks won't be OOM killed unless there are no unprotected
>> > > tasks in the other child cgroups. It draws on the logic of
>> > > <memory.min/low> in the inheritance relationship.
>> >
>> > Could you be more specific about usecases?
>
>This is a very important question to answer.

Use case 1: Users say they want to protect an important process with
high memory consumption from being killed by the OOM killer in case of
a docker container failure, so as to retain more critical on-site
information or to let a self-recovery mechanism run. They suggest
setting the score_adj of this process to -1000, but I don't agree with
that, because the process is only important within its own docker
container, not to the other docker containers on the same physical
machine. If the score_adj of the process is set to -1000, the
probability of OOM in the other containers' processes will increase.

Use case 2: Many business processes and agent processes are mixed
together on a physical machine, and they need to be classified and
protected. However, some agents are the parents of business processes,
and some business processes are the parents of agent processes, so it
is troublesome to set different score_adj values for them. Business
processes and agents cannot determine which level their score_adj
should be at, and if we create yet another agent to set every
process's score_adj, we have to cycle through all the processes on the
physical machine regularly, which looks clumsy.

>> > How do you tune oom.protect
>> > wrt other tunables? How does this interact with the oom_score_adj
>> > tuning (e.g. a first-hand oom victim with score_adj 1000 sitting
>> > in an oom protected memcg)?
>> 
>> We prefer users to use score_adj and oom.protect independently. score_adj is
>> a parameter applicable to the whole host, while oom.protect is a parameter
>> applicable to a cgroup. When the physical machine's memory is particularly
>> large, the score_adj granularity is also very coarse; oom.protect, however,
>> allows more fine-grained adjustment.
>
>Let me clarify a bit. I am not trying to defend oom_score_adj. It has
>its well known limitations and it is essentially unusable for many
>situations other than hiding or auto-selecting a potential oom victim.
>
>> When the score_adj of all the processes is the same, I list the following
>> case for explanation:
>> 
>>           root
>>            |
>>         cgroup A
>>        /        \
>>  cgroup B      cgroup C
>> (task m,n)     (task x,y)
>> 
>> score_adj(all task) = 0;
>> oom.protect(cgroup A) = 0;
>> oom.protect(cgroup B) = 0;
>> oom.protect(cgroup C) = 3G;
>
>How can you enforce protection at C level without any protection at A
>level? 

The basic idea of this scheme is that all processes in the same cgroup
are equally important. If some processes need extra protection, a new
cgroup should be created for them with unified settings. I don't think
it is necessary to enforce protection inside cgroup C, because task x
and task y are equally important. Only among the four processes
(tasks m, n, x and y) in cgroup A is there a distinction between more
and less important tasks.
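
As a concrete illustration of "create a new cgroup for unified settings",
the setup from the example above could look roughly like this. It is a
sketch only: the cgroup v2 mount point, the path names and the "3G" value
are taken from the example, memory.oom.protect is the knob proposed by
this patch, and the pid written at the end is just a placeholder.

/* sketch: carve out child cgroup A/C and give it the 3G protection */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* the hierarchy from the example: A with child C */
	mkdir("/sys/fs/cgroup/A", 0755);
	mkdir("/sys/fs/cgroup/A/C", 0755);

	/* protect the tasks that will live in A/C with the proposed knob */
	write_file("/sys/fs/cgroup/A/C/memory.oom.protect", "3G");

	/* move the important tasks (x, y) into A/C by writing their pids */
	write_file("/sys/fs/cgroup/A/C/cgroup.procs", "PID_OF_TASK_X");
	return 0;
}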

> This would easily allow arbitrary cgroup to hide from the oom
> killer and spill over to other cgroups.

I don't think this will happen, because eoom.protect only takes effect
within the parent cgroup. If oom.protect(parent cgroup) = 0, then from
the perspective of the grandparent cgroup, tasks x and y will not be
specially protected.
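
To make the inheritance concrete, below is a rough sketch of how I
expect the effective protection to be derived, drawing on the
memory.min/low logic mentioned in the patch description. The function
and parameter names are made up for illustration; this is not the
patch's actual code.

/*
 * Illustrative only -- not the patch's code. Assumption: a cgroup's
 * effective protection (eoom.protect) is capped by its own usage and
 * by the share of the parent's effective protection it can claim,
 * analogous to memory.low's effective protection.
 */
static unsigned long effective_oom_protect(unsigned long protect,
					   unsigned long usage,
					   unsigned long parent_eprotect,
					   unsigned long siblings_protect)
{
	/* protection beyond what the cgroup actually uses is meaningless */
	unsigned long eprotect = protect < usage ? protect : usage;

	/* the parent's effective protection is shared among its children */
	if (siblings_protect > parent_eprotect)
		eprotect = eprotect * parent_eprotect / siblings_protect;

	return eprotect;
}

With oom.protect(cgroup A) = 0, such a scheme yields zero effective
protection for cgroup C once the OOM is evaluated above A, which is why
tasks x and y cannot hide from the host-level oom killer.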

>> usage(task m) = 1G
>> usage(task n) = 2G
>> usage(task x) = 1G
>> usage(task y) = 2G
>> 
>> oom killer order of cgroup A: n > m > y > x
>> oom killer order of host:     y = n > x = m
>> 
>> If cgroup A is a directory maintained by users, users can use oom.protect 
>> to protect relatively important tasks x and y.
>> 
>> However, when score_adj and oom.protect are used at the same time, we
>> will also consider the impact of both, as expressed in the following
>> formula, but I have to admit that it is an unstable result.
>> score = task_usage + score_adj * totalpages - eoom.protect * task_usage / local_memcg_usage
>
>I hope I am not misreading but this has some rather unexpected
>properties. First off, bigger memory consumers in a protected memcg are
>protected more. 

Since the cgroup needs to distribute its protection quota reasonably
among all of its processes, I think that processes consuming more
memory should get a larger quota. This is still fair to the processes
consuming less memory: even though the processes consuming more memory
get a larger quota, their oom_score is still higher than that of the
processes consuming less memory. So when the oom killer is invoked
within the local cgroup, the kill order remains unchanged.

>Also I would expect the protection discount would
>be capped by the actual usage otherwise excessive protection
>configuration could skew the results considerably.

In the calculation, we take the minimum of memcg_usage and oom.protect,
so the protection discount is capped by the cgroup's actual usage.
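
Putting the formula above and this capping rule together, a minimal
sketch of the per-task score (illustrative names only, not the patch's
actual code) would be:

/*
 * Minimal sketch of the score formula quoted above, with the
 * protection discount capped by the cgroup's actual usage.
 */
static long oom_protect_score(unsigned long task_usage,	/* pages */
			      long score_adj,
			      unsigned long totalpages,
			      unsigned long oom_protect,	/* pages */
			      unsigned long memcg_usage)	/* pages */
{
	/* cap the protection at what the cgroup actually uses */
	unsigned long eprotect = oom_protect < memcg_usage ?
				 oom_protect : memcg_usage;
	/* share the protection among tasks in proportion to their usage */
	unsigned long discount = memcg_usage ?
				 eprotect * task_usage / memcg_usage : 0;

	return (long)(task_usage - discount) + score_adj * (long)totalpages;
}

Because eprotect <= memcg_usage, a task's discount can never exceed its
own usage, and tasks within one cgroup keep their relative order.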

>> > I haven't really read through the whole patch but this struck me odd.
>> 
>> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
>> > > 	unsigned long totalpages = totalram_pages() + total_swap_pages;
>> > > 	unsigned long points = 0;
>> > > 	long badness;
>> > > +#ifdef CONFIG_MEMCG
>> > > +	struct mem_cgroup *memcg;
>> > > 
>> > > -	badness = oom_badness(task, totalpages);
>> > > +	rcu_read_lock();
>> > > +	memcg = mem_cgroup_from_task(task);
>> > > +	if (memcg && !css_tryget(&memcg->css))
>> > > +		memcg = NULL;
>> > > +	rcu_read_unlock();
>> > > +
>> > > +	update_parent_oom_protection(root_mem_cgroup, memcg);
>> > > +	css_put(&memcg->css);
>> > > +#endif
>> > > +	badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
>> >
>> > the badness means a different thing depending on which memcg hierarchy
>> > subtree you look at. Scaling based on the global oom could get really
>> > misleading.
>> 
>> I also took this into consideration. I planned to change "/proc/pid/oom_score"
>> into a writable node: when a cgroup path is written to it, the score is output
>> relative to that cgroup. The default output is relative to the root cgroup.
>> Do you think this idea is feasible?
>
>I do not follow. Care to elaborate?

Take two examples:
cmd: cat /proc/pid/oom_score
output: the score scaled based on the global (root cgroup) oom context

cmd: echo "/cgroupA/cgroupB" > /proc/pid/oom_score
output: the score scaled based on cgroupB's oom context
(If the task is not in cgroupB's hierarchy subtree, the write fails with an
invalid-parameter error.)

Thanks for your comment!
-- 
Chengkaitao
Best wishes

