From: "程垲涛 Chengkaitao Cheng" <chengkaitao@didiglobal.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Tao pilgrim <pilgrimtao@gmail.com>,
	"tj@kernel.org" <tj@kernel.org>,
	"lizefan.x@bytedance.com" <lizefan.x@bytedance.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"roman.gushchin@linux.dev" <roman.gushchin@linux.dev>,
	"shakeelb@google.com" <shakeelb@google.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"songmuchun@bytedance.com" <songmuchun@bytedance.com>,
	"cgel.zte@gmail.com" <cgel.zte@gmail.com>,
	"ran.xiaokai@zte.com.cn" <ran.xiaokai@zte.com.cn>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"zhengqi.arch@bytedance.com" <zhengqi.arch@bytedance.com>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>,
	"Liam.Howlett@oracle.com" <Liam.Howlett@oracle.com>,
	"chengzhihao1@huawei.com" <chengzhihao1@huawei.com>,
	"haolee.swjtu@gmail.com" <haolee.swjtu@gmail.com>,
	"yuzhao@google.com" <yuzhao@google.com>,
	"willy@infradead.org" <willy@infradead.org>,
	"vasily.averin@linux.dev" <vasily.averin@linux.dev>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"surenb@google.com" <surenb@google.com>,
	"sfr@canb.auug.org.au" <sfr@canb.auug.org.au>,
	"mcgrof@kernel.org" <mcgrof@kernel.org>,
	"sujiaxun@uniontech.com" <sujiaxun@uniontech.com>,
	"feng.tang@intel.com" <feng.tang@intel.com>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"Bagas Sanjaya" <bagasdotme@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed
Date: Fri, 2 Dec 2022 08:37:52 +0000	[thread overview]
Message-ID: <771CC621-A19E-4174-B3D0-F451B1D7D69A@didiglobal.com> (raw)
In-Reply-To: <Y4jFnY7kMdB8ReSW@dhcp22.suse.cz>

At 2022-12-01 23:17:49, "Michal Hocko" <mhocko@suse.com> wrote:
>On Thu 01-12-22 14:30:11, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-01 21:08:26, "Michal Hocko" <mhocko@suse.com> wrote:
>> >On Thu 01-12-22 13:44:58, Michal Hocko wrote:
>> >> On Thu 01-12-22 10:52:35, 程垲涛 Chengkaitao Cheng wrote:
>> >> > At 2022-12-01 16:49:27, "Michal Hocko" <mhocko@suse.com> wrote:
>> >[...]
>> >> There is a misunderstanding: oom.protect does not replace the user's 
>> >> tailored policies. Its purpose is to make it easier and more efficient for 
>> >> users to customize policies, and to keep users from completely abandoning 
>> >> the oom score in order to formulate new policies.
>> >
>> > Then you should focus on explaining how this makes those policies easier 
>> > and more efficient. I do not see it.
>> 
>> In fact, there is some relevant content in the earlier messages of this thread. 
>> If oom.protect is applied, it has the following benefits:
>> 1. Users only need to focus on managing their local cgroup, not on the 
>> impact on other users' cgroups.
>
>Protection-based balancing cannot really work in isolation.

I think a cgroup only needs to be concerned with the protection values of its 
child cgroups, which are independent in a certain sense.

>> 2. Users and the system do not need to spend extra time on complicated and 
>> repeated scanning and configuration. They just need to configure the 
>> oom.protect of specific cgroups, which is a one-time task.
>
>This will not work, in the same way that the memory reclaim protection cannot 
>work in isolation at the memcg level.

The parent cgroup's oom.protect can change the actual protected memory size 
of the child cgroup, which is exactly what we need. Because of this, a child 
cgroup can set its own oom.protect freely.
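
To make the intended semantics more concrete, here is a minimal userspace 
sketch (not code from the patch) of one possible way a parent's effective 
protection could bound a child's claim, by analogy with the effective 
memory.low/min calculation; the identifiers and the exact distribution rule 
are assumptions for illustration only:

#include <stdio.h>

/*
 * Hypothetical rule: clamp the child's request to its usage, then scale
 * it down when the children's requests oversubscribe the parent's
 * effective protection.
 */
static unsigned long eprotect(unsigned long protect, unsigned long usage,
			      unsigned long parent_eprotect,
			      unsigned long children_protect_sum)
{
	unsigned long claim = protect < usage ? protect : usage;

	if (!children_protect_sum)
		return 0;
	if (children_protect_sum <= parent_eprotect)
		return claim;
	return claim * parent_eprotect / children_protect_sum;
}

int main(void)
{
	/* a child asks for 2G of protection while using 6G; the parent
	 * has 3G of effective protection and its children ask for 4G in
	 * total (all values in MiB) */
	printf("effective protection = %lu MiB\n",
	       eprotect(2048, 6144, 3072, 4096));
	return 0;
}

With such a rule the child can set oom.protect to any value, but what it 
actually receives is bounded by the parent, which is the behaviour described 
above.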

>> >> > >Why cannot you simply discount the protection from all processes
>> >> > >equally? I do not follow why the task_usage has to play any role in
>> >> > >that.
>> >> > 
>> >> > If all processes are protected equally, the oom protection of a cgroup is 
>> >> > meaningless. For example, a cgroup with more processes could protect more 
>> >> > memory, which is unfair to cgroups with fewer processes. So we need to keep 
>> >> > the total amount of memory that all processes in the cgroup protect 
>> >> > consistent with the value of eoom.protect.
>> >> 
>> >> You are mixing two different concepts together I am afraid. The per
>> >> memcg protection should protect the cgroup (i.e. all processes in that
>> >> cgroup) while you want it to be also process aware. This results in a
>> >> very unclear runtime behavior when a process from a more protected memcg
>> >> is selected based on its individual memory usage.
>> >
>> The correct statement here should be that each memcg's protection should 
>> protect the amount of memory specified by oom.protect. For example, if a 
>> cgroup's usage is 6G and its oom.protect is 2G, then when an oom kill occurs, 
>> in the worst case we will only reduce the memory used by this cgroup to 2G 
>> through the oom killer.
>
>I do not see how that could be guaranteed. Please keep in mind that a
>non-trivial amount of memory resources could be completely independent
>of any process life time (just consider tmpfs as a trivial example).
>
>> >Let me be more specific here. Although processes are the primary source of 
>> >memcg charges, the memory accounted for oom badness purposes is not really 
>> >comparable to the overall memcg-charged memory. Kernel memory and non-mapped 
>> >memory can all generate rather interesting corner cases.
>> 
>> Sorry, I did not think carefully enough about some of the special memory 
>> statistics. I will fix it in the next version.
>
>Let me just emphasise that we are talking about a fundamental disconnect.
>RSS-based accounting has been used for the OOM killer selection because
>the memory gets unmapped and _potentially_ freed when the process goes
>away. Memcg charges are bound to the object life time and, as said, in
>many cases there is no direct relation with any process life time.

Based on your comments, I would like to revise the formula as follows:

score = task_usage + score_adj * totalpage
        - eoom.protect * (task_usage - task_rssshare)
          / (local_memcg_usage + local_memcg_swapcache)

After a process is killed, the unmapped cache and shared memory will not be 
released immediately, so they should not consume the cgroup's protection quota. 
In extreme environments, the memory that cannot be released by the oom killer 
(i.e. memory that has not been charged to the process) may occupy a large 
share of the protection quota, but that is expected. Of course, the idea may 
have some problems that I haven't considered.
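
For illustration, here is a small userspace sketch of the revised score above. 
It is not the kernel implementation; task_usage, task_rssshare, eoom.protect, 
local_memcg_usage and local_memcg_swapcache are just the symbols from the 
formula (taken here to be page counts), and the sketch follows the formula 
exactly as written, whereas the in-kernel oom_badness() scales oom_score_adj 
by totalpages / 1000:

#include <stdio.h>

/* All values are page counts; the parameter names mirror the symbols in
 * the formula and are not an existing kernel API. */
static long long oom_score(long long task_usage, long long task_rssshare,
			   long long score_adj, long long totalpage,
			   long long eoom_protect,
			   long long local_memcg_usage,
			   long long local_memcg_swapcache)
{
	long long score = task_usage + score_adj * totalpage;

	/* Discount the cgroup's protection in proportion to the memory
	 * that killing this task could actually free (its usage minus
	 * the part shared with other tasks). */
	score -= eoom_protect * (task_usage - task_rssshare) /
		 (local_memcg_usage + local_memcg_swapcache);
	return score;
}

int main(void)
{
	/* a 1G task sharing 256M, in a memcg using 4G plus 64M of
	 * swapcache and protecting 2G, with score_adj == 0 and 16G of
	 * total pages (4K pages) */
	printf("score = %lld pages\n",
	       oom_score(262144, 65536, 0, 4194304, 524288,
			 1048576, 16384));
	return 0;
}

The intent is that a task whose memory is mostly shared (and so would not be 
freed by killing it) draws little from the protection quota, matching the 
reasoning above.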

>
>Hope that clarifies.

Thanks for your comment!
chengkaitao


