From: Yosry Ahmed <yosryahmed@google.com>
To: "程垲涛 Chengkaitao Cheng" <chengkaitao@didiglobal.com>
Cc: "tj@kernel.org" <tj@kernel.org>,
"lizefan.x@bytedance.com" <lizefan.x@bytedance.com>,
"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
"corbet@lwn.net" <corbet@lwn.net>,
"mhocko@kernel.org" <mhocko@kernel.org>,
"roman.gushchin@linux.dev" <roman.gushchin@linux.dev>,
"shakeelb@google.com" <shakeelb@google.com>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
"brauner@kernel.org" <brauner@kernel.org>,
"muchun.song@linux.dev" <muchun.song@linux.dev>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"zhengqi.arch@bytedance.com" <zhengqi.arch@bytedance.com>,
"ebiederm@xmission.com" <ebiederm@xmission.com>,
"Liam.Howlett@oracle.com" <Liam.Howlett@oracle.com>,
"chengzhihao1@huawei.com" <chengzhihao1@huawei.com>,
"pilgrimtao@gmail.com" <pilgrimtao@gmail.com>,
"haolee.swjtu@gmail.com" <haolee.swjtu@gmail.com>,
"yuzhao@google.com" <yuzhao@google.com>,
"willy@infradead.org" <willy@infradead.org>,
"vasily.averin@linux.dev" <vasily.averin@linux.dev>,
"vbabka@suse.cz" <vbabka@suse.cz>,
"surenb@google.com" <surenb@google.com>,
"sfr@canb.auug.org.au" <sfr@canb.auug.org.au>,
"mcgrof@kernel.org" <mcgrof@kernel.org>,
"feng.tang@intel.com" <feng.tang@intel.com>,
"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
David Rientjes <rientjes@google.com>
Subject: Re: [PATCH v4 0/2] memcontrol: support cgroup level OOM protection
Date: Thu, 25 May 2023 10:19:20 -0700 [thread overview]
Message-ID: <CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com> (raw)
In-Reply-To: <B438A058-7C4A-46B3-B6FB-4CF32BD7D294@didiglobal.com>
On Thu, May 25, 2023 at 1:19 AM 程垲涛 Chengkaitao Cheng
<chengkaitao@didiglobal.com> wrote:
>
> At 2023-05-24 06:02:55, "Yosry Ahmed" <yosryahmed@google.com> wrote:
> >On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng
> ><chengkaitao@didiglobal.com> wrote:
> >>
> >> At 2023-05-20 06:04:26, "Yosry Ahmed" <yosryahmed@google.com> wrote:
> >> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
> >> ><chengkaitao@didiglobal.com> wrote:
> >> >>
> >> >> At 2023-05-18 04:42:12, "Yosry Ahmed" <yosryahmed@google.com> wrote:
> >> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
> >> >> ><chengkaitao@didiglobal.com> wrote:
> >> >> >>
> >> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <yosryahmed@google.com> wrote:
> >> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
> >> >> >> ><chengkaitao@didiglobal.com> wrote:
> >> >> >> >>
> >> >> >>
> >> >> >> Killing processes in order of memory usage cannot effectively protect
> >> >> >> important processes. Killing processes in a user-defined priority order
> >> >> >> will result in a large number of OOM events and still not being able to
> >> >> >> release enough memory. I have been searching for a balance between
> >> >> >> the two methods, so that their shortcomings are not too obvious.
> >> >> >> The biggest advantage of memcg is its tree topology, and I also hope
> >> >> >> to make good use of it.
> >> >> >
> >> >> >For us, killing processes in a user-defined priority order works well.
> >> >> >
> >> >> >It seems like to tune memory.oom.protect you use oom_kill_inherit to
> >> >> >observe how many times this memcg has been killed due to a limit in an
> >> >> >ancestor. Wouldn't it be more straightforward to specify the priority
> >> >> >of protections among memcgs?
> >> >> >
> >> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >> >increase memory.oom.protect for more, based on their importance.
> >> >> >Otherwise, if you increase all of them, then there is no point if all
> >> >> >the memory is protected, right?
> >> >>
> >> >> If all memory in memcg is protected, its meaning is similar to that of the
> >> >> highest priority memcg in your approach, which is ultimately killed or
> >> >> never killed.
> >> >
> >> >Makes sense. I believe it gets a bit trickier when you want to
> >> >describe relative ordering between memcgs using memory.oom.protect.
> >>
> >> Actually, my original intention was not to use memory.oom.protect to
> >> achieve relative ordering between memcgs, it was just a feature that
> >> happened to be achievable. My initial idea was to protect a certain
> >> proportion of memory in memcg from being killed, and through the
> >> method, physical memory can be reasonably planned. Both the physical
> >> machine manager and container manager can add some unimportant
> >> loads beyond the oom.protect limit, greatly improving the oversold
> >> rate of memory. In the worst case scenario, the physical machine can
> >> always provide all the memory limited by memory.oom.protect for memcg.
> >>
> >> On the other hand, I also want to achieve relative ordering of internal
> >> processes in memcg, not just a unified ordering of all memcgs on
> >> physical machines.
> >
> >For us, having a strict priority ordering-based selection is
> >essential. We have different tiers of jobs of different importance,
> >and a job of higher priority should not be killed before a lower
> >priority task if possible, no matter how much memory either of them is
> >using. Protecting memcgs solely based on their usage can be useful in
> >some scenarios, but not in a system where you have different tiers of
> >jobs running with strict priority ordering.
>
> If you want to run with strict priority ordering, it can also be achieved,
> but it may be quite troublesome. The directory structure shown below
> can achieve the goal.
>
> root
> / \
> cgroup A cgroup B
> (protect=max) (protect=0)
> / \
> cgroup C cgroup D
> (protect=max) (protect=0)
> / \
> cgroup E cgroup F
> (protect=max) (protect=0)
>
> Oom kill order: F > E > C > A
This requires restructuring the cgroup hierarchy which comes with a
lot of other factors, I don't think that's practically an option.
>
> As mentioned earlier, "running with strict priority ordering" may be
> some extreme issues, that requires the manager to make a choice.
We have been using strict priority ordering in our fleet for many
years now and we depend on it. Some jobs are simply more important
than others, regardless of their usage.
>
> >>
> >> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >> >relative priority among the memcgs?
> >> >> >
> >> >> >>
> >> >> >> >If this approach works for you (or any other audience), that's great,
> >> >> >> >I can share more details and perhaps we can reach something that we
> >> >> >> >can both use :)
> >> >> >>
> >> >> >> If you have a good idea, please share more details or show some code.
> >> >> >> I would greatly appreciate it
> >> >> >
> >> >> >The code we have needs to be rebased onto a different version and
> >> >> >cleaned up before it can be shared, but essentially it is as
> >> >> >described.
> >> >> >
> >> >> >(a) All processes and memcgs start with a default score.
> >> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >> >score means higher priority (aka less score gets killed first).
> >> >> >(c) The OOM killer essentially looks for the memcg with the lowest
> >> >> >scores to kill, then among this memcg, it looks for the process with
> >> >> >the lowest score. Ties are broken based on usage, so essentially if
> >> >> >all processes/memcgs have the default score, we fallback to the
> >> >> >current OOM behavior.
> >> >>
> >> >> If memory oversold is severe, all processes of the lowest priority
> >> >> memcg may be killed before selecting other memcg processes.
> >> >> If there are 1000 processes with almost zero memory usage in
> >> >> the lowest priority memcg, 1000 invalid kill events may occur.
> >> >> To avoid this situation, even for the lowest priority memcg,
> >> >> I will leave him a very small oom.protect quota.
> >> >
> >> >I checked internally, and this is indeed something that we see from
> >> >time to time. We try to avoid that with userspace OOM killing, but
> >> >it's not 100% effective.
> >> >
> >> >>
> >> >> If faced with two memcgs with the same total memory usage and
> >> >> priority, memcg A has more processes but less memory usage per
> >> >> single process, and memcg B has fewer processes but more
> >> >> memory usage per single process, then when OOM occurs, the
> >> >> processes in memcg B may continue to be killed until all processes
> >> >> in memcg B are killed, which is unfair to memcg B because memcg A
> >> >> also occupies a large amount of memory.
> >> >
> >> >I believe in this case we will kill one process in memcg B, then the
> >> >usage of memcg A will become higher, so we will pick a process from
> >> >memcg A next.
> >>
> >> If there is only one process in memcg A and its memory usage is higher
> >> than any other process in memcg B, but the total memory usage of
> >> memcg A is lower than that of memcg B. In this case, if the OOM-killer
> >> still chooses the process in memcg A. it may be unfair to memcg A.
> >>
> >> >> Dose your approach have these issues? Killing processes in a
> >> >> user-defined priority is indeed easier and can work well in most cases,
> >> >> but I have been trying to solve the cases that it cannot cover.
> >> >
> >> >The first issue is relatable with our approach. Let me dig more info
> >> >from our internal teams and get back to you with more details.
>
> --
> Thanks for your comment!
> chengkaitao
>
>
prev parent reply other threads:[~2023-05-25 17:25 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-17 3:20 [PATCH v4 0/2] memcontrol: support cgroup level OOM protection chengkaitao
2023-05-17 3:20 ` [PATCH v4 1/2] mm: memcontrol: protect the memory in cgroup from being oom killed chengkaitao
2023-05-17 3:20 ` [PATCH v4 2/2] memcg: add oom_kill_inherit event indicator chengkaitao
2023-05-17 6:59 ` [PATCH v4 0/2] memcontrol: support cgroup level OOM protection Yosry Ahmed
2023-05-17 8:01 ` 程垲涛 Chengkaitao Cheng
2023-05-17 8:09 ` Yosry Ahmed
2023-05-17 10:01 ` 程垲涛 Chengkaitao Cheng
2023-05-17 20:42 ` Yosry Ahmed
2023-05-18 5:12 ` 程垲涛 Chengkaitao Cheng
2023-05-19 22:04 ` Yosry Ahmed
2023-05-20 9:52 ` 程垲涛 Chengkaitao Cheng
2023-05-23 22:02 ` Yosry Ahmed
2023-05-25 8:19 ` 程垲涛 Chengkaitao Cheng
2023-05-25 17:19 ` Yosry Ahmed [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com \
--to=yosryahmed@google.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=cgroups@vger.kernel.org \
--cc=chengkaitao@didiglobal.com \
--cc=chengzhihao1@huawei.com \
--cc=corbet@lwn.net \
--cc=ebiederm@xmission.com \
--cc=feng.tang@intel.com \
--cc=hannes@cmpxchg.org \
--cc=haolee.swjtu@gmail.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lizefan.x@bytedance.com \
--cc=mcgrof@kernel.org \
--cc=mhocko@kernel.org \
--cc=muchun.song@linux.dev \
--cc=pilgrimtao@gmail.com \
--cc=rientjes@google.com \
--cc=roman.gushchin@linux.dev \
--cc=sfr@canb.auug.org.au \
--cc=shakeelb@google.com \
--cc=surenb@google.com \
--cc=tj@kernel.org \
--cc=vasily.averin@linux.dev \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
--cc=zhengqi.arch@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).