linux-mm.kvack.org archive mirror
From: Mina Almasry <almasrymina@google.com>
To: Roman Gushchin <roman.gushchin@linux.dev>,
	Yosry Ahmed <yosryahmed@google.com>
Cc: chengkaitao <pilgrimtao@gmail.com>,
	tj@kernel.org, lizefan.x@bytedance.com,  hannes@cmpxchg.org,
	corbet@lwn.net, mhocko@kernel.org, shakeelb@google.com,
	 akpm@linux-foundation.org, songmuchun@bytedance.com,
	cgel.zte@gmail.com,  ran.xiaokai@zte.com.cn,
	viro@zeniv.linux.org.uk, zhengqi.arch@bytedance.com,
	 ebiederm@xmission.com, Liam.Howlett@oracle.com,
	chengzhihao1@huawei.com,  haolee.swjtu@gmail.com,
	yuzhao@google.com, willy@infradead.org,  vasily.averin@linux.dev,
	vbabka@suse.cz, surenb@google.com,  sfr@canb.auug.org.au,
	mcgrof@kernel.org, sujiaxun@uniontech.com,  feng.tang@intel.com,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	 linux-mm@kvack.org
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed
Date: Thu, 1 Dec 2022 12:18:33 -0800
Message-ID: <CAHS8izN3ej1mqUpnNQ8c-1Bx5EeO7q5NOkh0qrY_4PLqc8rkHA@mail.gmail.com>
In-Reply-To: <Y4fnRyIp17NXpti9@P9FQF9L96D.corp.robot.car>

On Wed, Nov 30, 2022 at 3:29 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Nov 30, 2022 at 03:01:58PM +0800, chengkaitao wrote:
> > From: chengkaitao <pilgrimtao@gmail.com>
> >
> > We created a new memory interface, <memory.oom.protect>. If the OOM
> > killer is invoked under a parent memory cgroup and the memory usage of a
> > child cgroup is within its effective oom.protect boundary, the cgroup's
> > tasks won't be OOM killed unless there are no unprotected tasks in other
> > child cgroups. Its inheritance logic is modeled on that of
> > <memory.min/low>.
> >
> > It has the following advantages:
> > 1. We can protect more important processes when a memcg's OOM killer
> > runs. oom.protect only takes effect within the local memcg and does
> > not affect the OOM killer of the host.
> > 2. Historically, oom_score_adj has often been used to control a group of
> > processes. That requires all processes in the cgroup to have a common
> > parent process, whose oom_score_adj has to be set before it forks its
> > child processes, so it is very difficult to apply in other situations.
> > oom.protect has no such restriction, so we can protect a cgroup of
> > processes more easily. The cgroup can keep some memory even if the OOM
> > killer has to be called.
>
> It reminds me of our attempts to provide a more sophisticated cgroup-aware oom
> killer. The problem is that the decision about which process(es) to kill or preserve
> is individual to a specific workload (and can be even time-dependent
> for a given workload). So it's really hard to come up with an in-kernel
> mechanism which is at the same time flexible enough to work for the majority
> of users and reliable enough to serve as the last oom resort measure (which
> is the basic goal of the kernel oom killer).
>
> Previously the consensus was to keep the in-kernel oom killer dumb and reliable
> and implement complex policies in userspace (e.g. systemd-oomd etc).
>
> Is there a reason why such an approach can't work in your case?
>

FWIW we run into similar issues, and the systemd-oomd approach doesn't
work reliably enough for us to disable the kernel oom-killer. The
issue, as I understand it, is that when the machine is under heavy memory
pressure our userspace oom-killer fails to run quickly enough to save
the machine from getting completely stuck. Why our oom-killer fails to
run is more nuanced. There are cases where it seems to be stuck itself
trying to acquire memory to do the oom-killing, or stuck on some lock
that needs to be released by a process that is itself stuck trying to
acquire memory before it can release the lock, etc.

When the kernel oom-killer does run we would like to shield the
important jobs from it and kill the batch jobs or restartable
processes instead. So internally we have a feature similar to what is
proposed here. Our design is a bit different: we let userspace pretty
much completely override the oom_badness score:

1. Every process has /proc/pid/oom_score_badness which overrides the
kernel's calculation if set.
2. Every memcg has a memory.oom_score_badness which indicates this
memcg's oom importance.

On global oom the kernel pretty much kills the baddest process in the
baddest memcg, so we can 'protect' the important jobs from
oom-killing that way.
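
To make the selection order concrete, here is a rough userspace model of
the two-level pick described above. It is only a sketch, not our kernel
code: the class and function names are made up for illustration, I'm
assuming a higher value means "more killable", and the memcg hierarchy
is flattened to a single level.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Process:
    pid: int
    kernel_badness: int                      # stand-in for the kernel's own score
    badness_override: Optional[int] = None   # models /proc/pid/oom_score_badness

    def effective_badness(self) -> int:
        # Point 1: a value set by userspace overrides the kernel's calculation.
        if self.badness_override is not None:
            return self.badness_override
        return self.kernel_badness


@dataclass
class Memcg:
    name: str
    oom_score_badness: int                   # models memory.oom_score_badness
    procs: List[Process] = field(default_factory=list)


def pick_victim(memcgs: List[Memcg]) -> Optional[Process]:
    """Pick the baddest process in the baddest non-empty memcg.

    Assumption: higher numbers are worse; ties go to the first entry seen.
    """
    candidates = [m for m in memcgs if m.procs]
    if not candidates:
        return None
    worst = max(candidates, key=lambda m: m.oom_score_badness)
    return max(worst.procs, key=lambda p: p.effective_badness())


# Hypothetical setup: one important serving job, one batch memcg.
serving = Memcg("serving", oom_score_badness=10,
                procs=[Process(100, kernel_badness=900)])
batch = Memcg("batch", oom_score_badness=500,
              procs=[Process(200, kernel_badness=50),
                     Process(201, kernel_badness=20, badness_override=800)])

victim = pick_victim([serving, batch])
print(victim.pid)
```

In this toy setup the serving process has by far the largest kernel
score, but it is never considered because its memcg is ranked as less
bad than the batch one.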

I haven't tried upstreaming this because I assume there would be
little appetite for it as a general feature, but if the general use
case is interesting to some, it would be good to collaborate on some
way for folks who enable the kernel oom-killer to shield certain
important jobs.

> Thanks!
>


