From: Roman Gushchin <guro@fb.com>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	<cgroups@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-mm@kvack.org>
Subject: Re: [patch v3 -mm 3/6] mm, memcg: add hierarchical usage oom policy
Date: Mon, 16 Jul 2018 11:16:17 -0700	[thread overview]
Message-ID: <20180716181613.GA28327@castle> (raw)
In-Reply-To: <alpine.DEB.2.21.1807131605590.217600@chino.kir.corp.google.com>

On Fri, Jul 13, 2018 at 04:07:29PM -0700, David Rientjes wrote:
> One of the three significant concerns brought up about the cgroup-aware
> oom killer is that its decision-making can be completely evaded by
> creating subcontainers and attaching processes such that the ancestor's
> usage does not exceed that of another cgroup on the system.
> 
> Consider the example from the previous patch where "memory" is set in
> each mem cgroup's cgroup.controllers:
> 
> 	mem cgroup	cgroup.procs
> 	==========	============
> 	/cg1		1 process consuming 250MB
> 	/cg2		3 processes consuming 100MB each
> 	/cg3/cg31	2 processes consuming 100MB each
> 	/cg3/cg32	2 processes consuming 100MB each
> 
> If memory.oom_policy is "cgroup", a process from /cg2 is chosen because it
> is in the single indivisible memory consumer with the greatest usage.
> 
> The true usage of /cg3 is actually 400MB, but a process from /cg2 is
> chosen because cgroups are compared individually rather than
> hierarchically.
> 
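> For illustration, a sketch of how this layout could be set up (a
> hypothetical example, assuming cgroup v2 is mounted at /sys/fs/cgroup
> and using the memory.oom_policy file introduced by this series):
> 
> 	echo +memory > /sys/fs/cgroup/cgroup.subtree_control
> 	mkdir /sys/fs/cgroup/cg1 /sys/fs/cgroup/cg2 /sys/fs/cgroup/cg3
> 	echo +memory > /sys/fs/cgroup/cg3/cgroup.subtree_control
> 	mkdir /sys/fs/cgroup/cg3/cg31 /sys/fs/cgroup/cg3/cg32
> 	# enable cgroup-aware selection for system-wide oom conditions
> 	echo cgroup > /sys/fs/cgroup/memory.oom_policy
> 	# leaf usages: cg1=250MB, cg2=300MB, cg31=200MB, cg32=200MB;
> 	# the leaves are compared individually, so /cg2 is the victim
> 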
> If a system is divided into two users, for example:
> 
> 	mem cgroup	memory.max
> 	==========	==========
> 	/userA		250MB
> 	/userB		250MB
> 
> If /userA runs all of its processes attached directly to its local mem
> cgroup, whereas /userB distributes its processes over a set of
> subcontainers under /userB, /userA will be unfairly penalized.
> 
> There is incentive with cgroup v2 to distribute processes over a set of
> subcontainers if those processes shall be constrained by other cgroup
> controllers; this is a direct result of mandating a single, unified
> hierarchy for cgroups.  A user may also reasonably do this for mem cgroup
> control or statistics.  And, a user may do this to evade the cgroup-aware
> oom killer selection logic.
> 
> This patch adds an oom policy, "tree", that accounts for hierarchical
> usage when comparing cgroups, provided the cgroup-aware oom killer is
> enabled by an ancestor.  This allows administrators, for example, to
> require users in their own top-level mem cgroup subtree to be accounted
> for with hierarchical usage.  In other words, they can no longer evade
> the oom killer by using other controllers or subcontainers.
> 
> If an oom policy of "tree" is in place for a subtree, such as /cg3 above,
> the hierarchical usage is used for comparisons with other cgroups if
> either "cgroup" or "tree" is the oom policy of the oom mem cgroup.  Thus,
> if /cg3/memory.oom_policy is "tree", one of the processes from /cg3's
> subcontainers is chosen for oom kill.
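> 
> For example, a hypothetical sketch of the above:
> 
> 	echo tree > /sys/fs/cgroup/cg3/memory.oom_policy
> 	# /cg3 is now evaluated as a single 400MB consumer, so it is
> 	# preferred over /cg2 (300MB) on a system-wide oom condition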
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 17 ++++++++++++++---
>  include/linux/memcontrol.h              |  5 +++++
>  mm/memcontrol.c                         | 18 ++++++++++++------
>  3 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1113,6 +1113,10 @@ PAGE_SIZE multiple when read back.
>  	memory consumers; that is, they will compare mem cgroup usage rather
>  	than process memory footprint.  See the "OOM Killer" section below.
>  
> +	If "tree", the OOM killer will compare a mem cgroup and its subtree
> +	as a single indivisible memory consumer.  This policy cannot be set
> +	on the root mem cgroup.  See the "OOM Killer" section below.
> +
>  	When an OOM condition occurs, the policy is dictated by the mem
>  	cgroup that is OOM (the root mem cgroup for a system-wide OOM
>  	condition).  If a descendant mem cgroup has a policy of "none", for
> @@ -1120,6 +1124,10 @@ PAGE_SIZE multiple when read back.
>  	the heuristic will still compare mem cgroups as indivisible memory
>  	consumers.
>  
> +	When an OOM condition occurs in a mem cgroup with an OOM policy of
> +	"cgroup" or "tree", the OOM killer will compare mem cgroups with
> +	the "cgroup" policy individually against "tree" policy subtrees.
> +
>    memory.events
>  	A read-only flat-keyed file which exists on non-root cgroups.
>  	The following entries are defined.  Unless specified
> @@ -1355,7 +1363,7 @@ out of memory, its memory.oom_policy will dictate how the OOM killer will
>  select a process, or cgroup, to kill.  Likewise, when the system is OOM,
>  the policy is dictated by the root mem cgroup.
>  
> -There are currently two available oom policies:
> +There are currently three available oom policies:
>  
>   - "none": default, choose the largest single memory hogging process to
>     oom kill, as traditionally the OOM killer has always done.
> @@ -1364,6 +1372,9 @@ There are currently two available oom policies:
>     subtree as an OOM victim and kill at least one process, depending on
>     memory.oom_group, from it.
>  
> + - "tree": choose the cgroup with the largest memory footprint considering
> +   itself and its subtree and kill at least one process.
> +
>  When selecting a cgroup as a victim, the OOM killer will kill the process
>  with the largest memory footprint.  A user can control this behavior by
>  enabling the per-cgroup memory.oom_group option.  If set, it causes the
> @@ -1382,8 +1393,8 @@ Please, note that memory charges are not migrating if tasks
>  are moved between different memory cgroups. Moving tasks with
>  significant memory footprint may affect OOM victim selection logic.
>  If it's a case, please, consider creating a common ancestor for
> -the source and destination memory cgroups and enabling oom_group
> -on ancestor layer.
> +the source and destination memory cgroups and setting a policy of "tree"
> +and enabling oom_group on an ancestor layer.
>  
>  
>  IO
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -77,6 +77,11 @@ enum memcg_oom_policy {
>  	 * mem cgroup as an indivisible consumer
>  	 */
>  	MEMCG_OOM_POLICY_CGROUP,
> +	/*
> +	 * Tree cgroup usage for all descendant memcg groups, treating each mem
> +	 * cgroup and its subtree as an indivisible consumer
> +	 */
> +	MEMCG_OOM_POLICY_TREE,
>  };
>  
>  struct mem_cgroup_reclaim_cookie {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2952,7 +2952,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>  	/*
>  	 * The oom_score is calculated for leaf memory cgroups (including
>  	 * the root memcg).
> -	 * Non-leaf oom_group cgroups accumulating score of descendant
> +	 * Cgroups with oom policy of "tree" accumulate the score of descendant
>  	 * leaf memory cgroups.
>  	 */
>  	rcu_read_lock();
> @@ -2961,10 +2961,11 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
>  
>  		/*
>  		 * We don't consider non-leaf non-oom_group memory cgroups
> -		 * as OOM victims.
> +		 * without the oom policy of "tree" as OOM victims.
>  		 */
>  		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> -		    !mem_cgroup_oom_group(iter))
> +		    !mem_cgroup_oom_group(iter) &&
> +		    iter->oom_policy != MEMCG_OOM_POLICY_TREE)
>  			continue;

Hello, David!

I think there is an inconsistency in the memory.oom_policy definition.
The "none" and "cgroup" policies define how an OOM event scoped to this
particular memory cgroup (or to the whole system, if set on the root) is
handled, and any sub-tree settings do not matter at all, right? Also, if
a memory cgroup has no memory.max set, there is no point in setting
memory.oom_policy on it.

And "tree" is different. It actually changes how the selection algorithm works,
and sub-tree settings do matter in this case.
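
To illustrate (a hypothetical configuration, using this series' knobs):

	# "cgroup" only affects how OOMs scoped to /A itself are handled
	echo cgroup > /sys/fs/cgroup/A/memory.oom_policy
	# "tree" changes how /A/B is scored when an ancestor's OOM killer
	# (here /A's, or the root's) selects a victim
	echo tree > /sys/fs/cgroup/A/B/memory.oom_policy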

I find it very confusing.

Thanks!
