* [PATCH v2 0/3] mm: memcontrol: recursive memory protection
@ 2019-12-19 20:07 Johannes Weiner
  2019-12-19 20:07 ` [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
                   ` (5 more replies)
  0 siblings, 6 replies; 52+ messages in thread
From: Johannes Weiner @ 2019-12-19 20:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

Changes since v1:
- improved Changelogs based on the discussion with Roman. Thanks!
- fix div0 when recursive & fixed protection is combined
- fix an unused-variable compiler warning

The current memory.low (and memory.min) semantics require protection
to be assigned to a cgroup in an uninterrupted chain from the
top-level cgroup all the way to the leaf.

In practice, we want to protect entire cgroup subtrees from each other
(system management software vs. workload), but we would like the VM to
balance memory optimally *within* each subtree, without having to make
explicit weight allocations among individual components. The current
semantics make that impossible.

This patch series extends memory.low/min such that the knobs apply
recursively to the entire subtree. Users can still assign explicit
protection to subgroups, but if they don't, the protection set by the
parent cgroup will be distributed dynamically such that children
compete freely - as if no memory control were enabled inside the
subtree - but enjoy protection from neighboring trees.
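
For illustration, here is a minimal userspace sketch (not part of this
series; the "workload" cgroup path and the 10G value are hypothetical)
of what the intended configuration looks like under the new semantics.
Only the top of the subtree gets an explicit setting; the children are
left at their default of 0:

	#include <stdio.h>

	int main(void)
	{
		/* protect the whole subtree with a single knob */
		FILE *f = fopen("/sys/fs/cgroup/workload/memory.low", "w");

		if (!f)
			return 1;
		fputs("10G", f);	/* the interface accepts K/M/G suffixes */
		fclose(f);
		/*
		 * workload/job1, workload/job2, ... keep memory.low=0
		 * and, with the memory_recursiveprot mount option,
		 * dynamically share the 10G among themselves.
		 */
		return 0;
	}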

Patch #1 fixes an existing bug that can give a cgroup tree more
protection than it should receive as per ancestor configuration.

Patch #2 simplifies and documents the existing code to make it easier
to reason about the changes in the next patch.

Patch #3 finally implements recursive memory protection semantics.

Because of a risk of regressing legacy setups, the new semantics are
hidden behind a cgroup2 mount option, 'memory_recursiveprot'.

More details in patch #3.

 Documentation/admin-guide/cgroup-v2.rst |  11 ++
 include/linux/cgroup-defs.h             |   5 +
 kernel/cgroup/cgroup.c                  |  17 ++-
 mm/memcontrol.c                         | 243 +++++++++++++++++++-----------
 mm/page_counter.c                       |  12 +-
 5 files changed, 192 insertions(+), 96 deletions(-)





* [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution
  2019-12-19 20:07 [PATCH v2 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
@ 2019-12-19 20:07 ` Johannes Weiner
  2020-01-30 11:49   ` Michal Hocko
  2019-12-19 20:07 ` [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations Johannes Weiner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2019-12-19 20:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to how much each sibling uses their own
declared protection:

	low_usage = min(memory.low, memory.current)
	elow = parent_elow * (low_usage / siblings_low_usage)

However, siblings_low_usage is not the sum of all low_usages. It sums
up the usages of *only those cgroups that are within their memory.low*.
That means that low_usage can be *bigger* than siblings_low_usage, and
consequently the total protection afforded to the children can be
bigger than what the ancestor grants the subtree.

Consider three groups where two are in excess of their protection:

  A/memory.low = 10G
  A/A1/memory.low = 10G, memory.current = 20G
  A/A2/memory.low = 10G, memory.current = 20G
  A/A3/memory.low = 10G, memory.current =  8G
  siblings_low_usage = 8G (only A3 contributes)

  A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
  A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
  A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G

  (the 12.5G are capped to the explicit memory.low setting of 10G)

With that, the sum of all awarded protection below A is 30G, even
though A only grants 10G for the entire subtree.
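
To make the error concrete, here is a small userspace sketch (an
approximation in floating point; it ignores the exact capping order in
the kernel) that reproduces the numbers above for both the current and
the fixed definition of siblings_low_usage:

	#include <stdio.h>

	#define G (1024.0 * 1024 * 1024)

	/* elow = min(memory.low, parent_elow * low_usage / siblings_low_usage) */
	static double elow(double low, double current,
			   double parent_elow, double siblings_low_usage)
	{
		double low_usage = current < low ? current : low;
		double e = parent_elow * low_usage / siblings_low_usage;

		return e < low ? e : low;
	}

	int main(void)
	{
		/* current code: only A3 contributes, siblings_low_usage = 8G */
		printf("A1=%.1fG A2=%.1fG A3=%.1fG\n",
		       elow(10*G, 20*G, 10*G, 8*G) / G,		/* 10.0 */
		       elow(10*G, 20*G, 10*G, 8*G) / G,		/* 10.0 */
		       elow(10*G,  8*G, 10*G, 8*G) / G);	/* 10.0 -> sum 30G */

		/* fixed: everybody contributes min(low, current) = 28G */
		printf("A1=%.1fG A2=%.1fG A3=%.1fG\n",
		       elow(10*G, 20*G, 10*G, 28*G) / G,	/* ~3.6 */
		       elow(10*G, 20*G, 10*G, 28*G) / G,	/* ~3.6 */
		       elow(10*G,  8*G, 10*G, 28*G) / G);	/* ~2.9 -> sum ~10G */
		return 0;
	}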

What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
they eventually drop below their protection setting, they would be
counted in siblings_low_usage again and the error would right itself.

When reclaim was applied in a binary fashion (cgroup is reclaimed when
it's above its protection, otherwise it's skipped) this would actually
work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
aggression always exclude protection"), reclaim pressure is scaled to
how much a cgroup is above its protection. As a result this
calculation error unduly skews pressure away from A1 and A2 toward the
rest of the system.
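
For reference, the scaled pressure from 1bc63fb1272b works roughly
like the sketch below; this is a simplification of the actual reclaim
code, not a copy of it:

	/* scan less the more of the cgroup is covered by its protection */
	static unsigned long scan_target(unsigned long lruvec_size,
					 unsigned long cgroup_size,
					 unsigned long protection)
	{
		if (protection >= cgroup_size)
			return 0;	/* fully protected: nothing to scan */
		return lruvec_size - lruvec_size * protection / cgroup_size;
	}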

But why did we do it like this in the first place?

The reasoning behind exempting groups in excess from
siblings_low_usage was to go after them first during reclaim in an
overcommitted subtree:

  A/memory.low = 2G, memory.current = 4G
  A/A1/memory.low = 3G, memory.current = 2G
  A/A2/memory.low = 1G, memory.current = 2G

  siblings_low_usage = 2G (only A1 contributes)
  A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G

While the children combined are overcommitting A and are technically
both at fault, A2 is actively declaring unprotected memory and we
would like to reclaim that first.

However, while this sounds like a noble goal on the face of it, it
doesn't make much difference in actual memory distribution: Because A
is overcommitted, reclaim will not stop once A2 gets pushed back to
within its allowance; we'll have to reclaim A1 either way. The end
result is still that protection is distributed proportionally, with A1
getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.

[ If A weren't overcommitted, it wouldn't make a difference since each
  cgroup would just get the protection it declares:

  A/memory.low = 2G, memory.current = 3G
  A/A1/memory.low = 1G, memory.current = 1G
  A/A2/memory.low = 1G, memory.current = 2G

  With the current calculation:

  siblings_low_usage = 1G (only A1 contributes)
  A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G

  Including excess groups in siblings_low_usage:

  siblings_low_usage = 2G
  A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]

Simplify the calculation and fix the proportional reclaim bug by
including excess cgroups in siblings_low_usage.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c   |  4 +---
 mm/page_counter.c | 12 ++----------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..874a0b00f89b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6236,9 +6236,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * elow = min( memory.low, parent->elow * ------------------ ),
  *                                        siblings_low_usage
  *
- *             | memory.current, if memory.current < memory.low
- * low_usage = |
- *	       | 0, otherwise.
+ * low_usage = min(memory.low, memory.current)
  *
  *
  * Such definition of the effective memory.low provides the expected
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
 		return;
 
 	if (c->min || atomic_long_read(&c->min_usage)) {
-		if (usage <= c->min)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->min);
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
 	}
 
 	if (c->low || atomic_long_read(&c->low_usage)) {
-		if (usage <= c->low)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->low);
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.24.1
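
As a reading aid, here is a single-threaded model of the updated
propagate_protected_usage() (an approximation: the kernel version uses
atomic counters and xchg). It shows how publishing min(usage, low) as
a delta keeps the parent's children_low_usage equal to the sum over
all siblings:

	struct counter {
		unsigned long low;
		unsigned long low_usage;		/* last value published upward */
		unsigned long children_low_usage;	/* sum of children's low_usage */
		struct counter *parent;
	};

	static void propagate(struct counter *c, unsigned long usage)
	{
		unsigned long protected;
		long delta;

		if (!c->parent)
			return;
		protected = usage < c->low ? usage : c->low;	/* min(usage, low) */
		delta = (long)protected - (long)c->low_usage;
		c->low_usage = protected;
		c->parent->children_low_usage += delta;
	}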




* [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations
  2019-12-19 20:07 [PATCH v2 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
  2019-12-19 20:07 ` [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
@ 2019-12-19 20:07 ` Johannes Weiner
  2020-01-30 12:54   ` Michal Hocko
  2020-02-21 17:10   ` Michal Koutný
  2019-12-19 20:07 ` [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 52+ messages in thread
From: Johannes Weiner @ 2019-12-19 20:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

The effective protection of any given cgroup is a somewhat complicated
construct that depends on the ancestor's configuration, siblings'
configurations, as well as current memory utilization in all these
groups. It's done this way to satisfy hierarchical delegation
requirements while also making the configuration semantics flexible
and expressive in complex real life scenarios.

Unfortunately, all the rules and requirements are sparsely documented,
and the code is a little too clever in merging different scenarios
into a single min() expression. This makes it hard to reason about the
implementation and avoid breaking semantics when making changes to it.

This patch documents each semantic rule individually and splits out
the handling of the overcommit case from the regular case.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 190 ++++++++++++++++++++++++++----------------------
 1 file changed, 104 insertions(+), 86 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 874a0b00f89b..9c771c4d6339 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6204,65 +6204,55 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.early_init = 0,
 };
 
-/**
- * mem_cgroup_protected - check if memory consumption is in the normal range
- * @root: the top ancestor of the sub-tree being checked
- * @memcg: the memory cgroup to check
- *
- * WARNING: This function is not stateless! It can only be used as part
- *          of a top-down tree iteration, not for isolated queries.
- *
- * Returns one of the following:
- *   MEMCG_PROT_NONE: cgroup memory is not protected
- *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
- *     an unprotected supply of reclaimable memory from other cgroups.
- *   MEMCG_PROT_MIN: cgroup memory is protected
- *
- * @root is exclusive; it is never protected when looked at directly
+/*
+ * This function calculates an individual cgroup's effective
+ * protection which is derived from its own memory.min/low, its
+ * parent's and siblings' settings, as well as the actual memory
+ * distribution in the tree.
  *
- * To provide a proper hierarchical behavior, effective memory.min/low values
- * are used. Below is the description of how effective memory.low is calculated.
- * Effective memory.min values is calculated in the same way.
+ * The following rules apply to the effective protection values:
  *
- * Effective memory.low is always equal or less than the original memory.low.
- * If there is no memory.low overcommittment (which is always true for
- * top-level memory cgroups), these two values are equal.
- * Otherwise, it's a part of parent's effective memory.low,
- * calculated as a cgroup's memory.low usage divided by sum of sibling's
- * memory.low usages, where memory.low usage is the size of actually
- * protected memory.
+ * 1. At the first level of reclaim, effective protection is equal to
+ *    the declared protection in memory.min and memory.low.
  *
- *                                             low_usage
- * elow = min( memory.low, parent->elow * ------------------ ),
- *                                        siblings_low_usage
+ * 2. To enable safe delegation of the protection configuration, at
+ *    subsequent levels the effective protection is capped to the
+ *    parent's effective protection.
  *
- * low_usage = min(memory.low, memory.current)
+ * 3. To make complex and dynamic subtrees easier to configure, the
+ *    user is allowed to overcommit the declared protection at a given
+ *    level. If that is the case, the parent's effective protection is
+ *    distributed to the children in proportion to how much protection
+ *    they have declared and how much of it they are utilizing.
  *
+ *    This makes distribution proportional, but also work-conserving:
+ *    if one cgroup claims much more protection than it uses memory,
+ *    the unused remainder is available to its siblings.
  *
- * Such definition of the effective memory.low provides the expected
- * hierarchical behavior: parent's memory.low value is limiting
- * children, unprotected memory is reclaimed first and cgroups,
- * which are not using their guarantee do not affect actual memory
- * distribution.
+ *    Consider the following example tree:
  *
- * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
+ *        A      A/memory.low = 2G, A/memory.current = 6G
+ *       //\\
+ *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
+ *               C/memory.low = 1G  C/memory.current = 2G
+ *               D/memory.low = 0   D/memory.current = 2G
+ *               E/memory.low = 10G E/memory.current = 0
  *
- *     A      A/memory.low = 2G, A/memory.current = 6G
- *    //\\
- *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
- *            C/memory.low = 1G  C/memory.current = 2G
- *            D/memory.low = 0   D/memory.current = 2G
- *            E/memory.low = 10G E/memory.current = 0
+ *    and memory pressure is applied, the following memory
+ *    distribution is expected (approximately*):
  *
- * and the memory pressure is applied, the following memory distribution
- * is expected (approximately):
+ *      A/memory.current = 2G
+ *      B/memory.current = 1.3G
+ *      C/memory.current = 0.6G
+ *      D/memory.current = 0
+ *      E/memory.current = 0
  *
- *     A/memory.current = 2G
+ *    *assuming equal allocation rate and reclaimability
  *
- *     B/memory.current = 1.3G
- *     C/memory.current = 0.6G
- *     D/memory.current = 0
- *     E/memory.current = 0
+ * 4. Conversely, when the declared protection is undercommitted at a
+ *    given level, the distribution of the larger parental protection
+ *    budget is NOT proportional. A cgroup's protection from a sibling
+ *    is capped to its own memory.min/low setting.
  *
  * These calculations require constant tracking of the actual low usages
  * (see propagate_protected_usage()), as well as recursive calculation of
@@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * for next usage. This part is intentionally racy, but it's ok,
  * as memory.low is a best-effort mechanism.
  */
+static unsigned long effective_protection(unsigned long usage,
+					  unsigned long setting,
+					  unsigned long parent_effective,
+					  unsigned long siblings_protected)
+{
+	unsigned long protected;
+
+	protected = min(usage, setting);
+	/*
+	 * If all cgroups at this level combined claim and use more
+	 * protection than what the parent affords them, distribute
+	 * shares in proportion to utilization.
+	 *
+	 * We are using actual utilization rather than the statically
+	 * claimed protection in order to be work-conserving: claimed
+	 * but unused protection is available to siblings that would
+	 * otherwise get a smaller chunk than what they claimed.
+	 */
+	if (siblings_protected > parent_effective)
+		return protected * parent_effective / siblings_protected;
+
+	/*
+	 * Ok, utilized protection of all children is within what the
+	 * parent affords them, so we know whatever this child claims
+	 * and utilizes is effectively protected.
+	 *
+	 * If there is unprotected usage beyond this value, reclaim
+	 * will apply pressure in proportion to that amount.
+	 *
+	 * If there is unutilized protection, the cgroup will be fully
+	 * shielded from reclaim, but we do return a smaller value for
+	 * protection than what the group could enjoy in theory. This
+	 * is okay. With the overcommit distribution above, effective
+	 * protection is always dependent on how memory is actually
+	 * consumed among the siblings anyway.
+	 */
+	return protected;
+}
+
+/**
+ * mem_cgroup_protected - check if memory consumption is in the normal range
+ * @root: the top ancestor of the sub-tree being checked
+ * @memcg: the memory cgroup to check
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ *          of a top-down tree iteration, not for isolated queries.
+ *
+ * Returns one of the following:
+ *   MEMCG_PROT_NONE: cgroup memory is not protected
+ *   MEMCG_PROT_LOW: cgroup memory is protected as long as there is
+ *     an unprotected supply of reclaimable memory from other cgroups.
+ *   MEMCG_PROT_MIN: cgroup memory is protected
+ */
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *parent;
-	unsigned long emin, parent_emin;
-	unsigned long elow, parent_elow;
 	unsigned long usage;
 
 	if (mem_cgroup_disabled())
@@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!usage)
 		return MEMCG_PROT_NONE;
 
-	emin = memcg->memory.min;
-	elow = memcg->memory.low;
-
 	parent = parent_mem_cgroup(memcg);
 	/* No parent means a non-hierarchical mode on v1 memcg */
 	if (!parent)
 		return MEMCG_PROT_NONE;
 
-	if (parent == root)
-		goto exit;
-
-	parent_emin = READ_ONCE(parent->memory.emin);
-	emin = min(emin, parent_emin);
-	if (emin && parent_emin) {
-		unsigned long min_usage, siblings_min_usage;
-
-		min_usage = min(usage, memcg->memory.min);
-		siblings_min_usage = atomic_long_read(
-			&parent->memory.children_min_usage);
-
-		if (min_usage && siblings_min_usage)
-			emin = min(emin, parent_emin * min_usage /
-				   siblings_min_usage);
+	if (parent == root) {
+		memcg->memory.emin = memcg->memory.min;
+		memcg->memory.elow = memcg->memory.low;
+		goto out;
 	}
 
-	parent_elow = READ_ONCE(parent->memory.elow);
-	elow = min(elow, parent_elow);
-	if (elow && parent_elow) {
-		unsigned long low_usage, siblings_low_usage;
-
-		low_usage = min(usage, memcg->memory.low);
-		siblings_low_usage = atomic_long_read(
-			&parent->memory.children_low_usage);
+	memcg->memory.emin = effective_protection(usage,
+			memcg->memory.min, READ_ONCE(parent->memory.emin),
+			atomic_long_read(&parent->memory.children_min_usage));
 
-		if (low_usage && siblings_low_usage)
-			elow = min(elow, parent_elow * low_usage /
-				   siblings_low_usage);
-	}
-
-exit:
-	memcg->memory.emin = emin;
-	memcg->memory.elow = elow;
+	memcg->memory.elow = effective_protection(usage,
+			memcg->memory.low, READ_ONCE(parent->memory.elow),
+			atomic_long_read(&parent->memory.children_low_usage));
 
-	if (usage <= emin)
+out:
+	if (usage <= memcg->memory.emin)
 		return MEMCG_PROT_MIN;
-	else if (usage <= elow)
+	else if (usage <= memcg->memory.elow)
 		return MEMCG_PROT_LOW;
 	else
 		return MEMCG_PROT_NONE;
-- 
2.24.1
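
As a reading aid, the following standalone sketch restates the math of
effective_protection() above (in floating point for clarity; the
kernel uses integer arithmetic) and applies it to the B/C/D/E example
from the comment:

	#include <stdio.h>

	#define G (1024.0 * 1024 * 1024)

	static double effective(double usage, double setting,
				double parent_effective, double siblings_protected)
	{
		double prot = usage < setting ? usage : setting;

		if (siblings_protected > parent_effective)	/* overcommitted */
			return prot * parent_effective / siblings_protected;
		return prot;
	}

	int main(void)
	{
		/* A is at the first level of reclaim: elow = memory.low = 2G */
		/* children_low_usage = min(2,3) + min(2,1) + min(2,0) + min(0,10) = 3G */
		double pe = 2 * G, sib = 3 * G;

		printf("B=%.2fG\n", effective(2*G,  3*G, pe, sib) / G);	/* 1.33 */
		printf("C=%.2fG\n", effective(2*G,  1*G, pe, sib) / G);	/* 0.67 */
		printf("D=%.2fG\n", effective(2*G,  0*G, pe, sib) / G);	/* 0.00 */
		printf("E=%.2fG\n", effective(0*G, 10*G, pe, sib) / G);	/* 0.00 */
		return 0;
	}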




* [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2019-12-19 20:07 [PATCH v2 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
  2019-12-19 20:07 ` [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
  2019-12-19 20:07 ` [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations Johannes Weiner
@ 2019-12-19 20:07 ` Johannes Weiner
  2020-01-30 17:00   ` Michal Hocko
  2020-02-21 17:12   ` Michal Koutný
  2019-12-19 20:22 ` [PATCH v2 0/3] mm: memcontrol: recursive memory protection Tejun Heo
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 52+ messages in thread
From: Johannes Weiner @ 2019-12-19 20:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

Right now, the effective protection of any given cgroup is capped by
its own explicit memory.low setting, regardless of what the parent
says. The reasons for this are mostly historical, plus ease of
implementation: to make delegation of memory.low safe, effective
protection is the min() of all memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an
entire subtree from another without forcing the user to make explicit
protection allocations all the way to the leaf cgroups - something
that is highly undesirable in real life scenarios.

Consider memory in a data center host. At the cgroup top level, we
have a distinction between system management software and the actual
workload the system is executing. Both branches are further subdivided
into individual services, job components etc.

We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workloads wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in
order to get protection from system management software in
general. This results in very inefficient resource distribution.

To enable such use cases, this patch adds the concept of recursive
protection to the memory.low setting, while retaining the ability to
assign fixed protection in leaf groups as well.

That means that if protection is explicitly allocated among siblings,
those configured weights are followed during page reclaim just as
they are now.

However, if the memory.low set at a higher level is not fully claimed
by the children in that subtree, the "floating" remainder is applied
to each cgroup in the tree in proportion to its size. Since reclaim
pressure is applied in proportion to size as well, each child in that
tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control were
enabled at all, and the VM simply balances the memory demands
optimally within the subtree. But collectively those cgroups enjoy a
boost over the cgroups in neighboring trees.

This allows us to recursively protect one subtree (workload) from
another (system management), but let subgroups compete freely among
each other without having to assign fixed shares to each leaf.

The floating protection composes naturally with fixed
protection. Consider the following example tree:

		A            A: low = 2G
               / \          A1: low = 1G
              A1 A2         A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is
split evenly among A1 and A2. Assuming equal memory demand in both,
memory usage will converge on A1 using 1.5G and A2 using 0.5G.
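
The arithmetic at that converged state can be sketched as follows
(this mirrors the floating-protection branch added to
effective_protection() below; the overcommit branch from the previous
patch is omitted for brevity):

	/* siblings_protected = min(1.5G, 1G) + min(0.5G, 0G) = 1G */
	static double recursive_elow(double usage, double parent_usage,
				     double setting, double parent_effective,
				     double siblings_protected)
	{
		double prot = usage < setting ? usage : setting;
		double ep = prot;

		if (parent_effective > siblings_protected && usage > prot) {
			double unclaimed = parent_effective - siblings_protected;

			unclaimed *= usage - prot;	/* weight: unprotected usage */
			unclaimed /= parent_usage - siblings_protected;
			ep += unclaimed;
		}
		return ep;
	}

	/*
	 * A1: 1G + (2G - 1G) * (1.5G - 1G) / (2G - 1G) = 1.5G
	 * A2: 0G + (2G - 1G) * (0.5G - 0G) / (2G - 1G) = 0.5G
	 *
	 * Both children end up exactly covered, i.e. fully protected
	 * against outside pressure but freely competing with each other.
	 */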

There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly
high "bypass" values that are meaningfully allocated down the
tree. Such setups would rely on unclaimed protection to be discarded,
and distributing it would change the intended behavior. Be safe and
hide the new behavior behind a mount option, 'memory_recursiveprot'.
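
For reference, opting in looks like this (a hedged sketch;
/sys/fs/cgroup is the conventional mount point, not something this
patch mandates):

	#include <sys/mount.h>

	int enable_recursive_prot(void)
	{
		/* mount -t cgroup2 -o memory_recursiveprot none /sys/fs/cgroup */
		return mount("none", "/sys/fs/cgroup", "cgroup2", 0,
			     "memory_recursiveprot");
	}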

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 11 +++++
 include/linux/cgroup-defs.h             |  5 ++
 kernel/cgroup/cgroup.c                  | 17 ++++++-
 mm/memcontrol.c                         | 65 +++++++++++++++++++++++--
 4 files changed, 93 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0636bcb60b5a..e569d83621da 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -186,6 +186,17 @@ cgroup v2 currently supports the following mount options.
         modified through remount from the init namespace. The mount
         option is ignored on non-init namespace mounts.
 
+  memory_recursiveprot
+
+        Recursively apply memory.min and memory.low protection to
+        entire subtrees, without requiring explicit downward
+        propagation into leaf cgroups.  This allows protecting entire
+        subtrees from one another, while retaining free competition
+        within those subtrees.  This should have been the default
+        behavior but is a mount-option to avoid regressing setups
+        relying on the original semantics (e.g. specifying bogusly
+        high 'bypass' protection values at higher tree levels).
+
 
 Organizing Processes and Threads
 --------------------------------
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 63097cb243cb..e1fafed22db1 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -94,6 +94,11 @@ enum {
 	 * Enable legacy local memory.events.
 	 */
 	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
+
+	/*
+	 * Enable recursive subtree protection
+	 */
+	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 735af8f15f95..a2f8d2ab8dec 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1813,12 +1813,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 enum cgroup2_param {
 	Opt_nsdelegate,
 	Opt_memory_localevents,
+	Opt_memory_recursiveprot,
 	nr__cgroup2_params
 };
 
 static const struct fs_parameter_spec cgroup2_param_specs[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
+	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
 	{}
 };
 
@@ -1844,6 +1846,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
+	case Opt_memory_recursiveprot:
+		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+		return 0;
 	}
 	return -EINVAL;
 }
@@ -1860,6 +1865,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
+
+		if (root_flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
+			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 	}
 }
 
@@ -1869,6 +1879,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 		seq_puts(seq, ",nsdelegate");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
 		seq_puts(seq, ",memory_localevents");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
+		seq_puts(seq, ",memory_recursiveprot");
 	return 0;
 }
 
@@ -6364,7 +6376,10 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n");
+	return snprintf(buf, PAGE_SIZE,
+			"nsdelegate\n"
+			"memory_localevents\n"
+			"memory_recursiveprot\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9c771c4d6339..cf02e3ef3ed9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6254,6 +6254,32 @@ struct cgroup_subsys memory_cgrp_subsys = {
  *    budget is NOT proportional. A cgroup's protection from a sibling
  *    is capped to its own memory.min/low setting.
  *
+ * 5. However, to allow protecting recursive subtrees from each other
+ *    without having to declare each individual cgroup's fixed share
+ *    of the ancestor's claim to protection, any unutilized -
+ *    "floating" - protection from up the tree is distributed in
+ *    proportion to each cgroup's *usage*. This makes the protection
+ *    neutral wrt sibling cgroups and lets them compete freely over
+ *    the shared parental protection budget, but it protects the
+ *    subtree as a whole from neighboring subtrees.
+ *
+ *    Consider the following example tree:
+ *
+ *        A            A: low = 2G
+ *       / \           B: low = 1G
+ *      B   C          C: low = 0G
+ *
+ *    As memory pressure is applied, the following memory distribution
+ *    is expected (approximately):
+ *
+ *      A/memory.current = 2G
+ *      B/memory.current = 1.5G
+ *      C/memory.current = 0.5G
+ *
+ * Note that 4. and 5. are not in conflict: 4. is about protecting
+ * against immediate siblings whereas 5. is about protecting against
+ * neighboring subtrees.
+ *
  * These calculations require constant tracking of the actual low usages
  * (see propagate_protected_usage()), as well as recursive calculation of
  * effective memory.low values. But as we do call mem_cgroup_protected()
@@ -6263,11 +6289,13 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * as memory.low is a best-effort mechanism.
  */
 static unsigned long effective_protection(unsigned long usage,
+					  unsigned long parent_usage,
 					  unsigned long setting,
 					  unsigned long parent_effective,
 					  unsigned long siblings_protected)
 {
 	unsigned long protected;
+	unsigned long ep;
 
 	protected = min(usage, setting);
 	/*
@@ -6298,7 +6326,34 @@ static unsigned long effective_protection(unsigned long usage,
 	 * protection is always dependent on how memory is actually
 	 * consumed among the siblings anyway.
 	 */
-	return protected;
+	ep = protected;
+
+	/*
+	 * If the children aren't claiming (all of) the protection
+	 * afforded to them by the parent, distribute the remainder in
+	 * proportion to the (unprotected) memory of each cgroup. That
+	 * way, cgroups that aren't explicitly prioritized wrt each
+	 * other compete freely over the allowance, but they are
+	 * collectively protected from neighboring trees.
+	 *
+	 * We're using unprotected memory for the weight so that if
+	 * some cgroups DO claim explicit protection, we don't protect
+	 * the same bytes twice.
+	 */
+	if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT))
+		return ep;
+
+	if (parent_effective > siblings_protected && usage > protected) {
+		unsigned long unclaimed;
+
+		unclaimed = parent_effective - siblings_protected;
+		unclaimed *= usage - protected;
+		unclaimed /= parent_usage - siblings_protected;
+
+		ep += unclaimed;
+	}
+
+	return ep;
 }
 
 /**
@@ -6318,8 +6373,8 @@ static unsigned long effective_protection(unsigned long usage,
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg)
 {
+	unsigned long usage, parent_usage;
 	struct mem_cgroup *parent;
-	unsigned long usage;
 
 	if (mem_cgroup_disabled())
 		return MEMCG_PROT_NONE;
@@ -6344,11 +6399,13 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 		goto out;
 	}
 
-	memcg->memory.emin = effective_protection(usage,
+	parent_usage = page_counter_read(&parent->memory);
+
+	memcg->memory.emin = effective_protection(usage, parent_usage,
 			memcg->memory.min, READ_ONCE(parent->memory.emin),
 			atomic_long_read(&parent->memory.children_min_usage));
 
-	memcg->memory.elow = effective_protection(usage,
+	memcg->memory.elow = effective_protection(usage, parent_usage,
 			memcg->memory.low, READ_ONCE(parent->memory.elow),
 			atomic_long_read(&parent->memory.children_low_usage));
 
-- 
2.24.1




* Re: [PATCH v2 0/3] mm: memcontrol: recursive memory protection
  2019-12-19 20:07 [PATCH v2 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
                   ` (2 preceding siblings ...)
  2019-12-19 20:07 ` [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
@ 2019-12-19 20:22 ` Tejun Heo
  2019-12-20  4:06 ` Roman Gushchin
  2019-12-20  4:29 ` Chris Down
  5 siblings, 0 replies; 52+ messages in thread
From: Tejun Heo @ 2019-12-19 20:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu, Dec 19, 2019 at 03:07:15PM -0500, Johannes Weiner wrote:
> Changes since v1:
> - improved Changelogs based on the discussion with Roman. Thanks!
> - fix div0 when recursive & fixed protection is combined
> - fix an unused-variable compiler warning
> 
> The current memory.low (and memory.min) semantics require protection
> to be assigned to a cgroup in an untinterrupted chain from the
> top-level cgroup all the way to the leaf.
> 
> In practice, we want to protect entire cgroup subtrees from each other
> (system management software vs. workload), but we would like the VM to
> balance memory optimally *within* each subtree, without having to make
> explicit weight allocations among individual components. The current
> semantics make that impossible.

Acked-by: Tejun Heo <tj@kernel.org>

The original behavior turned out to be a significant source of
mistakes, and the use cases that would require the older behavior just
weren't there.

Thanks.

-- 
tejun



* Re: [PATCH v2 0/3] mm: memcontrol: recursive memory protection
  2019-12-19 20:07 [PATCH v2 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
                   ` (3 preceding siblings ...)
  2019-12-19 20:22 ` [PATCH v2 0/3] mm: memcontrol: recursive memory protection Tejun Heo
@ 2019-12-20  4:06 ` Roman Gushchin
  2019-12-20  4:29 ` Chris Down
  5 siblings, 0 replies; 52+ messages in thread
From: Roman Gushchin @ 2019-12-20  4:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Kernel Team

On Thu, Dec 19, 2019 at 03:07:15PM -0500, Johannes Weiner wrote:
> Changes since v1:
> - improved Changelogs based on the discussion with Roman. Thanks!
> - fix div0 when recursive & fixed protection is combined
> - fix an unused-variable compiler warning
> 
> The current memory.low (and memory.min) semantics require protection
> to be assigned to a cgroup in an untinterrupted chain from the
> top-level cgroup all the way to the leaf.
> 
> In practice, we want to protect entire cgroup subtrees from each other
> (system management software vs. workload), but we would like the VM to
> balance memory optimally *within* each subtree, without having to make
> explicit weight allocations among individual components. The current
> semantics make that impossible.
> 
> This patch series extends memory.low/min such that the knobs apply
> recursively to the entire subtree. Users can still assign explicit
> protection to subgroups, but if they don't, the protection set by the
> parent cgroup will be distributed dynamically such that children
> compete freely - as if no memory control were enabled inside the
> subtree - but enjoy protection from neighboring trees.
> 
> Patch #1 fixes an existing bug that can give a cgroup tree more
> protection than it should receive as per ancestor configuration.
> 
> Patch #2 simplifies and documents the existing code to make it easier
> to reason about the changes in the next patch.
> 
> Patch #3 finally implements recursive memory protection semantics.
> 
> Because of a risk of regressing legacy setups, the new semantics are
> hidden behind a cgroup2 mount option, 'memory_recursiveprot'.

I really like the new semantics: it looks nice and doesn't require
any new magic values aka "bypass", which have been discussed previously.
The ability to disable the protection for a particular cgroup inside
the protected sub-tree looks overvalued: I don't have any practical
example where it makes any sense. So it's totally worth sacrificing
it. Thank you for adding comments to the changelog!

Acked-by: Roman Gushchin <guro@fb.com>
for the series.

Thanks!



* Re: [PATCH v2 0/3] mm: memcontrol: recursive memory protection
  2019-12-19 20:07 [PATCH v2 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
                   ` (4 preceding siblings ...)
  2019-12-20  4:06 ` Roman Gushchin
@ 2019-12-20  4:29 ` Chris Down
  5 siblings, 0 replies; 52+ messages in thread
From: Chris Down @ 2019-12-20  4:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

Johannes Weiner writes:
>Changes since v1:
>- improved Changelogs based on the discussion with Roman. Thanks!
>- fix div0 when recursive & fixed protection is combined
>- fix an unused-variable compiler warning
>
>The current memory.low (and memory.min) semantics require protection
>to be assigned to a cgroup in an untinterrupted chain from the
>top-level cgroup all the way to the leaf.
>
>In practice, we want to protect entire cgroup subtrees from each other
>(system management software vs. workload), but we would like the VM to
>balance memory optimally *within* each subtree, without having to make
>explicit weight allocations among individual components. The current
>semantics make that impossible.
>
>This patch series extends memory.low/min such that the knobs apply
>recursively to the entire subtree. Users can still assign explicit
>protection to subgroups, but if they don't, the protection set by the
>parent cgroup will be distributed dynamically such that children
>compete freely - as if no memory control were enabled inside the
>subtree - but enjoy protection from neighboring trees.

Thanks. From experience working with these semantics in userspace, I agree
that this design makes it easier to configure the protections in a way that
is meaningful.

For the series:

Acked-by: Chris Down <chris@chrisdown.name>

>Patch #1 fixes an existing bug that can give a cgroup tree more
>protection than it should receive as per ancestor configuration.
>
>Patch #2 simplifies and documents the existing code to make it easier
>to reason about the changes in the next patch.
>
>Patch #3 finally implements recursive memory protection semantics.

Just as an off-topic aside, although I'm sure you already have it in mind, we 
should definitely make sure to clearly point this out to those in the container 
management tooling space who are in the process of moving to support/default to 
v2. For example, I wonder about CoreOS' systemwide strategy around memory 
management and whether it can benefit from this.



* Re: [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution
  2019-12-19 20:07 ` [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
@ 2020-01-30 11:49   ` Michal Hocko
  2020-02-03 21:21     ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-01-30 11:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu 19-12-19 15:07:16, Johannes Weiner wrote:
> When memory.low is overcommitted - i.e. the children claim more
> protection than their shared ancestor grants them - the allowance is
> distributed in proportion to how much each sibling uses their own
> declared protection:

Has there ever been any actual explanation of why we care about
overcommitted protection? I have gone back to the email threads where
the effective hierarchical protection was proposed.
http://lkml.kernel.org/r/20180320223353.5673-1-guro@fb.com talks about
"leaf memory cgroups are more valuable than others", which sounds ok,
but it doesn't explain why children have to overcommit the parent in
the first place. Do we have any other documentation to explain the usecase?

> 
> 	low_usage = min(memory.low, memory.current)
> 	elow = parent_elow * (low_usage / siblings_low_usage)
> 
> However, siblings_low_usage is not the sum of all low_usages. It sums
> up the usages of *only those cgroups that are within their memory.low*.
> That means that low_usage can be *bigger* than siblings_low_usage, and
> consequently the total protection afforded to the children can be
> bigger than what the ancestor grants the subtree.
> 
> Consider three groups where two are in excess of their protection:
> 
>   A/memory.low = 10G
>   A/A1/memory.low = 10G, memory.current = 20G
>   A/A2/memory.low = 10G, memory.current = 20G
>   A/A3/memory.low = 10G, memory.current =  8G
>   siblings_low_usage = 8G (only A3 contributes)
> 
>   A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
>   A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
>   A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
> 
>   (the 12.5G are capped to the explicit memory.low setting of 10G)
> 
> With that, the sum of all awarded protection below A is 30G, even
> though A only grants 10G for the entire subtree.
> 
> What does this mean in practice? A1 and A2 would still be in excess of
> their 10G allowance and would be reclaimed, whereas A3 would not. As
> they eventually drop below their protection setting, they would be
> counted in siblings_low_usage again and the error would right itself.
> 
> When reclaim was applied in a binary fashion (cgroup is reclaimed when
> it's above its protection, otherwise it's skipped) this would actually
> work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
> aggression always exclude protection"), reclaim pressure is scaled to
> how much a cgroup is above its protection. As a result this
> calculation error unduly skews pressure away from A1 and A2 toward the
> rest of the system.

Just to make sure I fully follow: the overall excess over protection is
38G in your example (for A), while reclaim would only scan 20G for
this hierarchy until the error starts to fix itself up because
siblings_low_usage gets back into sync, correct?

> But why did we do it like this in the first place?
> 
> The reasoning behind exempting groups in excess from
> siblings_low_usage was to go after them first during reclaim in an
> overcommitted subtree:
> 
>   A/memory.low = 2G, memory.current = 4G
>   A/A1/memory.low = 3G, memory.current = 2G
>   A/A2/memory.low = 1G, memory.current = 2G
> 
>   siblings_low_usage = 2G (only A1 contributes)
>   A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
>   A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
> 
> While the children combined are overcommitting A and are technically
> both at fault, A2 is actively declaring unprotected memory and we
> would like to reclaim that first.
> 
> However, while this sounds like a noble goal on the face of it, it
> doesn't make much difference in actual memory distribution: Because A
> is overcommitted, reclaim will not stop once A2 gets pushed back to
> within its allowance; we'll have to reclaim A1 either way. The end
> result is still that protection is distributed proportionally, with A1
> getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
> 
> [ If A weren't overcommitted, it wouldn't make a difference since each
>   cgroup would just get the protection it declares:
> 
>   A/memory.low = 2G, memory.current = 3G
>   A/A1/memory.low = 1G, memory.current = 1G
>   A/A2/memory.low = 1G, memory.current = 2G
> 
>   With the current calculation:
> 
>   siblings_low_usage = 1G (only A1 contributes)
>   A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
>   A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
> 
>   Including excess groups in siblings_low_usage:
> 
>   siblings_low_usage = 2G
>   A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
>   A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
> 
> Simplify the calculation and fix the proportional reclaim bug by
> including excess cgroups in siblings_low_usage.

I think it would be better to also show the initial example with the
overcommitted protection because this is the primary usecase this patch
is targeting in the first place.
   A/memory.low = 10G
   A/A1/memory.low = 10G, memory.current = 20G
   A/A2/memory.low = 10G, memory.current = 20G
   A/A3/memory.low = 10G, memory.current =  8G
   siblings_low_usage = 28G
 
   A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
   A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
   A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G

so wrt the global reclaim we have 38G of reclaimable memory with 38G
excess over A's protection. It is true that A3 will get reclaimed way
before A1 and A2 reach their protection, which might be just good enough to
satisfy the external memory pressure because it is not really likely
that the global pressure would require more than 20G dropped all at
once, right?

That being said, I can see the problem with the existing implementation
and how you work around it. I am still unclear about why we should care
about overcommitting the protection that much, and in that light the
patch makes more sense because it doesn't overflow the memory pressure
to the outside.
Longer term, though, do we really have to care about this scenario? If
yes, can we have it documented?

Do we want a fixes tag here? There are two changes that need to be
applied together to have a visible effect.

Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>


-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations
  2019-12-19 20:07 ` [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations Johannes Weiner
@ 2020-01-30 12:54   ` Michal Hocko
  2020-02-21 17:10   ` Michal Koutný
  1 sibling, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2020-01-30 12:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu 19-12-19 15:07:17, Johannes Weiner wrote:
> The effective protection of any given cgroup is a somewhat complicated
> construct that depends on the ancestor's configuration, siblings'
> configurations, as well as current memory utilization in all these
> groups. It's done this way to satisfy hierarchical delegation
> requirements while also making the configuration semantics flexible
> and expressive in complex real life scenarios.
> 
> Unfortunately, all the rules and requirements are sparsely documented,
> and the code is a little too clever in merging different scenarios
> into a single min() expression. This makes it hard to reason about the
> implementation and avoid breaking semantics when making changes to it.
> 
> This patch documents each semantic rule individually and splits out
> the handling of the overcommit case from the regular case.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

It took me a while to double-check that the new code is indeed
equivalent, but the result is indeed easier to grasp.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 190 ++++++++++++++++++++++++++----------------------
>  1 file changed, 104 insertions(+), 86 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 874a0b00f89b..9c771c4d6339 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6204,65 +6204,55 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.early_init = 0,
>  };
>  
> -/**
> - * mem_cgroup_protected - check if memory consumption is in the normal range
> - * @root: the top ancestor of the sub-tree being checked
> - * @memcg: the memory cgroup to check
> - *
> - * WARNING: This function is not stateless! It can only be used as part
> - *          of a top-down tree iteration, not for isolated queries.
> - *
> - * Returns one of the following:
> - *   MEMCG_PROT_NONE: cgroup memory is not protected
> - *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
> - *     an unprotected supply of reclaimable memory from other cgroups.
> - *   MEMCG_PROT_MIN: cgroup memory is protected
> - *
> - * @root is exclusive; it is never protected when looked at directly
> +/*
> + * This function calculates an individual cgroup's effective
> + * protection which is derived from its own memory.min/low, its
> + * parent's and siblings' settings, as well as the actual memory
> + * distribution in the tree.
>   *
> - * To provide a proper hierarchical behavior, effective memory.min/low values
> - * are used. Below is the description of how effective memory.low is calculated.
> - * Effective memory.min values is calculated in the same way.
> + * The following rules apply to the effective protection values:
>   *
> - * Effective memory.low is always equal or less than the original memory.low.
> - * If there is no memory.low overcommittment (which is always true for
> - * top-level memory cgroups), these two values are equal.
> - * Otherwise, it's a part of parent's effective memory.low,
> - * calculated as a cgroup's memory.low usage divided by sum of sibling's
> - * memory.low usages, where memory.low usage is the size of actually
> - * protected memory.
> + * 1. At the first level of reclaim, effective protection is equal to
> + *    the declared protection in memory.min and memory.low.
>   *
> - *                                             low_usage
> - * elow = min( memory.low, parent->elow * ------------------ ),
> - *                                        siblings_low_usage
> + * 2. To enable safe delegation of the protection configuration, at
> + *    subsequent levels the effective protection is capped to the
> + *    parent's effective protection.
>   *
> - * low_usage = min(memory.low, memory.current)
> + * 3. To make complex and dynamic subtrees easier to configure, the
> + *    user is allowed to overcommit the declared protection at a given
> + *    level. If that is the case, the parent's effective protection is
> + *    distributed to the children in proportion to how much protection
> + *    they have declared and how much of it they are utilizing.
>   *
> + *    This makes distribution proportional, but also work-conserving:
> + *    if one cgroup claims much more protection than it uses memory,
> + *    the unused remainder is available to its siblings.
>   *
> - * Such definition of the effective memory.low provides the expected
> - * hierarchical behavior: parent's memory.low value is limiting
> - * children, unprotected memory is reclaimed first and cgroups,
> - * which are not using their guarantee do not affect actual memory
> - * distribution.
> + *    Consider the following example tree:
>   *
> - * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
> + *        A      A/memory.low = 2G, A/memory.current = 6G
> + *       //\\
> + *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
> + *               C/memory.low = 1G  C/memory.current = 2G
> + *               D/memory.low = 0   D/memory.current = 2G
> + *               E/memory.low = 10G E/memory.current = 0
>   *
> - *     A      A/memory.low = 2G, A/memory.current = 6G
> - *    //\\
> - *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
> - *            C/memory.low = 1G  C/memory.current = 2G
> - *            D/memory.low = 0   D/memory.current = 2G
> - *            E/memory.low = 10G E/memory.current = 0
> + *    and memory pressure is applied, the following memory
> + *    distribution is expected (approximately*):
>   *
> - * and the memory pressure is applied, the following memory distribution
> - * is expected (approximately):
> + *      A/memory.current = 2G
> + *      B/memory.current = 1.3G
> + *      C/memory.current = 0.6G
> + *      D/memory.current = 0
> + *      E/memory.current = 0
>   *
> - *     A/memory.current = 2G
> + *    *assuming equal allocation rate and reclaimability
>   *
> - *     B/memory.current = 1.3G
> - *     C/memory.current = 0.6G
> - *     D/memory.current = 0
> - *     E/memory.current = 0
> + * 4. Conversely, when the declared protection is undercommitted at a
> + *    given level, the distribution of the larger parental protection
> + *    budget is NOT proportional. A cgroup's protection from a sibling
> + *    is capped to its own memory.min/low setting.
>   *
>   * These calculations require constant tracking of the actual low usages
>   * (see propagate_protected_usage()), as well as recursive calculation of
> @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   * for next usage. This part is intentionally racy, but it's ok,
>   * as memory.low is a best-effort mechanism.
>   */
> +static unsigned long effective_protection(unsigned long usage,
> +					  unsigned long setting,
> +					  unsigned long parent_effective,
> +					  unsigned long siblings_protected)
> +{
> +	unsigned long protected;
> +
> +	protected = min(usage, setting);
> +	/*
> +	 * If all cgroups at this level combined claim and use more
> +	 * protection then what the parent affords them, distribute
> +	 * shares in proportion to utilization.
> +	 *
> +	 * We are using actual utilization rather than the statically
> +	 * claimed protection in order to be work-conserving: claimed
> +	 * but unused protection is available to siblings that would
> +	 * otherwise get a smaller chunk than what they claimed.
> +	 */
> +	if (siblings_protected > parent_effective)
> +		return protected * parent_effective / siblings_protected;
> +
> +	/*
> +	 * Ok, utilized protection of all children is within what the
> +	 * parent affords them, so we know whatever this child claims
> +	 * and utilizes is effectively protected.
> +	 *
> +	 * If there is unprotected usage beyond this value, reclaim
> +	 * will apply pressure in proportion to that amount.
> +	 *
> +	 * If there is unutilized protection, the cgroup will be fully
> +	 * shielded from reclaim, but we do return a smaller value for
> +	 * protection than what the group could enjoy in theory. This
> +	 * is okay. With the overcommit distribution above, effective
> +	 * protection is always dependent on how memory is actually
> +	 * consumed among the siblings anyway.
> +	 */
> +	return protected;
> +}
> +
> +/**
> + * mem_cgroup_protected - check if memory consumption is in the normal range
> + * @root: the top ancestor of the sub-tree being checked
> + * @memcg: the memory cgroup to check
> + *
> + * WARNING: This function is not stateless! It can only be used as part
> + *          of a top-down tree iteration, not for isolated queries.
> + *
> + * Returns one of the following:
> + *   MEMCG_PROT_NONE: cgroup memory is not protected
> + *   MEMCG_PROT_LOW: cgroup memory is protected as long as there is
> + *     an unprotected supply of reclaimable memory from other cgroups.
> + *   MEMCG_PROT_MIN: cgroup memory is protected
> + */
>  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  						struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup *parent;
> -	unsigned long emin, parent_emin;
> -	unsigned long elow, parent_elow;
>  	unsigned long usage;
>  
>  	if (mem_cgroup_disabled())
> @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  	if (!usage)
>  		return MEMCG_PROT_NONE;
>  
> -	emin = memcg->memory.min;
> -	elow = memcg->memory.low;
> -
>  	parent = parent_mem_cgroup(memcg);
>  	/* No parent means a non-hierarchical mode on v1 memcg */
>  	if (!parent)
>  		return MEMCG_PROT_NONE;
>  
> -	if (parent == root)
> -		goto exit;
> -
> -	parent_emin = READ_ONCE(parent->memory.emin);
> -	emin = min(emin, parent_emin);
> -	if (emin && parent_emin) {
> -		unsigned long min_usage, siblings_min_usage;
> -
> -		min_usage = min(usage, memcg->memory.min);
> -		siblings_min_usage = atomic_long_read(
> -			&parent->memory.children_min_usage);
> -
> -		if (min_usage && siblings_min_usage)
> -			emin = min(emin, parent_emin * min_usage /
> -				   siblings_min_usage);
> +	if (parent == root) {
> +		memcg->memory.emin = memcg->memory.min;
> +		memcg->memory.elow = memcg->memory.low;
> +		goto out;
>  	}
>  
> -	parent_elow = READ_ONCE(parent->memory.elow);
> -	elow = min(elow, parent_elow);
> -	if (elow && parent_elow) {
> -		unsigned long low_usage, siblings_low_usage;
> -
> -		low_usage = min(usage, memcg->memory.low);
> -		siblings_low_usage = atomic_long_read(
> -			&parent->memory.children_low_usage);
> +	memcg->memory.emin = effective_protection(usage,
> +			memcg->memory.min, READ_ONCE(parent->memory.emin),
> +			atomic_long_read(&parent->memory.children_min_usage));
>  
> -		if (low_usage && siblings_low_usage)
> -			elow = min(elow, parent_elow * low_usage /
> -				   siblings_low_usage);
> -	}
> -
> -exit:
> -	memcg->memory.emin = emin;
> -	memcg->memory.elow = elow;
> +	memcg->memory.elow = effective_protection(usage,
> +			memcg->memory.low, READ_ONCE(parent->memory.elow),
> +			atomic_long_read(&parent->memory.children_low_usage));
>  
> -	if (usage <= emin)
> +out:
> +	if (usage <= memcg->memory.emin)
>  		return MEMCG_PROT_MIN;
> -	else if (usage <= elow)
> +	else if (usage <= memcg->memory.elow)
>  		return MEMCG_PROT_LOW;
>  	else
>  		return MEMCG_PROT_NONE;
> -- 
> 2.24.1
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2019-12-19 20:07 ` [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
@ 2020-01-30 17:00   ` Michal Hocko
  2020-02-03 21:52     ` Johannes Weiner
  2020-02-21 17:12   ` Michal Koutný
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-01-30 17:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu 19-12-19 15:07:18, Johannes Weiner wrote:
> Right now, the effective protection of any given cgroup is capped by
> its own explicit memory.low setting, regardless of what the parent
> says. The reasons for this are mostly historical and ease of
> implementation: to make delegation of memory.low safe, effective
> protection is the min() of all memory.low up the tree.
> 
> Unfortunately, this limitation makes it impossible to protect an
> entire subtree from another without forcing the user to make explicit
> protection allocations all the way to the leaf cgroups - something
> that is highly undesirable in real life scenarios.
> 
> Consider memory in a data center host. At the cgroup top level, we
> have a distinction between system management software and the actual
> workload the system is executing. Both branches are further subdivided
> into individual services, job components etc.
> 
> We want to protect the workload as a whole from the system management
> software, but that doesn't mean we want to protect and prioritize
> individual workloads wrt each other. Their memory demand can vary over
> time, and we'd want the VM to simply cache the hottest data within the
> workload subtree. Yet, the current memory.low limitations force us to
> allocate a fixed amount of protection to each workload component in
> order to get protection from system management software in
> general. This results in very inefficient resource distribution.

I do agree that configuring the reclaim protection is not an easy task,
especially in a deeper reclaim hierarchy. systemd tends to create deep
and commonly shared subtrees, so having a protected workload really
requires putting it directly into a new first-level cgroup in practice
AFAICT. That is the simpler example, though. Just imagine you want to
protect a certain user slice.

You seem to be facing a different problem though, IIUC. You know how
much memory you want to protect and you do not have to care about the
cgroup hierarchy above, but you do not know/care how to distribute that
protection among the workloads running under it. I agree that this is a
reasonable usecase.

Both of those problems, however, show that we have a more general
configurability problem for both leaf and intermediate nodes. They are
both a result of the strong requirements imposed by delegation, as you
have noted above. I am wondering whether we just went too rigid here.

Delegation points are certainly a security boundary and they should
be treated like that, but do we really need such strong containment
when the reclaim protection is under the admin's full control? Does the
admin really have to reconfigure a large part of the hierarchy to
protect a particular subtree?

I do not have a great answer on how to implement this unfortunately. The
best I could come up with was to add a "$inherited_protection" magic
value to distinguish from an explicit >=0 protection. What's the
difference? $inherited_protection would be a default and it would always
refer to the closest explicit protection up the hierarchy (with 0 as a
default if there is none defined).
        A
       / \
      B   C (low=10G)
         / \
        D   E (low = 5G)

A, B don't get any protection (low=0). C gets protection (10G) and
distributes the pressure to D, E when in excess. D inherits (low=10G)
and E overrides the protection to 5G.
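
Resolution could work roughly like this (a minimal userspace sketch of
the idea only, not an implementation, with -1 standing in for the
$inherited_protection token and whole gigabytes for the values):

#include <stdio.h>

#define INHERIT	-1L	/* stand-in for the $inherited_protection token */

struct node {
	const char *name;
	long low;		/* explicit protection in G, or INHERIT */
	struct node *parent;
};

/* Walk up to the closest explicit protection; default to 0 at the root. */
static long resolve_low(struct node *n)
{
	while (n && n->low == INHERIT)
		n = n->parent;
	return n ? n->low : 0;
}

int main(void)
{
	struct node a = { "A", INHERIT, NULL };
	struct node b = { "B", INHERIT, &a };
	struct node c = { "C", 10, &a };
	struct node d = { "D", INHERIT, &c };
	struct node e = { "E", 5, &c };
	struct node *all[] = { &a, &b, &c, &d, &e };
	int i;

	for (i = 0; i < 5; i++)
		printf("%s: low=%ldG\n", all[i]->name, resolve_low(all[i]));
	/* A: 0, B: 0, C: 10, D: 10 (inherited from C), E: 5 */
	return 0;
}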

That would help both usecases AFAICS while the delegation should be
still possible (configure the delegation point with an explicit
value). I have very likely not thought that through completely.  Does
that sound like a completely insane idea?

Or do you think that the two usecases are simply impossible to handle
at the same time?
[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution
  2020-01-30 11:49   ` Michal Hocko
@ 2020-02-03 21:21     ` Johannes Weiner
  2020-02-03 21:38       ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-03 21:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu, Jan 30, 2020 at 12:49:29PM +0100, Michal Hocko wrote:
> On Thu 19-12-19 15:07:16, Johannes Weiner wrote:
> > When memory.low is overcommitted - i.e. the children claim more
> > protection than their shared ancestor grants them - the allowance is
> > distributed in proportion to how much each sibling uses their own
> > declared protection:
> 
> Has there ever been any actual explanation of why we care about
> overcommitted protection? I went back to the email threads where
> the effective hierarchical protection was proposed.
> http://lkml.kernel.org/r/20180320223353.5673-1-guro@fb.com talks about
> some "leaf memory cgroups are more valuable than others", which sounds
> ok but doesn't explain why children have to overcommit the parent in
> the first place. Do we have any other documentation to explain the
> usecase?

I don't think we properly documented it, no. Maybe Roman can elaborate
on that a bit more since he added it.

What I can see is that it makes configuration a bit more forgiving and
easier to set up. At the top-level you tend to configure a share of
available host memory, and at the leaf level you tend to communicate a
requirement of the workload. In practice, these two aren't always
perfectly in sync, with workloads coming and going, the software being
changed and configs tuned by different teams.

Now obviously, they cannot be completely out of whack - you can't set
aside 1G of host memory, and then have 30 workloads that need 1G each
to run comfortably. But as long as the ballpark is correct, it's nice
if things keep working when you're off by a couple MB here and there.

> > 	low_usage = min(memory.low, memory.current)
> > 	elow = parent_elow * (low_usage / siblings_low_usage)
> > 
> > However, siblings_low_usage is not the sum of all low_usages. It sums
> > up the usages of *only those cgroups that are within their memory.low*.
> > That means that low_usage can be *bigger* than siblings_low_usage, and
> > consequently the total protection afforded to the children can be
> > bigger than what the ancestor grants the subtree.
> > 
> > Consider three groups where two are in excess of their protection:
> > 
> >   A/memory.low = 10G
> >   A/A1/memory.low = 10G, memory.current = 20G
> >   A/A2/memory.low = 10G, memory.current = 20G
> >   A/A3/memory.low = 10G, memory.current =  8G
> >   siblings_low_usage = 8G (only A3 contributes)
> > 
> >   A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
> >   A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
> >   A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G
> > 
> >   (the 12.5G are capped to the explicit memory.low setting of 10G)
> > 
> > With that, the sum of all awarded protection below A is 30G, when A
> > only grants 10G for the entire subtree.
> > 
> > What does this mean in practice? A1 and A2 would still be in excess of
> > their 10G allowance and would be reclaimed, whereas A3 would not. As
> > they eventually drop below their protection setting, they would be
> > counted in siblings_low_usage again and the error would right itself.
> > 
> > When reclaim was applied in a binary fashion (cgroup is reclaimed when
> > it's above its protection, otherwise it's skipped) this would actually
> > work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
> > aggression always exclude protection"), reclaim pressure is scaled to
> > how much a cgroup is above its protection. As a result this
> > calculation error unduly skews pressure away from A1 and A2 toward the
> > rest of the system.
> 
> Just to make sure I fully follow. The overall excess over protection is
> 38G in your example (for A) while the reclaim would only scan 20G for
> this hierarchy until the error starts to fix itself up because
> siblings_low_usage would get back into sync, correct?

Exactly right.
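
To make the arithmetic concrete, here is a small userspace model of
the old and new calculations (plain C for illustration only, not the
kernel code), fed with the A1/A2/A3 numbers from the changelog above:

#include <stdio.h>

struct memcg { const char *name; double low, usage; };

static double dmin(double a, double b) { return a < b ? a : b; }

/* Print elow for each child, given the parent's effective protection. */
static void distribute(struct memcg *kids, int n, double parent_elow,
		       int fixed)
{
	double siblings_low_usage = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (fixed)
			/* after the fix: everyone contributes min(usage, low) */
			siblings_low_usage += dmin(kids[i].usage, kids[i].low);
		else if (kids[i].usage <= kids[i].low)
			/* before: groups in excess contribute nothing */
			siblings_low_usage += kids[i].usage;
	}
	for (i = 0; i < n; i++) {
		double low_usage = dmin(kids[i].usage, kids[i].low);
		double elow = kids[i].low;

		if (low_usage && siblings_low_usage)
			elow = dmin(elow, parent_elow * low_usage /
				    siblings_low_usage);
		printf("%s/elow = %.2fG\n", kids[i].name, elow);
	}
}

int main(void)
{
	struct memcg kids[] = {
		{ "A1", 10, 20 }, { "A2", 10, 20 }, { "A3", 10, 8 },
	};

	distribute(kids, 3, 10, 0);	/* old: 10 + 10 + 10 = 30G handed out */
	distribute(kids, 3, 10, 1);	/* new: 3.57 + 3.57 + 2.86 ~= 10G */
	return 0;
}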

> > But why did we do it like this in the first place?
> > 
> > The reasoning behind exempting groups in excess from
> > siblings_low_usage was to go after them first during reclaim in an
> > overcommitted subtree:
> > 
> >   A/memory.low = 2G, memory.current = 4G
> >   A/A1/memory.low = 3G, memory.current = 2G
> >   A/A2/memory.low = 1G, memory.current = 2G
> > 
> >   siblings_low_usage = 2G (only A1 contributes)
> >   A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
> >   A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G
> > 
> > While the children combined are overcommitting A and are technically
> > both at fault, A2 is actively declaring unprotected memory and we
> > would like to reclaim that first.
> > 
> > However, while this sounds like a noble goal on the face of it, it
> > doesn't make much difference in actual memory distribution: Because A
> > is overcommitted, reclaim will not stop once A2 gets pushed back to
> > within its allowance; we'll have to reclaim A1 either way. The end
> > result is still that protection is distributed proportionally, with A1
> > getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.
> > 
> > [ If A weren't overcommitted, it wouldn't make a difference since each
> >   cgroup would just get the protection it declares:
> > 
> >   A/memory.low = 2G, memory.current = 3G
> >   A/A1/memory.low = 1G, memory.current = 1G
> >   A/A2/memory.low = 1G, memory.current = 2G
> > 
> >   With the current calculation:
> > 
> >   siblings_low_usage = 1G (only A1 contributes)
> >   A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
> >   A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
> > 
> >   Including excess groups in siblings_low_usage:
> > 
> >   siblings_low_usage = 2G
> >   A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
> >   A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]
> > 
> > Simplify the calculation and fix the proportional reclaim bug by
> > including excess cgroups in siblings_low_usage.
> 
> I think it would be better to also show the initial example with the
> overcommitted protection because this is the primary usecase this patch
> is targeting in the first place.
>    A/memory.low = 10G
>    A/A1/memory.low = 10G, memory.current = 20G
>    A/A2/memory.low = 10G, memory.current = 20G
>    A/A3/memory.low = 10G, memory.current =  8G
>    siblings_low_usage = 28G
>  
>    A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
>    A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
>    A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G

Good idea, I added this. Attaching the updated patch below.

> so wrt the global reclaim we have 38G of reclaimable memory with 38G
> excess over A's protection. It is true that A3 will get reclaimed way
> before A1, A2 reach their protection, which might be just good enough to
> satisfy the external memory pressure because it is not really likely
> that the global pressure would require more than 20G dropped all at
> once, right?

I think the key realization is that in this configuration, none of A1,
A2 or A3 *has* 10G of protection. So at this point it's fair to
reclaim them all at once.

> That being said, I can see the problem with the existing implementation
> and how you work around it. I am still unclear about why we should care
> about overcommit of the protection that much, and in that light the patch
> makes more sense because it doesn't let the memory pressure overflow to
> the outside.
> Longer term, though, do we really have to care about this scenario? If
> yes, can we have it documented?

Yes, I think that's a good idea. This patch is not because I have a
strong case for overcommit, but because of the bug that lets pressure
escape the subtree, and to simplify things for patch #3.

> Do we want a fixes tag here? There are two changes that need to be
> applied together to have a visible effect.
> 
> Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
> Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
> 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks for your review, Michal!

Updated patch incorporating your feedback:

---

From 46513e8afdc0f325be7007bdbb4e85a009e17dcc Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 9 Dec 2019 15:18:58 -0500
Subject: [PATCH] mm: memcontrol: fix memory.low proportional distribution

When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to how much each sibling uses their own
declared protection:

	low_usage = min(memory.low, memory.current)
	elow = parent_elow * (low_usage / siblings_low_usage)

However, siblings_low_usage is not the sum of all low_usages. It sums
up the usages of *only those cgroups that are within their memory.low*.
That means that low_usage can be *bigger* than siblings_low_usage, and
consequently the total protection afforded to the children can be
bigger than what the ancestor grants the subtree.

Consider three groups where two are in excess of their protection:

  A/memory.low = 10G
  A/A1/memory.low = 10G, memory.current = 20G
  A/A2/memory.low = 10G, memory.current = 20G
  A/A3/memory.low = 10G, memory.current =  8G
  siblings_low_usage = 8G (only A3 contributes)

  A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
  A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
  A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G

  (the 12.5G are capped to the explicit memory.low setting of 10G)

With that, the sum of all awarded protection below A is 30G, when A
only grants 10G for the entire subtree.

What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
they eventually drop below their protection setting, they would be
counted in siblings_low_usage again and the error would right itself.

When reclaim was applied in a binary fashion (cgroup is reclaimed when
it's above its protection, otherwise it's skipped) this would actually
work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
aggression always exclude protection"), reclaim pressure is scaled to
how much a cgroup is above its protection. As a result this
calculation error unduly skews pressure away from A1 and A2 toward the
rest of the system.

But why did we do it like this in the first place?

The reasoning behind exempting groups in excess from
siblings_low_usage was to go after them first during reclaim in an
overcommitted subtree:

  A/memory.low = 2G, memory.current = 4G
  A/A1/memory.low = 3G, memory.current = 2G
  A/A2/memory.low = 1G, memory.current = 2G

  siblings_low_usage = 2G (only A1 contributes)
  A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G

While the children combined are overcommitting A and are technically
both at fault, A2 is actively declaring unprotected memory and we
would like to reclaim that first.

However, while this sounds like a noble goal on the face of it, it
doesn't make much difference in actual memory distribution: Because A
is overcommitted, reclaim will not stop once A2 gets pushed back to
within its allowance; we'll have to reclaim A1 either way. The end
result is still that protection is distributed proportionally, with A1
getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.

[ If A weren't overcommitted, it wouldn't make a difference since each
  cgroup would just get the protection it declares:

  A/memory.low = 2G, memory.current = 3G
  A/A1/memory.low = 1G, memory.current = 1G
  A/A2/memory.low = 1G, memory.current = 2G

  With the current calculation:

  siblings_low_usage = 1G (only A1 contributes)
  A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G

  Including excess groups in siblings_low_usage:

  siblings_low_usage = 2G
  A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]

Simplify the calculation and fix the proportional reclaim bug by
including excess cgroups in siblings_low_usage.

After this patch, the effective memory.low distribution from the
example above would be as follows:

  A/memory.low = 10G
  A/A1/memory.low = 10G, memory.current = 20G
  A/A2/memory.low = 10G, memory.current = 20G
  A/A3/memory.low = 10G, memory.current =  8G
  siblings_low_usage = 28G

  A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
  A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
  A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G

Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c   |  4 +---
 mm/page_counter.c | 12 ++----------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..874a0b00f89b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6236,9 +6236,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * elow = min( memory.low, parent->elow * ------------------ ),
  *                                        siblings_low_usage
  *
- *             | memory.current, if memory.current < memory.low
- * low_usage = |
- *	       | 0, otherwise.
+ * low_usage = min(memory.low, memory.current)
  *
  *
  * Such definition of the effective memory.low provides the expected
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
 		return;
 
 	if (c->min || atomic_long_read(&c->min_usage)) {
-		if (usage <= c->min)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->min);
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
 	}
 
 	if (c->low || atomic_long_read(&c->low_usage)) {
-		if (usage <= c->low)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->low);
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution
  2020-02-03 21:21     ` Johannes Weiner
@ 2020-02-03 21:38       ` Roman Gushchin
  0 siblings, 0 replies; 52+ messages in thread
From: Roman Gushchin @ 2020-02-03 21:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Mon, Feb 03, 2020 at 04:21:36PM -0500, Johannes Weiner wrote:
> On Thu, Jan 30, 2020 at 12:49:29PM +0100, Michal Hocko wrote:
> > On Thu 19-12-19 15:07:16, Johannes Weiner wrote:
> > > When memory.low is overcommitted - i.e. the children claim more
> > > protection than their shared ancestor grants them - the allowance is
> > > distributed in proportion to how much each sibling uses their own
> > > declared protection:
> > 
> > Has there ever been any actual explanation of why we care about
> > overcommitted protection? I went back to the email threads where
> > the effective hierarchical protection was proposed.
> > http://lkml.kernel.org/r/20180320223353.5673-1-guro@fb.com talks about
> > some "leaf memory cgroups are more valuable than others", which sounds
> > ok but doesn't explain why children have to overcommit the parent in
> > the first place. Do we have any other documentation to explain the
> > usecase?
> 
> I don't think we properly documented it, no. Maybe Roman can elaborate
> on that a bit more since he added it.

There is simply no way to prevent it. The cgroup v2 UI allows setting any
values, and they are never changed automatically on e.g. a parent's value
change.

So the system has to perform reasonably for all possible configurations.

Before the introduction of effective protections, it was way too defensive:
it was super easy to disable the protection by exceeding the protection
value on some level. So it wasn't possible to rely on it in production.

So disabling the protection in case of over-commitment isn't a good option
either.

Thanks!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-01-30 17:00   ` Michal Hocko
@ 2020-02-03 21:52     ` Johannes Weiner
  2020-02-10 15:21       ` Johannes Weiner
  2020-02-11 16:47       ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread
From: Johannes Weiner @ 2020-02-03 21:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu, Jan 30, 2020 at 06:00:20PM +0100, Michal Hocko wrote:
> On Thu 19-12-19 15:07:18, Johannes Weiner wrote:
> > Right now, the effective protection of any given cgroup is capped by
> > its own explicit memory.low setting, regardless of what the parent
> > says. The reasons for this are mostly historical and ease of
> > implementation: to make delegation of memory.low safe, effective
> > protection is the min() of all memory.low up the tree.
> > 
> > Unfortunately, this limitation makes it impossible to protect an
> > entire subtree from another without forcing the user to make explicit
> > protection allocations all the way to the leaf cgroups - something
> > that is highly undesirable in real life scenarios.
> > 
> > Consider memory in a data center host. At the cgroup top level, we
> > have a distinction between system management software and the actual
> > workload the system is executing. Both branches are further subdivided
> > into individual services, job components etc.
> > 
> > We want to protect the workload as a whole from the system management
> > software, but that doesn't mean we want to protect and prioritize
> > individual workloads wrt each other. Their memory demand can vary over
> > time, and we'd want the VM to simply cache the hottest data within the
> > workload subtree. Yet, the current memory.low limitations force us to
> > allocate a fixed amount of protection to each workload component in
> > order to get protection from system management software in
> > general. This results in very inefficient resource distribution.
> 
> I do agree that configuring the reclaim protection is not an easy task,
> especially in a deeper reclaim hierarchy. systemd tends to create deep
> and commonly shared subtrees, so having a protected workload really
> requires putting it directly into a new first-level cgroup in practice
> AFAICT. That is the simpler example, though. Just imagine you want to
> protect a certain user slice.

Can you elaborate a bit on this? I don't quite understand the two
usecases you are contrasting here.

> You seem to be facing a different problem though, IIUC. You know how
> much memory you want to protect and you do not have to care about the
> cgroup hierarchy above, but you do not know/care how to distribute that
> protection among the workloads running under it. I agree that this is a
> reasonable usecase.

I'm not sure I'm parsing this right, but the use case is this:

When I'm running a multi-component workload on a host without any
cgrouping, the individual components compete over the host's memory
based on rate of allocation, how often they reference their memory and
so forth. It's a need-based distribution of pages, and the weight can
change as demand changes throughout the life of the workload.

If I now stick several of such workloads into a containerized
environment, I want to use memory.low to assign each workload as a
whole a chunk of memory it can use - I don't want to assign fixed-size
subchunks to each individual component of each workload! I want the
same free competition *within* the workload while setting clear rules
for competition *between* the different workloads.

[ What I can do today to achieve this is disable the memory controller
  for the subgroups. When I do this, all pages of the workload are on
  one single LRU that is protected by one single memory.low.

  But obviously I lose any detailed accounting as well.

  This patch allows me to have the same recursive protection semantics
  while retaining accounting. ]
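
For reference, the workaround amounts to something like this (a sketch
only, against a hypothetical /sys/fs/cgroup/workload layout):

#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	/* Protect the workload subtree as a whole... */
	write_file("/sys/fs/cgroup/workload/memory.low", "10G");
	/*
	 * ...and keep all of its pages on a single LRU by leaving the
	 * memory controller disabled below this point (i.e. by never
	 * writing "+memory" to this file):
	 */
	write_file("/sys/fs/cgroup/workload/cgroup.subtree_control",
		   "-memory");
	return 0;
}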

> Both of those problems, however, show that we have a more general
> configurability problem for both leaf and intermediate nodes. They are
> both a result of the strong requirements imposed by delegation, as you
> have noted above. I am wondering whether we just went too rigid here.

The requirement for delegation is that child groups cannot claim more
than the parent affords. Is that the restriction you are referring to?

> Delegation points are certainly a security boundary and they should
> be treated like that, but do we really need such strong containment
> when the reclaim protection is under the admin's full control? Does the
> admin really have to reconfigure a large part of the hierarchy to
> protect a particular subtree?
> 
> I do not have a great answer on how to implement this unfortunately. The
> best I could come up with was to add a "$inherited_protection" magic
> value to distinguish from an explicit >=0 protection. What's the
> difference? $inherited_protection would be a default and it would always
> refer to the closest explicit protection up the hierarchy (with 0 as a
> default if there is none defined).
>         A
>        / \
>       B   C (low=10G)
>          / \
>         D   E (low = 5G)
> 
> A, B don't get any protection (low=0). C gets protection (10G) and
> distributes the pressure to D, E when in excess. D inherits (low=10G)
> and E overrides the protection to 5G.
> 
> That would help both usecases AFAICS while the delegation should be
> still possible (configure the delegation point with an explicit
> value). I have very likely not thought that through completely.  Does
> that sound like a completely insane idea?
> 
> Or do you think that the two usecases are simply impossible to handle
> at the same time?

Doesn't my patch accomplish this?

Any cgroup or group of cgroups still cannot claim more than the
ancestral protection for the subtree. If a cgroup says 10G, the sum of
all children's protection will never exceed that. This ensures
delegation is safe.

But once an ancestor does declare protection, that value automatically
applies recursively to the entire subtree. The subtree can *choose* to
assign fixed shares of that to its cgroups. But it doesn't have to.

While designing this, I also thought about a new magic token for the
memory.low to distinguish "inherit" from 0 (or other values smaller
than what the parent affords). But the only thing it buys you is that
you allow children to *opt out* of protection the parent prescribes.
And I'm not sure that's a valid usecase: If the parent specifies that
a subtree shall receive a protection of 10G compared to other subtrees
at that level, does it really make sense for children to opt out and
reduce the subtree's protection?

IMO, no. A cgroup should be able to influence competition among its
children, it should not be able to change the rules of competition
among its ancestors.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-03 21:52     ` Johannes Weiner
@ 2020-02-10 15:21       ` Johannes Weiner
  2020-02-11 16:47       ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2020-02-10 15:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

Hi Michal,

did you have any more thoughts on this?

On Mon, Feb 03, 2020 at 04:52:02PM -0500, Johannes Weiner wrote:
> On Thu, Jan 30, 2020 at 06:00:20PM +0100, Michal Hocko wrote:
> > On Thu 19-12-19 15:07:18, Johannes Weiner wrote:
> > > Right now, the effective protection of any given cgroup is capped by
> > > its own explicit memory.low setting, regardless of what the parent
> > > says. The reasons for this are mostly historical and ease of
> > > implementation: to make delegation of memory.low safe, effective
> > > protection is the min() of all memory.low up the tree.
> > > 
> > > Unfortunately, this limitation makes it impossible to protect an
> > > entire subtree from another without forcing the user to make explicit
> > > protection allocations all the way to the leaf cgroups - something
> > > that is highly undesirable in real life scenarios.
> > > 
> > > Consider memory in a data center host. At the cgroup top level, we
> > > have a distinction between system management software and the actual
> > > workload the system is executing. Both branches are further subdivided
> > > into individual services, job components etc.
> > > 
> > > We want to protect the workload as a whole from the system management
> > > software, but that doesn't mean we want to protect and prioritize
> > > individual workloads wrt each other. Their memory demand can vary over
> > > time, and we'd want the VM to simply cache the hottest data within the
> > > workload subtree. Yet, the current memory.low limitations force us to
> > > allocate a fixed amount of protection to each workload component in
> > > order to get protection from system management software in
> > > general. This results in very inefficient resource distribution.
> > 
> > I do agree that configuring the reclaim protection is not an easy task,
> > especially in a deeper reclaim hierarchy. systemd tends to create deep
> > and commonly shared subtrees, so having a protected workload really
> > requires putting it directly into a new first-level cgroup in practice
> > AFAICT. That is the simpler example, though. Just imagine you want to
> > protect a certain user slice.
> 
> Can you elaborate a bit on this? I don't quite understand the two
> usecases you are contrasting here.
> 
> > You seem to be facing a different problem though, IIUC. You know how
> > much memory you want to protect and you do not have to care about the
> > cgroup hierarchy above, but you do not know/care how to distribute that
> > protection among the workloads running under it. I agree that this is a
> > reasonable usecase.
> 
> I'm not sure I'm parsing this right, but the use case is this:
> 
> When I'm running a multi-component workload on a host without any
> cgrouping, the individual components compete over the host's memory
> based on rate of allocation, how often they reference their memory and
> so forth. It's a need-based distribution of pages, and the weight can
> change as demand changes throughout the life of the workload.
> 
> If I now stick several of such workloads into a containerized
> environment, I want to use memory.low to assign each workload as a
> whole a chunk of memory it can use - I don't want to assign fixed-size
> subchunks to each individual component of each workload! I want the
> same free competition *within* the workload while setting clear rules
> for competition *between* the different workloads.
> 
> [ What I can do today to achieve this is disable the memory controller
>   for the subgroups. When I do this, all pages of the workload are on
>   one single LRU that is protected by one single memory.low.
> 
>   But obviously I lose any detailed accounting as well.
> 
>   This patch allows me to have the same recursive protection semantics
>   while retaining accounting. ]
> 
> > Both of those problems, however, show that we have a more general
> > configurability problem for both leaf and intermediate nodes. They are
> > both a result of the strong requirements imposed by delegation, as you
> > have noted above. I am wondering whether we just went too rigid here.
> 
> The requirement for delegation is that child groups cannot claim more
> than the parent affords. Is that the restriction you are referring to?
> 
> > Delegation points are certainly a security boundary and they should
> > be treated like that, but do we really need such strong containment
> > when the reclaim protection is under the admin's full control? Does the
> > admin really have to reconfigure a large part of the hierarchy to
> > protect a particular subtree?
> > 
> > I do not have a great answer on how to implement this unfortunately. The
> > best I could come up with was to add a "$inherited_protection" magic
> > value to distinguish from an explicit >=0 protection. What's the
> > difference? $inherited_protection would be a default and it would always
> > refer to the closest explicit protection up the hierarchy (with 0 as a
> > default if there is none defined).
> >         A
> >        / \
> >       B   C (low=10G)
> >          / \
> >         D   E (low = 5G)
> > 
> > A, B don't get any protection (low=0). C gets protection (10G) and
> > distributes the pressure to D, E when in excess. D inherits (low=10G)
> > and E overrides the protection to 5G.
> > 
> > That would help both usecases AFAICS while the delegation should be
> > still possible (configure the delegation point with an explicit
> > value). I have very likely not thought that through completely.  Does
> > that sound like a completely insane idea?
> > 
> > Or do you think that the two usecases are simply impossible to handle
> > at the same time?
> 
> Doesn't my patch accomplish this?
> 
> Any cgroup or group of cgroups still cannot claim more than the
> ancestral protection for the subtree. If a cgroup says 10G, the sum of
> all children's protection will never exceed that. This ensures
> delegation is safe.
> 
> But once an ancestor does declare protection, that value automatically
> applies recursively to the entire subtree. The subtree can *choose* to
> assign fixed shares of that to its cgroups. But it doesn't have to.
> 
> While designing this, I also thought about a new magic token for the
> memory.low to distinguish "inherit" from 0 (or other values smaller
> than what the parent affords). But the only thing it buys you is that
> you allow children to *opt out* of protection the parent prescribes.
> And I'm not sure that's a valid usecase: If the parent specifies that
> a subtree shall receive a protection of 10G compared to other subtrees
> at that level, does it really make sense for children to opt out and
> reduce the subtree's protection?
> 
> IMO, no. A cgroup should be able to influence competition among its
> children, it should not be able to change the rules of competition
> among its ancestors.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-03 21:52     ` Johannes Weiner
  2020-02-10 15:21       ` Johannes Weiner
@ 2020-02-11 16:47       ` Michal Hocko
  2020-02-12 17:08         ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-11 16:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Mon 03-02-20 16:52:01, Johannes Weiner wrote:
> On Thu, Jan 30, 2020 at 06:00:20PM +0100, Michal Hocko wrote:
> > On Thu 19-12-19 15:07:18, Johannes Weiner wrote:
> > > Right now, the effective protection of any given cgroup is capped by
> > > its own explicit memory.low setting, regardless of what the parent
> > > says. The reasons for this are mostly historical and ease of
> > > implementation: to make delegation of memory.low safe, effective
> > > protection is the min() of all memory.low up the tree.
> > > 
> > > Unfortunately, this limitation makes it impossible to protect an
> > > entire subtree from another without forcing the user to make explicit
> > > protection allocations all the way to the leaf cgroups - something
> > > that is highly undesirable in real life scenarios.
> > > 
> > > Consider memory in a data center host. At the cgroup top level, we
> > > have a distinction between system management software and the actual
> > > workload the system is executing. Both branches are further subdivided
> > > into individual services, job components etc.
> > > 
> > > We want to protect the workload as a whole from the system management
> > > software, but that doesn't mean we want to protect and prioritize
> > > individual workloads wrt each other. Their memory demand can vary over
> > > time, and we'd want the VM to simply cache the hottest data within the
> > > workload subtree. Yet, the current memory.low limitations force us to
> > > allocate a fixed amount of protection to each workload component in
> > > order to get protection from system management software in
> > > general. This results in very inefficient resource distribution.
> > 
> > I do agree that configuring the reclaim protection is not an easy task,
> > especially in a deeper reclaim hierarchy. systemd tends to create deep
> > and commonly shared subtrees, so having a protected workload really
> > requires putting it directly into a new first-level cgroup in practice
> > AFAICT. That is the simpler example, though. Just imagine you want to
> > protect a certain user slice.
> 
> Can you elaborate a bit on this? I don't quite understand the two
> usecases you are contrasting here.

Essentially, this is about two different usecases. The first one is about
protecting a hierarchy and spreading the protection among different
workloads, and the second is how to protect an inner memcg without
configuring protection all the way up the hierarchy.
 
> > You seem to be facing a different problem though, IIUC. You know how
> > much memory you want to protect and you do not have to care about the
> > cgroup hierarchy above, but you do not know/care how to distribute that
> > protection among the workloads running under it. I agree that this is a
> > reasonable usecase.
> 
> I'm not sure I'm parsing this right, but the use case is this:
> 
> When I'm running a multi-component workload on a host without any
> cgrouping, the individual components compete over the host's memory
> based on rate of allocation, how often they reference their memory and
> so forth. It's a need-based distribution of pages, and the weight can
> change as demand changes throughout the life of the workload.
> 
> If I now stick several of such workloads into a containerized
> environment, I want to use memory.low to assign each workload as a
> whole a chunk of memory it can use - I don't want to assign fixed-size
> subchunks to each individual component of each workload! I want the
> same free competition *within* the workload while setting clear rules
> for competition *between* the different workloads.

Yeah, that matches my understanding of the problem you are trying to
solve here.
> 
> [ What I can do today to achieve this is disable the memory controller
>   for the subgroups. When I do this, all pages of the workload are on
>   one single LRU that is protected by one single memory.low.
> 
>   But obviously I lose any detailed accounting as well.
> 
>   This patch allows me to have the same recursive protection semantics
>   while retaining accounting. ]
> 
> > Both of those problems, however, show that we have a more general
> > configurability problem for both leaf and intermediate nodes. They are
> > both a result of the strong requirements imposed by delegation, as you
> > have noted above. I am wondering whether we just went too rigid here.
> 
> The requirement for delegation is that child groups cannot claim more
> than the parent affords. Is that the restriction you are referring to?

yes.

> > Delegation points are certainly a security boundary and they should
> > be treated like that, but do we really need such strong containment
> > when the reclaim protection is under the admin's full control? Does the
> > admin really have to reconfigure a large part of the hierarchy to
> > protect a particular subtree?
> > 
> > I do not have a great answer on how to implement this unfortunately. The
> > best I could come up with was to add a "$inherited_protection" magic
> > value to distinguish from an explicit >=0 protection. What's the
> > difference? $inherited_protection would be a default and it would always
> > refer to the closest explicit protection up the hierarchy (with 0 as a
> > default if there is none defined).
> >         A
> >        / \
> >       B   C (low=10G)
> >          / \
> >         D   E (low = 5G)
> > 
> > A, B don't get any protection (low=0). C gets protection (10G) and
> > distributes the pressure to D, E when in excess. D inherits (low=10G)
> > and E overrides the protection to 5G.
> > 
> > That would help both usecases AFAICS while the delegation should be
> > still possible (configure the delegation point with an explicit
> > value). I have very likely not thought that through completely.  Does
> > that sound like a completely insane idea?
> > 
> > Or do you think that the two usecases are simply impossible to handle
> > at the same time?
> 
> Doesn't my patch accomplish this?

Unless I am missing something, I am afraid it doesn't. Say you have a
default systemd cgroup deployment (aka a deeper cgroup hierarchy with
slices and scopes) and now you want to grant reclaim protection to a
leaf cgroup (or even a whole slice that is not really important). All the
hierarchy up the tree has the protection set to 0 by default, right? You
simply cannot get that protection. You would need to configure the
protection up the hierarchy and that is really cumbersome.

> Any cgroup or group of cgroups still cannot claim more than the
> ancestral protection for the subtree. If a cgroup says 10G, the sum of
> all children's protection will never exceed that. This ensures
> delegation is safe.

Right. And the delegation usecase really requires that. No question about
that. I am merely arguing that if you do not delegate, then this is way
too strict.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-11 16:47       ` Michal Hocko
@ 2020-02-12 17:08         ` Johannes Weiner
  2020-02-13  7:40           ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-12 17:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue, Feb 11, 2020 at 05:47:53PM +0100, Michal Hocko wrote:
> Unless I am missing something, I am afraid it doesn't. Say you have a
> default systemd cgroup deployment (aka a deeper cgroup hierarchy with
> slices and scopes) and now you want to grant reclaim protection to a
> leaf cgroup (or even a whole slice that is not really important). All the
> hierarchy up the tree has the protection set to 0 by default, right? You
> simply cannot get that protection. You would need to configure the
> protection up the hierarchy and that is really cumbersome.

Okay, I think I know what you mean. Let's say you have a tree like
this:

                          A
                         / \
                        B1  B2
                       / \   \
                      C1 C2   C3

and there is no actual delegation point - everything belongs to the
same user / trust domain. C1 sets memory.low to 10G, but its parents
set nothing. You're saying we should honor the 10G protection during
global and limit reclaims anywhere in the tree?

Now let's consider there is a delegation point at B1: we set up and
trust B1, but not its children. What effect would the C1 protection
have then? Would we ignore it during global and A reclaim, but honor
it when there is B1 limit reclaim?

Doing an explicit downward propagation from the root to C1 *could* be
tedious, but I can't think of a scenario where it's completely
impossible. Especially because we allow proportional distribution when
the limit is overcommitted and you don't have to be 100% accurate.
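
For instance, protecting C1 in the tree above takes one write per level
on its path. A trivial sketch with hypothetical names - it only prints
the writes it would do:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Hypothetical path from the root to the cgroup to protect. */
	const char *levels[] = { "A", "B1", "C1" };
	char path[256] = "/sys/fs/cgroup";
	size_t i;

	/*
	 * Each ancestor passes the protection down; ancestors that
	 * already carry protection for other children would need the
	 * sum instead of the plain value.
	 */
	for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
		strcat(path, "/");
		strcat(path, levels[i]);
		printf("echo 10G > %s/memory.low\n", path);
	}
	return 0;
}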

And the clarity that comes with being explicit is an asset too,
IMO. Since it has an effect at the reclaim level, it's not a bad thing
to have that effect *visible* in the settings at that level as well:
the protected memory doesn't come out of thin air, it's delegated down
from the top where memory pressure originates.

My patch is different. It allows a configuration that simply isn't
possible today: protecting C1 and C2 from C3, without having to
protect C1 and C2 from each other.

So I don't think requiring an uninterrupted, authorized chain of
protection from the top is necessarily wrong. In fact, I think it has
benefits. But requiring the protection chain to go all the way to the
leaves for it to have any effect, that is a real problem, and it can't
be worked around.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-12 17:08         ` Johannes Weiner
@ 2020-02-13  7:40           ` Michal Hocko
  2020-02-13 13:23             ` Johannes Weiner
  2020-02-13 13:53             ` Tejun Heo
  0 siblings, 2 replies; 52+ messages in thread
From: Michal Hocko @ 2020-02-13  7:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Wed 12-02-20 12:08:26, Johannes Weiner wrote:
> On Tue, Feb 11, 2020 at 05:47:53PM +0100, Michal Hocko wrote:
> > Unless I am missing something, I am afraid it doesn't. Say you have a
> > default systemd cgroup deployment (aka a deeper cgroup hierarchy with
> > slices and scopes) and now you want to grant reclaim protection to a
> > leaf cgroup (or even a whole slice that is not really important). All the
> > hierarchy up the tree has the protection set to 0 by default, right? You
> > simply cannot get that protection. You would need to configure the
> > protection up the hierarchy and that is really cumbersome.
> 
> Okay, I think I know what you mean. Let's say you have a tree like
> this:
> 
>                           A
>                          / \
>                         B1  B2
>                        / \   \
>                       C1 C2   C3
> 
> and there is no actual delegation point - everything belongs to the
> same user / trust domain. C1 sets memory.low to 10G, but its parents
> set nothing. You're saying we should honor the 10G protection during
> global and limit reclaims anywhere in the tree?

No, only in C1, which sets the limit, because that is the working
set we want to protect.

> Now let's consider there is a delegation point at B1: we set up and
> trust B1, but not its children. What effect would the C1 protection
> have then? Would we ignore it during global and A reclaim, but honor
> it when there is B1 limit reclaim?

In the scheme with the inherited protection it would act as the gate
and require an explicit low limit setup defaulting to 0 if none is
specified.

> Doing an explicit downward propagation from the root to C1 *could* be
> tedious, but I can't think of a scenario where it's completely
> impossible. Especially because we allow proportional distribution when
> the limit is overcommitted and you don't have to be 100% accurate.

So let's see how that works in practice, say a multi-workload setup
with complex/deep cgroup hierarchies (e.g. your above example). No
delegation point this time.

C1 asks for low=1G while using 500M, C3 asks for low=100M while using
80M. B1 and B2 are completely independent workloads, and the same
applies to C2, which doesn't ask for any protection at all. C2 uses
100M. Now the admin has to propagate protection upwards, so B1 low=1G,
B2 low=100M and A low=1G, right? Let's say we have a global reclaim due
to external pressure that originates from outside of the A hierarchy
(it is not overcommitted on the protection).

Unless I miss something, C2 would get protection even though nobody
asked for it.

> And the clarity that comes with being explicit is an asset too,
> IMO. Since it has an effect at the reclaim level, it's not a bad thing
> to have that effect *visible* in the settings at that level as well:
> the protected memory doesn't come out of thin air, it's delegated down
> from the top where memory pressure originates.

So how are we going to deal with hierarchies where the actual workload
of interest is a leaf deeper in the hierarchy and the upper levels of
the hierarchy are shared between unrelated workloads? Look at how
systemd organizes the system into cgroups, for example (slices vs.
scopes), and say you want to add protection to a single user or a
service.
 
> My patch is different. It allows a configuration that simply isn't
> possible today: protecting C1 and C2 from C3, without having to
> protect C1 and C2 from each other.
> 
> So I don't think requiring an uninterrupted, authorized chain of
> protection from the top is necessarily wrong. In fact, I think it has
> benefits. But requiring the protection chain to go all the way to the
> leaves for it to have any effect, that is a real problem, and it can't
> be worked around.

Yes, I do agree that the problem you are dealing with is slightly
different. My main point was that you are already introducing new
semantics which are not fully backward compatible, and I figured we
have more problems in this area, so maybe we can introduce semantics
that handle both of the above-mentioned scenarios while doing that.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13  7:40           ` Michal Hocko
@ 2020-02-13 13:23             ` Johannes Weiner
  2020-02-13 15:46               ` Michal Hocko
  2020-02-13 13:53             ` Tejun Heo
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-13 13:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu, Feb 13, 2020 at 08:40:49AM +0100, Michal Hocko wrote:
> On Wed 12-02-20 12:08:26, Johannes Weiner wrote:
> > On Tue, Feb 11, 2020 at 05:47:53PM +0100, Michal Hocko wrote:
> > > Unless I am missing something, I am afraid it doesn't. Say you have a
> > > default systemd cgroup deployment (aka a deeper cgroup hierarchy with
> > > slices and scopes) and now you want to grant reclaim protection to a
> > > leaf cgroup (or even a whole slice that is not really important). All the
> > > hierarchy up the tree has the protection set to 0 by default, right? You
> > > simply cannot get that protection. You would need to configure the
> > > protection up the hierarchy and that is really cumbersome.
> > 
> > Okay, I think I know what you mean. Let's say you have a tree like
> > this:
> > 
> >                           A
> >                          / \
> >                         B1  B2
> >                        / \   \
> >                       C1 C2   C3
> > 
> > and there is no actual delegation point - everything belongs to the
> > same user / trust domain. C1 sets memory.low to 10G, but its parents
> > set nothing. You're saying we should honor the 10G protection during
> > global and limit reclaims anywhere in the tree?
> 
> No, only in C1, which sets the limit, because that is the working
> set we want to protect.
> 
> > Now let's consider there is a delegation point at B1: we set up and
> > trust B1, but not its children. What effect would the C1 protection
> > have then? Would we ignore it during global and A reclaim, but honor
> > it when there is B1 limit reclaim?
> 
> In the scheme with the inherited protection it would act as the gate
> and require an explicit low limit setup defaulting to 0 if none is
> specified.
> 
> > Doing an explicit downward propagation from the root to C1 *could* be
> > tedious, but I can't think of a scenario where it's completely
> > impossible. Especially because we allow proportional distribution when
> > the limit is overcommitted and you don't have to be 100% accurate.
> 
> So let's see how that works in practice, say a multi-workload setup
> with complex/deep cgroup hierarchies (e.g. your above example). No
> delegation point this time.
> 
> C1 asks for low=1G while using 500M, C3 low=100M using 80M.  B1 and
> B2 are completely independent workloads and the same applies to C2 which
> doesn't ask for any protection at all? C2 uses 100M. Now the admin has
> to propagate protection upwards so B1 low=1G, B2 low=100M and A low=1G,
> right? Let's say we have a global reclaim due to external pressure that
> originates from outside of A hierarchy (it is not overcommited on the
> protection).
> 
> Unless I miss something C2 would get a protection even though nobody
> asked for it.

Good observation, but I think you spotted an unintentional side effect
of how I implemented the "floating protection" calculation rather than
a design problem.

My patch still allows explicit downward propagation. So if B1 sets up
1G, and C1 explicitly claims those 1G (low>=1G, usage>=1G), C2 does
NOT get any protection. There is no "floating" protection left in B1
that could get to C2.

However, to calculate the float, I'm using the utilized protection
counters (children_low_usage) to determine what is "claimed". Mostly
for convenience because they were already there. In your example, C1
is only utilizing 500M of its protection, leaving 500M in the float
that will go toward C2. I agree that's undesirable.

But it's fixable by adding a hierarchical children_low counter that
tracks the static configuration, and using that to calculate floating
protection instead of the dynamic children_low_usage.
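
Roughly like this, as a sketch (children_low here is the hypothetical
static counter, not something that exists in the current code):

	/*
	 * Float available in the parent for children without explicit
	 * protection. Basing it on the static sum of the children's
	 * configured memory.low (children_low, hypothetical) rather
	 * than on the utilized sum (children_low_usage) keeps C1's
	 * unused 500M from spilling over to C2 in your example.
	 */
	static unsigned long unclaimed_float(unsigned long parent_low,
					     unsigned long children_low)
	{
		return parent_low > children_low ?
		       parent_low - children_low : 0;
	}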

That way you can propagate protection from A to C1 without it spilling
to anybody else unintentionally, regardless of how much B1 and C1 are
actually *using*.

Does that sound reasonable?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13  7:40           ` Michal Hocko
  2020-02-13 13:23             ` Johannes Weiner
@ 2020-02-13 13:53             ` Tejun Heo
  2020-02-13 15:47               ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2020-02-13 13:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello, Michal.

On Thu, Feb 13, 2020 at 08:40:49AM +0100, Michal Hocko wrote:
> So how are we going to deal with hierarchies where the actual workload
> of interest is a leaf deeper in the hierarchy and the upper levels of
> the hierarchy are shared between unrelated workloads?  Look at how
> systemd organizes system into cgroups for example (slices vs. scopes)
> and say you want to add a protection to a single user or a service.

The above scenario, where the descendants of a cgroup behave as if
unrelated to each other, sounds plausible in theory but doesn't really
work in practice, because memory management is closely tied with IO
and all IO control mechanisms are strictly hierarchical and arbitrate
level-by-level.

A workload's memory footprint can't be protected without protecting
its IOs, and a descendant cgroup's IOs can't be protected without
protecting its ancestors, so implementing that in the memory
controller in isolation, while doable, won't serve many practical
purposes. It just ends up creating cgroups which are punished on
memory while greedily burning up IOs.

The same pattern holds for CPU control, so for any practical
configuration, the hierarchy layout has to follow the resource
distribution hierarchy of the system - it's what the whole thing is
for, after all. The existing memory.min/low semantics mostly come from
failing to recognize that they are closely intertwined with IO and
resemble weight-based control mechanisms, and from rather blindly
copying memory.high/max behaviors, for which, btw, individual
configurations may make sense.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 13:23             ` Johannes Weiner
@ 2020-02-13 15:46               ` Michal Hocko
  2020-02-13 17:41                 ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-13 15:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu 13-02-20 08:23:17, Johannes Weiner wrote:
> On Thu, Feb 13, 2020 at 08:40:49AM +0100, Michal Hocko wrote:
> > On Wed 12-02-20 12:08:26, Johannes Weiner wrote:
> > > On Tue, Feb 11, 2020 at 05:47:53PM +0100, Michal Hocko wrote:
> > > > Unless I am missing something then I am afraid it doesn't. Say you have a
> > > > default systemd cgroup deployment (aka deeper cgroup hierarchy with
> > > > slices and scopes) and now you want to grant a reclaim protection on a
> > > > leaf cgroup (or even a whole slice that is not really important). All the
> > > > hierarchy up the tree has the protection set to 0 by default, right? You
> > > > simply cannot get that protection. You would need to configure the
> > > > protection up the hierarchy and that is really cumbersome.
> > > 
> > > Okay, I think I know what you mean. Let's say you have a tree like
> > > this:
> > > 
> > >                           A
> > >                          / \
> > >                         B1  B2
> > >                        / \   \
> > >                       C1 C2   C3
> > > 
> > > and there is no actual delegation point - everything belongs to the
> > > same user / trust domain. C1 sets memory.low to 10G, but its parents
> > > set nothing. You're saying we should honor the 10G protection during
> > > global and limit reclaims anywhere in the tree?
> > 
> > No, only in C1, which sets the limit, because that is the working
> > set we want to protect.
> > 
> > > Now let's consider there is a delegation point at B1: we set up and
> > > trust B1, but not its children. What effect would the C1 protection
> > > have then? Would we ignore it during global and A reclaim, but honor
> > > it when there is B1 limit reclaim?
> > 
> > In the scheme with the inherited protection it would act as the gate
> > and require an explicit low limit setup defaulting to 0 if none is
> > specified.
> > 
> > > Doing an explicit downward propagation from the root to C1 *could* be
> > > tedious, but I can't think of a scenario where it's completely
> > > impossible. Especially because we allow proportional distribution when
> > > the limit is overcommitted and you don't have to be 100% accurate.
> > 
> > So let's see how that works in practice, say a multi-workload setup
> > with complex/deep cgroup hierarchies (e.g. your above example). No
> > delegation point this time.
> > 
> > C1 asks for low=1G while using 500M, C3 low=100M using 80M.  B1 and
> > B2 are completely independent workloads, and the same applies to C2,
> > which doesn't ask for any protection at all. C2 uses 100M. Now the
> > admin has to propagate protection upwards so B1 low=1G, B2 low=100M
> > and A low=1G, right? Let's say we have a global reclaim due to
> > external pressure that originates from outside of the A hierarchy (it
> > is not overcommitted on the protection).
> > 
> > Unless I miss something, C2 would get protection even though nobody
> > asked for it.
> 
> Good observation, but I think you spotted an unintentional side effect
> of how I implemented the "floating protection" calculation rather than
> a design problem.
> 
> My patch still allows explicit downward propagation. So if B1 sets up
> 1G, and C1 explicitly claims those 1G (low>=1G, usage>=1G), C2 does
> NOT get any protection. There is no "floating" protection left in B1
> that could get to C2.

Yeah, the saturated protection works reasonably AFAICS.
 
> However, to calculate the float, I'm using the utilized protection
> counters (children_low_usage) to determine what is "claimed". Mostly
> for convenience because they were already there. In your example, C1
> is only utilizing 500M of its protection, leaving 500M in the float
> that will go toward C2. I agree that's undesirable.
> 
> But it's fixable by adding a hierarchical children_low counter that
> tracks the static configuration, and using that to calculate floating
> protection instead of the dynamic children_low_usage.
> 
> That way you can propagate protection from A to C1 without it spilling
> to anybody else unintentionally, regardless of how much B1 and C1 are
> actually *using*.
> 
> Does that sound reasonable?

Please post a patch and I will think about it more to see whether I
can spot more problems. I am worried this is getting more and more
complex and harder to wrap one's head around.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 13:53             ` Tejun Heo
@ 2020-02-13 15:47               ` Michal Hocko
  2020-02-13 15:52                 ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-13 15:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

On Thu 13-02-20 08:53:48, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, Feb 13, 2020 at 08:40:49AM +0100, Michal Hocko wrote:
> > So how are we going to deal with hierarchies where the actual workload
> > of interest is a leaf deeper in the hierarchy and the upper levels of
> > the hierarchy are shared between unrelated workloads?  Look at how
> > systemd organizes system into cgroups for example (slices vs. scopes)
> > and say you want to add a protection to a single user or a service.
> 
> The above scenario, where the descendants of a cgroup behave as if
> unrelated to each other, sounds plausible in theory but doesn't really
> work in practice, because memory management is closely tied with IO
> and all IO control mechanisms are strictly hierarchical and arbitrate
> level-by-level.
>
> A workload's memory footprint can't be protected without protecting
> its IOs, and a descendant cgroup's IOs can't be protected without
> protecting its ancestors, so implementing that in the memory
> controller in isolation, while doable, won't serve many practical
> purposes. It just ends up creating cgroups which are punished on
> memory while greedily burning up IOs.
> 
> The same pattern holds for CPU control, so for any practical
> configuration, the hierarchy layout has to follow the resource
> distribution hierarchy of the system - it's what the whole thing is
> for, after all. The existing memory.min/low semantics mostly come from
> failing to recognize that they are closely intertwined with IO and
> resemble weight-based control mechanisms, and from rather blindly
> copying memory.high/max behaviors, for which, btw, individual
> configurations may make sense.

Well, I would tend to agree but I can see an existing cgroup hierarchy
imposed by systemd and that is more about "logical" organization of
processes based on their purpose than anything resembling resources.
So what can we do about that to make it all work?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 15:47               ` Michal Hocko
@ 2020-02-13 15:52                 ` Tejun Heo
  2020-02-13 16:36                   ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2020-02-13 15:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello,

On Thu, Feb 13, 2020 at 04:47:31PM +0100, Michal Hocko wrote:
> Well, I would tend to agree but I can see an existing cgroup hierarchy
> imposed by systemd and that is more about "logical" organization of
> processes based on their purpose than anything resembling resources.
> So what can we do about that to make it all work?

systemd right now isn't configuring any resource control by default,
so I'm not sure why it is relevant in this discussion. You gotta
change the layout to configure resource control no matter what and
it's pretty easy to do. systemd folks are planning to integrate higher
level resource control features, so my expectation is that the default
layout is gonna change as it develops.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 15:52                 ` Tejun Heo
@ 2020-02-13 16:36                   ` Michal Hocko
  2020-02-13 16:57                     ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-13 16:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

On Thu 13-02-20 10:52:49, Tejun Heo wrote:
> Hello,
> 
> On Thu, Feb 13, 2020 at 04:47:31PM +0100, Michal Hocko wrote:
> > Well, I would tend to agree but I can see an existing cgroup hierarchy
> > imposed by systemd and that is more about "logical" organization of
> > processes based on their purpose than anything resembling resources.
> > So what can we do about that to make it all work?
> 
> systemd right now isn't configuring any resource control by default,
> so I'm not sure why it is relevant in this discussion.

AFAIK systemd already offers knobs to configure resource controls [1].
Besides that, we are talking about memcg features which are available
only on the unified hierarchy, and that is what systemd is using
already.

> You gotta
> change the layout to configure resource control no matter what and
> it's pretty easy to do. systemd folks are planning to integrate higher
> level resource control features, so my expectation is that the default
> layout is gonna change as it develops.

Do you have any pointers to those discussions? I am not really following
systemd development.

Anyway, I am skeptical that systemd can do anything much more clever
than placing cgroups with resource control under the root cgroup. At
least not without some tagging of which workloads are somehow related.

That being said, I do not really blame systemd here. We are not making
their life particularly easy TBH.

[1] https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 16:36                   ` Michal Hocko
@ 2020-02-13 16:57                     ` Tejun Heo
  2020-02-14  7:15                       ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2020-02-13 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello,

On Thu, Feb 13, 2020 at 05:36:36PM +0100, Michal Hocko wrote:
> AFAIK systemd already offers knobs to configure resource controls [1].

Yes, it can set up the control knobs as directed but it doesn't ship
with any material resource configurations or has conventions set up
around it.

> Besides that, we are talking about memcg features which are available
> only on the unified hierarchy, and that is what systemd is using
> already.

I'm not quite sure what the above sentence is trying to say.

> > You gotta
> > change the layout to configure resource control no matter what and
> > it's pretty easy to do. systemd folks are planning to integrate higher
> > level resource control features, so my expectation is that the default
> > layout is gonna change as it develops.
> 
> Do you have any pointers to those discussions? I am not really following
> systemd development.

There's a plan to integrate a streamlined implementation of oomd into
systemd. There was a thread somewhere, but the only thing I can find
now is a phoronix link.

  https://www.phoronix.com/scan.php?page=news_item&px=Systemd-Facebook-OOMD

systemd recently implemented DisableControllers= so that upper-level
slices can have authority over which controllers are enabled below
them. In a similar vein, there were discussions about making it
auto-propagate some of the configs down the hierarchy, but the kernel
doing the right thing and maintaining consistent semantics across
controllers seems to be the right approach.
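
(For reference, that knob is a one-liner in a slice unit - the
directive name is from systemd.resource-control(5), the slice contents
are made up:

	[Slice]
	DisableControllers=memory io

with the controllers to disable given as a space-separated list.)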

There was also a discussion with a distro. Nothing concrete yet but I
think we're more likely to see more resource control configs being
deployed by default in the future.

> Anyway, I am skeptical that systemd can do anything much more clever
> than placing cgroups with resource control under the root cgroup. At
> least not without some tagging of which workloads are somehow related.

Yeah, exactly, all it needs to do is place scopes / services according
to the resource hierarchy and configure overall policy at higher-level
slices, which is exactly what the memory.low semantics change will
allow.

> That being said, I do not really blame systemd here. We are not making
> their life particularly easy TBH.

Do you mind elaborating a bit?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 15:46               ` Michal Hocko
@ 2020-02-13 17:41                 ` Johannes Weiner
  2020-02-13 17:58                   ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-13 17:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu, Feb 13, 2020 at 04:46:27PM +0100, Michal Hocko wrote:
> On Thu 13-02-20 08:23:17, Johannes Weiner wrote:
> > On Thu, Feb 13, 2020 at 08:40:49AM +0100, Michal Hocko wrote:
> > > On Wed 12-02-20 12:08:26, Johannes Weiner wrote:
> > > > On Tue, Feb 11, 2020 at 05:47:53PM +0100, Michal Hocko wrote:
> > > > > Unless I am missing something then I am afraid it doesn't. Say you have a
> > > > > default systemd cgroup deployment (aka deeper cgroup hierarchy with
> > > > > slices and scopes) and now you want to grant a reclaim protection on a
> > > > > leaf cgroup (or even a whole slice that is not really important). All the
> > > > > hierarchy up the tree has the protection set to 0 by default, right? You
> > > > > simply cannot get that protection. You would need to configure the
> > > > > protection up the hierarchy and that is really cumbersome.
> > > > 
> > > > Okay, I think I know what you mean. Let's say you have a tree like
> > > > this:
> > > > 
> > > >                           A
> > > >                          / \
> > > >                         B1  B2
> > > >                        / \   \
> > > >                       C1 C2   C3
> > > > 
> > > > and there is no actual delegation point - everything belongs to the
> > > > same user / trust domain. C1 sets memory.low to 10G, but its parents
> > > > set nothing. You're saying we should honor the 10G protection during
> > > > global and limit reclaims anywhere in the tree?
> > > 
> > > No, only in C1, which sets the limit, because that is the working
> > > set we want to protect.
> > > 
> > > > Now let's consider there is a delegation point at B1: we set up and
> > > > trust B1, but not its children. What effect would the C1 protection
> > > > have then? Would we ignore it during global and A reclaim, but honor
> > > > it when there is B1 limit reclaim?
> > > 
> > > In the scheme with the inherited protection it would act as the gate
> > > and require an explicit low limit setup defaulting to 0 if none is
> > > specified.
> > > 
> > > > Doing an explicit downward propagation from the root to C1 *could* be
> > > > tedious, but I can't think of a scenario where it's completely
> > > > impossible. Especially because we allow proportional distribution when
> > > > the limit is overcommitted and you don't have to be 100% accurate.
> > > 
> > > So let's see how that works in practice, say a multi-workload setup
> > > with complex/deep cgroup hierarchies (e.g. your above example). No
> > > delegation point this time.
> > > 
> > > C1 asks for low=1G while using 500M, C3 low=100M using 80M.  B1 and
> > > B2 are completely independent workloads, and the same applies to C2,
> > > which doesn't ask for any protection at all. C2 uses 100M. Now the
> > > admin has to propagate protection upwards so B1 low=1G, B2 low=100M
> > > and A low=1G, right? Let's say we have a global reclaim due to
> > > external pressure that originates from outside of the A hierarchy (it
> > > is not overcommitted on the protection).
> > > 
> > > Unless I miss something, C2 would get protection even though nobody
> > > asked for it.
> > 
> > Good observation, but I think you spotted an unintentional side effect
> > of how I implemented the "floating protection" calculation rather than
> > a design problem.
> > 
> > My patch still allows explicit downward propagation. So if B1 sets up
> > 1G, and C1 explicitly claims those 1G (low>=1G, usage>=1G), C2 does
> > NOT get any protection. There is no "floating" protection left in B1
> > that could get to C2.
> 
> Yeah, the saturated protection works reasonably AFAICS.

Hm, Tejun raises a good point though: even if you could direct memory
protection down to one targeted leaf, you can't do the same with IO or
CPU. Those follow non-conserving weight distribution, and whatever you
allocate to a certain level is available at that level - if one child
doesn't consume it, the other children can.
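
To illustrate the difference with a toy sketch (illustration only, not
how any controller actually computes shares): a weight divides
whatever is available at each level, so a child's unused share flows
to its siblings, while memory.low is an absolute number of bytes that
can simply go unused.

	/*
	 * io.weight / cpu.weight style: a share of what's available
	 * at this level; nothing is left on the table.
	 */
	static unsigned long weight_share(unsigned long available,
					  unsigned int weight,
					  unsigned int total_weight)
	{
		return total_weight ?
		       available * weight / total_weight : 0;
	}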

And we know that controlling memory without controlling IO doesn't
really work in practice. The sibling with less memory allowance will
just page more.

So the question becomes: is this even a legit usecase? If every other
resource is distributed on a level-by-level method already, does it
buy us anything to make memory work differently?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 17:41                 ` Johannes Weiner
@ 2020-02-13 17:58                   ` Johannes Weiner
  2020-02-14  7:59                     ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-13 17:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu, Feb 13, 2020 at 12:41:36PM -0500, Johannes Weiner wrote:
> On Thu, Feb 13, 2020 at 04:46:27PM +0100, Michal Hocko wrote:
> > On Thu 13-02-20 08:23:17, Johannes Weiner wrote:
> > > On Thu, Feb 13, 2020 at 08:40:49AM +0100, Michal Hocko wrote:
> > > > On Wed 12-02-20 12:08:26, Johannes Weiner wrote:
> > > > > On Tue, Feb 11, 2020 at 05:47:53PM +0100, Michal Hocko wrote:
> > > > > > Unless I am missing something then I am afraid it doesn't. Say you have a
> > > > > > default systemd cgroup deployment (aka deeper cgroup hierarchy with
> > > > > > slices and scopes) and now you want to grant a reclaim protection on a
> > > > > > leaf cgroup (or even a whole slice that is not really important). All the
> > > > > > hierarchy up the tree has the protection set to 0 by default, right? You
> > > > > > simply cannot get that protection. You would need to configure the
> > > > > > protection up the hierarchy and that is really cumbersome.
> > > > > 
> > > > > Okay, I think I know what you mean. Let's say you have a tree like
> > > > > this:
> > > > > 
> > > > >                           A
> > > > >                          / \
> > > > >                         B1  B2
> > > > >                        / \   \
> > > > >                       C1 C2   C3

> > > > So let's see how that works in practice, say a multi-workload setup
> > > > with complex/deep cgroup hierarchies (e.g. your above example). No
> > > > delegation point this time.
> > > > 
> > > > C1 asks for low=1G while using 500M, C3 low=100M using 80M.  B1 and
> > > > B2 are completely independent workloads, and the same applies to C2,
> > > > which doesn't ask for any protection at all. C2 uses 100M. Now the
> > > > admin has to propagate protection upwards so B1 low=1G, B2 low=100M
> > > > and A low=1G, right? Let's say we have a global reclaim due to
> > > > external pressure that originates from outside of the A hierarchy (it
> > > > is not overcommitted on the protection).
> > > > 
> > > > Unless I miss something, C2 would get protection even though nobody
> > > > asked for it.
> > > 
> > > Good observation, but I think you spotted an unintentional side effect
> > > of how I implemented the "floating protection" calculation rather than
> > > a design problem.
> > > 
> > > My patch still allows explicit downward propagation. So if B1 sets up
> > > 1G, and C1 explicitly claims those 1G (low>=1G, usage>=1G), C2 does
> > > NOT get any protection. There is no "floating" protection left in B1
> > > that could get to C2.
> > 
> > Yeah, the saturated protection works reasonably AFAICS.
> 
> Hm, Tejun raises a good point though: even if you could direct memory
> protection down to one targeted leaf, you can't do the same with IO or
> CPU. Those follow non-conserving weight distribution, and whatever you

                    "work-conserving", obviously.

> allocate to a certain level is available at that level - if one child
> doesn't consume it, the other children can.
> 
> And we know that controlling memory without controlling IO doesn't
> really work in practice. The sibling with less memory allowance will
> just page more.
> 
> So the question becomes: is this even a legit usecase? If every other
> resource is distributed on a level-by-level method already, does it
> buy us anything to make memory work differently?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 16:57                     ` Tejun Heo
@ 2020-02-14  7:15                       ` Michal Hocko
  2020-02-14 13:57                         ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-14  7:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

On Thu 13-02-20 11:57:11, Tejun Heo wrote:
> Hello,
> 
> On Thu, Feb 13, 2020 at 05:36:36PM +0100, Michal Hocko wrote:
> > AFAIK systemd already offers knobs to configure resource controls [1].
> 
> Yes, it can set up the control knobs as directed but it doesn't ship
> with any material resource configurations or has conventions set up
> around it.

Right. But services might use those knobs, right? And that means that
if somebody wants memory protection, then the service file is going to
use MemoryLow=$FOO, and that is likely not going to work properly
without additional hassle, e.g. propagating it upwards, which systemd
doesn't do unless I am mistaken.

> > Besides that, we are talking about memcg features which are available
> > only on the unified hierarchy, and that is what systemd is using
> > already.
> 
> I'm not quite sure what the above sentence is trying to say.

I meant to say that once the unified hierarchy is used by systemd you
cannot configure it differently to suit your needs without interfering
with systemd.
 
> > > You gotta
> > > change the layout to configure resource control no matter what and
> > > it's pretty easy to do. systemd folks are planning to integrate higher
> > > level resource control features, so my expectation is that the default
> > > layout is gonna change as it develops.
> > 
> > Do you have any pointers to those discussions? I am not really following
> > systemd development.
> 
> There's a plan to integrate a streamlined implementation of oomd into
> systemd. There was a thread somewhere, but the only thing I can find
> now is a phoronix link.
> 
>   https://www.phoronix.com/scan.php?page=news_item&px=Systemd-Facebook-OOMD

I am not sure I see how that is going to change much wrt. resource
distribution TBH. Is the existing cgroup hierarchy going to change for
the OOMD to be deployed?

[...]

> > Anyway, I am skeptical that systemd can do anything much more clever
> > than placing cgroups with resource control under the root cgroup. At
> > least not without some tagging of which workloads are somehow related.
> 
> Yeah, exactly, all it needs to do is place scopes / services according
> to the resource hierarchy and configure overall policy at higher-level
> slices, which is exactly what the memory.low semantics change will
> allow.

Let me ask more specifically. Is there any plan or existing API to
allow configuring which services are related resource-wise?

> > That being said, I do not really blame systemd here. We are not making
> > their life particularly easy TBH.
> 
> Do you mind elaborating a bit?

I believe I have already expressed the configurability concern
elsewhere in the email thread. It boils down to the necessity of
propagating protection all the way up the hierarchy properly if you
really need to protect leaf cgroups that are organized without
resource control in mind. Which is what systemd does.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-13 17:58                   ` Johannes Weiner
@ 2020-02-14  7:59                     ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2020-02-14  7:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Thu 13-02-20 12:58:13, Johannes Weiner wrote:
> On Thu, Feb 13, 2020 at 12:41:36PM -0500, Johannes Weiner wrote:
> > On Thu, Feb 13, 2020 at 04:46:27PM +0100, Michal Hocko wrote:
> > > On Thu 13-02-20 08:23:17, Johannes Weiner wrote:
> > > > On Thu, Feb 13, 2020 at 08:40:49AM +0100, Michal Hocko wrote:
> > > > > On Wed 12-02-20 12:08:26, Johannes Weiner wrote:
> > > > > > On Tue, Feb 11, 2020 at 05:47:53PM +0100, Michal Hocko wrote:
> > > > > > > Unless I am missing something then I am afraid it doesn't. Say you have a
> > > > > > > default systemd cgroup deployment (aka deeper cgroup hierarchy with
> > > > > > > slices and scopes) and now you want to grant a reclaim protection on a
> > > > > > > leaf cgroup (or even a whole slice that is not really important). All the
> > > > > > > hierarchy up the tree has the protection set to 0 by default, right? You
> > > > > > > simply cannot get that protection. You would need to configure the
> > > > > > > protection up the hierarchy and that is really cumbersome.
> > > > > > 
> > > > > > Okay, I think I know what you mean. Let's say you have a tree like
> > > > > > this:
> > > > > > 
> > > > > >                           A
> > > > > >                          / \
> > > > > >                         B1  B2
> > > > > >                        / \   \
> > > > > >                       C1 C2   C3
> 
> > > > > So let's see how that works in practice, say a multi-workload setup
> > > > > with complex/deep cgroup hierarchies (e.g. your above example). No
> > > > > delegation point this time.
> > > > > 
> > > > > C1 asks for low=1G while using 500M, C3 low=100M using 80M.  B1 and
> > > > > B2 are completely independent workloads, and the same applies to C2,
> > > > > which doesn't ask for any protection at all. C2 uses 100M. Now the
> > > > > admin has to propagate protection upwards so B1 low=1G, B2 low=100M
> > > > > and A low=1G, right? Let's say we have a global reclaim due to
> > > > > external pressure that originates from outside of the A hierarchy (it
> > > > > is not overcommitted on the protection).
> > > > > 
> > > > > Unless I miss something, C2 would get protection even though nobody
> > > > > asked for it.
> > > > 
> > > > Good observation, but I think you spotted an unintentional side effect
> > > > of how I implemented the "floating protection" calculation rather than
> > > > a design problem.
> > > > 
> > > > My patch still allows explicit downward propagation. So if B1 sets up
> > > > 1G, and C1 explicitly claims those 1G (low>=1G, usage>=1G), C2 does
> > > > NOT get any protection. There is no "floating" protection left in B1
> > > > that could get to C2.
> > > 
> > > Yeah, the saturated protection works reasonably AFAICS.
> > 
> > Hm, Tejun raises a good point though: even if you could direct memory
> > protection down to one targeted leaf, you can't do the same with IO or
> > CPU. Those follow non-conserving weight distribution, and whatever you
> 
>                     "work-conserving", obviously.
> 
> > allocate to a certain level is available at that level - if one child
> > doesn't consume it, the other children can.

Right, but isn't this pretty much the definition of weight-based
resource distribution? Memory knobs operate in absolute numbers
(limits), and that means that children might both over- and
undercommit what the parent assigns to them. Or have I misunderstood
you here?

> > And we know that controlling memory without controlling IO doesn't
> > really work in practice. The sibling with less memory allowance will
> > just page more.
> > 
> > So the question becomes: is this even a legit usecase? If every other
> > resource is distributed on a level-by-level method already, does it
> > buy us anything to make memory work differently?

Aren't you mixing weights and limits here? I am quite confused TBH.

Let me give you an example. Say you have a DB workload which is the
primary thing running on your system and which you want to protect
from unrelated activity (backups, frontends, etc.). Running it inside
a cgroup with memory.low, while the other components run in other
cgroups without any protection, achieves that. If those cgroups are
top level, then this is a simple and straightforward configuration.

Things would get much more tricky if you want to run the same
workload deeper down the hierarchy - e.g. run it in a container. Now
your "root" has to use an explicit low protection as well, and all
other potential cgroups that are in the same sub-hierarchy (read: in
the same container) need to opt out of the protection because they are
not meant to be protected.
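
To make that concrete (names and numbers made up): with a 10G
protection target for the workload, every level above it needs an
explicit value and every unrelated sibling inside the container needs
an explicit opt-out, e.g.:

	container/memory.low          = 10G  (the container "root")
	container/db/memory.low       = 10G  (the workload to protect)
	container/backup/memory.low   = 0    (explicit opt-out)
	container/frontend/memory.low = 0    (explicit opt-out)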

In short we simply have to live with usecases where the cgroup hierarchy
follows the "logical" workload organization at the higher level more
than resource control. This is the case for systemd as well btw.
Workloads are organized into slices and scopes without any direct
relation to resources in mind.

Does this make it more clear what I am thinking about? Does it sound
like a legit usecase?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-14  7:15                       ` Michal Hocko
@ 2020-02-14 13:57                         ` Tejun Heo
  2020-02-14 15:13                           ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2020-02-14 13:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello,

On Fri, Feb 14, 2020 at 08:15:37AM +0100, Michal Hocko wrote:
> > Yes, it can set up the control knobs as directed but it doesn't ship
> > with any material resource configurations or has conventions set up
> > around it.
> 
> Right. But services might use those knobs, right? And that means that
> if somebody wants memory protection, then the service file is going to
> use MemoryLow=$FOO, and that is likely not going to work properly
> without additional hassle, e.g. propagating it upwards, which systemd
> doesn't do unless I am mistaken.

While there are applications where strict protection makes sense, in a
lot of cases, resource decisions have to consider factors global to
the system - how much is there and for what purpose the system is
being set up. Static per-service configuration for sure doesn't work
and neither will dynamic configuration without considering system-wide
factors.

Another aspect is that as the configuration gets more granular and
stricter with memory knobs, the configuration becomes less
work-conserving. The kernel's MM keeps track of dynamic behavior and
adapts to the dynamic usage; these configurations can't.

So, while individual applications may indicate what their resource
dispositions are, a working configuration is not gonna come from each
service declaring how many bytes it wants.

This doesn't mean configurations are more tedious or difficult. In
fact, in a lot of cases, categorizing applications on the system
broadly and assigning ballpark weights and memory protections from the
higher level is sufficient.

> > > Besides that, we are talking about memcg features which are available
> > > only on the unified hierarchy, and that is what systemd is using
> > > already.
> > 
> > I'm not quite sure what the above sentence is trying to say.
> 
> I meant to say that once the unified hierarchy is used by systemd you
> cannot configure it differently to suit your needs without interfering
> with systemd.

I haven't experienced systemd getting in the way of structuring the
cgroup hierarchy and configuring it. It's pretty flexible and easy to
configure. Do you have any specific constraints in mind?

> > There's a plan to integrate a streamlined implementation of oomd into
> > systemd. There was a thread somewhere, but the only thing I can find
> > now is a phoronix link.
> > 
> >   https://www.phoronix.com/scan.php?page=news_item&px=Systemd-Facebook-OOMD
> 
> I am not sure I see how that is going to change much wrt. resource
> distribution TBH. Is the existing cgroup hierarchy going to change for
> the OOMD to be deployed?

It's not a hard requirement, but it'll be a lot more useful with an
actual resource hierarchy. As more resource control features get
enabled, I think it'll converge that way because that's more useful.

> > Yeah, exactly, all it needs to do is place scopes / services according
> > to the resource hierarchy and configure overall policy at higher-level
> > slices, which is exactly what the memory.low semantics change will
> > allow.
> 
> Let me ask more specifically. Is there any plan or existing API to
> allow configuring which services are related resource-wise?

At the kernel level, no. They seem like pretty high-level policy
decisions to me.

> > > That being said, I do not really blame systemd here. We are not making
> > > their life particularly easy TBH.
> > 
> > Do you mind elaborating a bit?
> 
> I believe I have already expressed the configurability concern
> elsewhere in the email thread. It boils down to the necessity of
> propagating protection all the way up the hierarchy properly if you
> really need to protect leaf cgroups that are organized without
> resource control in mind. Which is what systemd does.

But that doesn't work for other controllers at all. I'm having a
difficult time imagining how making this one control mechanism work
that way makes sense. Memory protection has to be configured together
with IO protection to be actually effective.

As for the cgroup hierarchy being unrelated to how controllers
behave, it frankly reminds me of the cgroup1 memcg flat hierarchy
thing; I'm not sure how that would actually work in terms of resource
isolation. Also, I'm not sure how systemd forces such configurations,
and I'd think systemd folks would be happy to fix them if there are
such problems. Is the point you're trying to make "because of systemd,
we have to contort how the memory controller behaves"?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-14 13:57                         ` Tejun Heo
@ 2020-02-14 15:13                           ` Michal Hocko
  2020-02-14 15:40                             ` Tejun Heo
  2020-02-14 16:53                             ` Johannes Weiner
  0 siblings, 2 replies; 52+ messages in thread
From: Michal Hocko @ 2020-02-14 15:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

On Fri 14-02-20 08:57:28, Tejun Heo wrote:
[...]

Sorry to skip over a large part of your response. The discussion in
this thread has gotten quite fragmented already and I would really
like to reach some conclusion.

> > I believe I have already expressed the configurability concern
> > elsewhere in the email thread. It boils down to the necessity of
> > propagating protection all the way up the hierarchy properly if you
> > really need to protect leaf cgroups that are organized without
> > resource control in mind. Which is what systemd does.
> 
> But that doesn't work for other controllers at all. I'm having a
> difficult time imagining how making this one control mechanism work
> that way makes sense. Memory protection has to be configured together
> with IO protection to be actually effective.

Please be more specific. If the protected workload is mostly
in-memory, I do not really see how the IO controller is relevant. See
the example of the DB setup I've mentioned elsewhere.

> As for the cgroup hierarchy being unrelated to how controllers
> behave, it frankly reminds me of the cgroup1 memcg flat hierarchy
> thing; I'm not sure how that would actually work in terms of resource
> isolation. Also, I'm not sure how systemd forces such configurations,
> and I'd think systemd folks would be happy to fix them if there are
> such problems. Is the point you're trying to make "because of systemd,
> we have to contort how the memory controller behaves"?

No, I am just saying, as explained in my reply to Johannes, that
there are practical cases where the cgroup hierarchy reflects
organizational structure as well.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-14 15:13                           ` Michal Hocko
@ 2020-02-14 15:40                             ` Tejun Heo
  2020-02-14 16:53                             ` Johannes Weiner
  1 sibling, 0 replies; 52+ messages in thread
From: Tejun Heo @ 2020-02-14 15:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Roman Gushchin, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello,

On Fri, Feb 14, 2020 at 04:13:18PM +0100, Michal Hocko wrote:
> On Fri 14-02-20 08:57:28, Tejun Heo wrote:
> > But that doesn't work for other controllers at all. I'm having a
> > difficult time imagining how making this one control mechanism work
> > that way makes sense. Memory protection has to be configured together
> > with IO protection to be actually effective.
> 
> Please be more specific. If the protected workload is mostly
> in-memory, I do not really see how the IO controller is relevant. See
> the example of the DB setup I've mentioned elsewhere.

Most applications, even the ones which don't use explicit IOs much,
don't have a set memory footprint which is uniformly accessed, and
there needs to be some level of reclaim activity for the working set
to be established and maintained. Without IO control, memory
protection isn't enough to protect the workload.

Even if we narrow down the discussion to something like memcache,
which has a fixed memory footprint with an almost uniform access
pattern, real-world applications don't exist in a vacuum - they
compete on CPU, have to do logging, pull in a metric ton of libraries
which implicitly access stuff, and so on. If somebody else is
pummeling the filesystem and there's no IO isolation set up, it'll
stall noticeably every once in a while.

> > As for cgroup hierarchy being unrelated to how controllers behave, it
> > frankly reminds me of cgroup1 memcg flat hierarchy thing I'm not sure
> > how that would actually work in terms of resource isolation. Also, I'm
> > not sure how systemd forces such configurations and I'd think systemd
> > folks would be happy to fix them if there are such problems. Is the
> > point you're trying to make "because of systemd, we have to contort
> > how memory controller behaves"?
> 
> No, I am just saying, as explained in my reply to Johannes, that
> there are practical cases where the cgroup hierarchy reflects
> organizational structure as well.

Oh I see. If the cgroup hierarchy isn't set up for resource control,
resource control not working well seems par for the course. I mean,
no other controllers would work anyway, so I'm having a hard time
seeing what the point is. What we ultimately want is cgroup actually
being useful for its primary purpose of resource control while
supporting other organizational use cases, and while the established
usages aren't there yet, I haven't seen anything fundamentally
blocking that.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-14 15:13                           ` Michal Hocko
  2020-02-14 15:40                             ` Tejun Heo
@ 2020-02-14 16:53                             ` Johannes Weiner
  2020-02-14 17:17                               ` Tejun Heo
  2020-02-17  8:41                               ` Michal Hocko
  1 sibling, 2 replies; 52+ messages in thread
From: Johannes Weiner @ 2020-02-14 16:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Fri, Feb 14, 2020 at 04:13:18PM +0100, Michal Hocko wrote:
> On Fri 14-02-20 08:57:28, Tejun Heo wrote:
> [...]
> 
> Sorry to skip over a large part of your response. The discussion in
> this thread has gotten quite fragmented already and I would really
> like to reach some conclusion.
> 
> > > I believe I have already expressed the configurability concern
> > > elsewhere in the email thread. It boils down to the necessity of
> > > propagating protection all the way up the hierarchy properly if you
> > > really need to protect leaf cgroups that are organized without
> > > resource control in mind. Which is what systemd does.
> > 
> > But that doesn't work for other controllers at all. I'm having a
> > difficult time imagining how making this one control mechanism work
> > that way makes sense. Memory protection has to be configured together
> > with IO protection to be actually effective.
> 
> Please be more specific. If the protected workload is mostly
> in-memory, I do not really see how the IO controller is relevant. See
> the example of the DB setup I've mentioned elsewhere.

Say you have the following tree:

                         root
                       /     \
               system.slice  user.slice
               /    |           |
           cron  journal     user-100.slice
                                |
                             session-c2.scope
                                |          \
                             the_workload  misc

You're saying you want to prioritize the workload above everything
else in the system: system tasks, other users' stuff, even other stuff
running as your own user. Therefore you configure memory.low on the
workload and propagate the value up the tree. So far so good, memory
is protected.

However, what you've really done just now is flatten the resource
hierarchy. You didn't just configure the_workload to be more important
than its sibling "misc"; you actually pulled it up the resource tree
and declared it more important than what's running in other sessions,
what users are running, and even the system software. Your cgroup tree
still reflects process ownership, but it doesn't actually reflect the
resource hierarchy you just configured.

And that is something that other controllers do not support: you
cannot give IO weight to the_workload without also making the c2
session more important than other sessions, user 100 more important
than other users, and user.slice more important than system.slice.

Memory is really the only controller where this kind of setup works
at all.

And yes, maybe you're talking about a highly specific case where
the_workload is mostly in memory and also doesn't need any IO capacity
and you have CPU to spare, so this is good enough for you and you
don't care about the other controllers in that particular usecase.

But the discussion is about the broader approach to resource control:
if other controllers already require the cgroup hierarchy to reflect
resource hierarchy, memory shouldn't be different just because.

And in practice, most workloads *do* in fact need comprehensive
control. Which workload exclusively needs memory, but no IO, and no
CPU? If you only control for memory, lower priority workloads tend to
eat up your CPU doing reclaim and your IO doing paging.

So when talking about the design and semantics of one controller, you
have to think about this comprehensively.

The proper solution to implement the kind of resource hierarchy you
want to express in cgroup2 is to reflect it in the cgroup tree. Yes,
the_workload might have been started by user 100 in session c2, but in
terms of resources, it's prioritized over system.slice and user.slice,
and so that's the level where it needs to sit:

                               root
                       /        |                 \
               system.slice  user.slice       the_workload
               /    |           |
           cron  journal     user-100.slice
                                |
                             session-c2.scope
                                |
                             misc

Then you can configure not just memory.low, but also a proper io
weight and a cpu weight. And the tree correctly reflects where the
workload is in the pecking order of who gets access to resources.

> > As for the cgroup hierarchy being unrelated to how controllers
> > behave, it frankly reminds me of the cgroup1 memcg flat hierarchy
> > thing; I'm not sure how that would actually work in terms of resource
> > isolation. Also, I'm not sure how systemd forces such configurations,
> > and I'd think systemd folks would be happy to fix them if there are
> > such problems. Is the point you're trying to make "because of systemd,
> > we have to contort how the memory controller behaves"?
> 
> No, I am just saying, as explained in my reply to Johannes, that
> there are practical cases where the cgroup hierarchy reflects
> organizational structure as well.

There is nothing wrong with that, and systemd does it by default. But
it also doesn't enable resource control domains by default.

Once you do split up resource domains, you cannot express both
hierarchies in a single tree. And cgroup2 controllers are designed
such that resource domain hierarchy takes precedence.

It's perfectly fine to launch the workload in a different/new
resource domain. This doesn't conflict with systemd, and is in fact
supported by it. See the Slice= attribute (systemd.resource-control(5)).
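
For instance, something like this (unit names made up; the directives
are from systemd.resource-control(5)):

	# the-workload.service
	[Service]
	Slice=workload.slice

	# workload.slice
	[Slice]
	MemoryLow=10G
	IOWeight=500
	CPUWeight=500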

Sure you need to be privileged to do this, but you also need to be
privileged to set memory.low on user.slice...


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-14 16:53                             ` Johannes Weiner
@ 2020-02-14 17:17                               ` Tejun Heo
  2020-02-17  8:41                               ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Tejun Heo @ 2020-02-14 17:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Fri, Feb 14, 2020 at 11:53:11AM -0500, Johannes Weiner wrote:
> However, what you've really done just now is flatten the resource
> hierarchy. You didn't just configure the_workload to be more important
> than its sibling "misc"; you actually pulled it up the resource tree
> and declared it more important than what's running in other sessions,
> what users are running, and even the system software. Your cgroup tree
> still reflects process ownership, but it doesn't actually reflect the
> resource hierarchy you just configured.

Just to second this point, anything moving in this direction will be a
hard nack from me. We don't want use_hierarchy for cgroup2 and I'm
baffled that this is even being suggested seriously. If we have
learned *anything* from cgroup1's mistakes, this should be the one.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-14 16:53                             ` Johannes Weiner
  2020-02-14 17:17                               ` Tejun Heo
@ 2020-02-17  8:41                               ` Michal Hocko
  2020-02-18 19:52                                 ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-17  8:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Fri 14-02-20 11:53:11, Johannes Weiner wrote:
[...]
> The proper solution to implement the kind of resource hierarchy you
> want to express in cgroup2 is to reflect it in the cgroup tree. Yes,
> the_workload might have been started by user 100 in session c2, but in
> terms of resources, it's prioritized over system.slice and user.slice,
> and so that's the level where it needs to sit:
> 
>                                root
>                        /        |                 \
>                system.slice  user.slice       the_workload
>                /    |           |
>            cron  journal     user-100.slice
>                                 |
>                              session-c2.scope
>                                 |
>                              misc
> 
> Then you can configure not just memory.low, but also a proper io
> weight and a cpu weight. And the tree correctly reflects where the
> workload is in the pecking order of who gets access to resources.

I have already mentioned that this would be the only solution where
the protection would work, right. But I am also saying that this is a
trivial example where you simply _can_ move your workload to the 1st
level. What about those setups that need to reflect their organization
in the hierarchy? Please have a look at
http://lkml.kernel.org/r/20200214075916.GM31689@dhcp22.suse.cz
Are you saying they are just not supported? Are they supposed to use
cgroup v1 for the organization and v2 for the resource control?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-17  8:41                               ` Michal Hocko
@ 2020-02-18 19:52                                 ` Johannes Weiner
  2020-02-21 10:11                                   ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-18 19:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Mon, Feb 17, 2020 at 09:41:00AM +0100, Michal Hocko wrote:
> On Fri 14-02-20 11:53:11, Johannes Weiner wrote:
> [...]
> > The proper solution to implement the kind of resource hierarchy you
> > want to express in cgroup2 is to reflect it in the cgroup tree. Yes,
> > the_workload might have been started by user 100 in session c2, but in
> > terms of resources, it's prioritized over system.slice and user.slice,
> > and so that's the level where it needs to sit:
> > 
> >                                root
> >                        /        |                 \
> >                system.slice  user.slice       the_workload
> >                /    |           |
> >            cron  journal     user-100.slice
> >                                 |
> >                              session-c2.scope
> >                                 |
> >                              misc
> > 
> > Then you can configure not just memory.low, but also a proper io
> > weight and a cpu weight. And the tree correctly reflects where the
> > workload is in the pecking order of who gets access to resources.
> 
> I have already mentioned that this would be the only solution where
> the protection would work, right. But I am also saying that this is a
> trivial example where you simply _can_ move your workload to the 1st
> level. What about those setups that need to reflect their organization
> in the hierarchy? Please have a look at
> http://lkml.kernel.org/r/20200214075916.GM31689@dhcp22.suse.cz
> Are you saying they are just not supported? Are they supposed to use
> cgroup v1 for the organization and v2 for the resource control?

From that email:

    > Let me give you an example. Say you have a DB workload which is the
    > primary thing running on your system and which you want to protect
    > from unrelated activity (backups, frontends, etc.). Running it inside
    > a cgroup with memory.low, while the other components run in other
    > cgroups without any protection, achieves that. If those cgroups are
    > top level, then this is a simple and straightforward configuration.
    > 
    > Things would get much more tricky if you want to run the same workload
    > deeper down the hierarchy - e.g. run it in a container. Now your
    > "root" has to use an explicit low protection as well and all other
    > potential cgroups that are in the same sub-hierarchy (read in the same
    > container) need to opt-out from the protection because they are not
    > meant to be protected.

You can't prioritize some parts of a cgroup higher than the outside of
the cgroup, and other parts lower than the outside. That's just not
something that can be sanely supported from the controller interface.

However, that doesn't mean this usecase isn't supported. You *can*
always split cgroups for separate resource policies.

And you *can* split cgroups for group labeling purposes too (tracking
stuff that belongs to a certain user).

So in the scenario where you have an important database and a
not-so-important secondary workload, and you want to run them
containerized, there are two possible scenarios:

- The workloads are co-dependent (e.g. a logging service for the
  db). In that case you actually need to protect them equally,
  otherwise you'll have priority inversions, where the primary gets
  backed up behind the secondary in some form or another.

- The workloads don't interact with each other. In that case, you can
  create two separate containers, one high-pri, one low-pri, and run
  them in parallel. They can share filesystem data, page cache
  etc. where appropriate, so this isn't a problem.

  The fact that they belong to the same team/organization/"user"
  e.g. is an attribute that can be tracked from userspace and isn't
  material from a kernel interface POV.

  You just have two cgroups instead of one to track; but those cgroups
  will still contain stuff like setsid(), setuid() etc. so users
  cannot escape whatever policy/containment you implement for them.

    > In short we simply have to live with usecases where the cgroup hierarchy
    > follows the "logical" workload organization at the higher level more
    > than resource control. This is the case for systemd as well btw.
    > Workloads are organized into slices and scopes without any direct
    > relation to resources in mind.

As I said in the previous email: Yes, per default, because it starts
everything in a single resource domain. But it has all necessary
support for dividing the tree into disjunct resource domains.

    > Does this make it more clear what I am thinking about? Does it sound
    > like a legit usecase?

The desired behavior is legit, but you have to split the cgroups on
conflicting attributes - whether organizational or policy-related -
for properly expressing what you want from the kernel.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-18 19:52                                 ` Johannes Weiner
@ 2020-02-21 10:11                                   ` Michal Hocko
  2020-02-21 15:43                                     ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-21 10:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

[Sorry I didn't get to this email thread sooner]

On Tue 18-02-20 14:52:53, Johannes Weiner wrote:
> On Mon, Feb 17, 2020 at 09:41:00AM +0100, Michal Hocko wrote:
> > On Fri 14-02-20 11:53:11, Johannes Weiner wrote:
> > [...]
> > > The proper solution to implement the kind of resource hierarchy you
> > > want to express in cgroup2 is to reflect it in the cgroup tree. Yes,
> > > the_workload might have been started by user 100 in session c2, but in
> > > terms of resources, it's prioritized over system.slice and user.slice,
> > > and so that's the level where it needs to sit:
> > > 
> > >                                root
> > >                        /        |                 \
> > >                system.slice  user.slice       the_workload
> > >                /    |           |
> > >            cron  journal     user-100.slice
> > >                                 |
> > >                              session-c2.scope
> > >                                 |
> > >                              misc
> > > 
> > > Then you can configure not just memory.low, but also a proper io
> > > weight and a cpu weight. And the tree correctly reflects where the
> > > workload is in the pecking order of who gets access to resources.
> > 
> > I have already mentioned that this would be the only solution for the
> > protection to work, right. But I am also saying that this is a trivial
> > example where you simply _can_ move your workload to the 1st level. What
> > about those that need to reflect their organization in the hierarchy? Please
> > have a look at http://lkml.kernel.org/r/20200214075916.GM31689@dhcp22.suse.cz
> > Are you saying they are just not supported? Are they supposed to use
> > cgroup v1 for the organization and v2 for the resource control?
> 
> From that email:
> 
>     > Let me give you an example. Say you have a DB workload which is the
>     > primary thing running on your system and which you want to protect from
>     > an unrelated activity (backups, frontends, etc). Running it inside a
>     > cgroup with memory.low while other components in other cgroups without
>     > any protection achieves that. If those cgroups are top level then this
>     > is simple and straightforward configuration.
>     > 
>     > Things would get much more tricky if you want to run the same workload
>     > deeper down the hierarchy - e.g. run it in a container. Now your
>     > "root" has to use an explicit low protection as well and all other
>     > potential cgroups that are in the same sub-hierarchy (read in the same
>     > container) need to opt-out from the protection because they are not
>     > meant to be protected.
> 
> You can't prioritize some parts of a cgroup higher than the outside of
> the cgroup, and other parts lower than the outside. That's just not
> something that can be sanely supported from the controller interface.

I am sorry but I do not follow. We do allow opting out of the reclaim
protection with the current semantics and it seems to be reasonably sane.
I also have a hard time grasping what you actually mean by the above.
Let's say you have a hierarchy where you split the low limit unevenly
              root (5G of memory)
             /    \
   (low 3G) A      D (low 1.5G)
           / \
 (low 1G) B   C (low 2G)

B gets lower priority than C and D while C gets higher priority than
D? Is there any problem with such a configuration from the semantic
point of view?

> However, that doesn't mean this usecase isn't supported. You *can*
> always split cgroups for separate resource policies.

What if the split-up is not possible or impractical? Let's say you want
to control how much CPU share your container workload gets compared
to other containers running on the system. Or let's say you want to
treat the whole container as a single entity from the OOM perspective
(this would be an example of the logical organization constraint) because
you do not want to leave any part of that workload lingering behind if
the global OOM kicks in. I am pretty sure there are many other reasons
to run related workloads that don't really share the memory protection
demand under a shared cgroup hierarchy.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-21 10:11                                   ` Michal Hocko
@ 2020-02-21 15:43                                     ` Johannes Weiner
  2020-02-25 12:20                                       ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-21 15:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Fri, Feb 21, 2020 at 11:11:47AM +0100, Michal Hocko wrote:
> On Tue 18-02-20 14:52:53, Johannes Weiner wrote:
> > On Mon, Feb 17, 2020 at 09:41:00AM +0100, Michal Hocko wrote:
> > > On Fri 14-02-20 11:53:11, Johannes Weiner wrote:
> > > [...]
> > > > The proper solution to implement the kind of resource hierarchy you
> > > > want to express in cgroup2 is to reflect it in the cgroup tree. Yes,
> > > > the_workload might have been started by user 100 in session c2, but in
> > > > terms of resources, it's prioritized over system.slice and user.slice,
> > > > and so that's the level where it needs to sit:
> > > > 
> > > >                                root
> > > >                        /        |                 \
> > > >                system.slice  user.slice       the_workload
> > > >                /    |           |
> > > >            cron  journal     user-100.slice
> > > >                                 |
> > > >                              session-c2.scope
> > > >                                 |
> > > >                              misc
> > > > 
> > > > Then you can configure not just memory.low, but also a proper io
> > > > weight and a cpu weight. And the tree correctly reflects where the
> > > > workload is in the pecking order of who gets access to resources.
> > > 
> > > I have already mentioned that this would be the only solution for the
> > > protection to work, right. But I am also saying that this is a trivial
> > > example where you simply _can_ move your workload to the 1st level. What
> > > about those that need to reflect their organization in the hierarchy? Please
> > > have a look at http://lkml.kernel.org/r/20200214075916.GM31689@dhcp22.suse.cz
> > > Are you saying they are just not supported? Are they supposed to use
> > > cgroup v1 for the organization and v2 for the resource control?
> > 
> > From that email:
> > 
> >     > Let me give you an example. Say you have a DB workload which is the
> >     > primary thing running on your system and which you want to protect from
> >     > an unrelated activity (backups, frontends, etc). Running it inside a
> >     > cgroup with memory.low while other components in other cgroups without
> >     > any protection achieves that. If those cgroups are top level then this
> >     > is simple and straightforward configuration.
> >     > 
> >     > Things would get much more tricky if you want to run the same workload
> >     > deeper down the hierarchy - e.g. run it in a container. Now your
> >     > "root" has to use an explicit low protection as well and all other
> >     > potential cgroups that are in the same sub-hierarchy (read in the same
> >     > container) need to opt-out from the protection because they are not
> >     > meant to be protected.
> > 
> > You can't prioritize some parts of a cgroup higher than the outside of
> > the cgroup, and other parts lower than the outside. That's just not
> > something that can be sanely supported from the controller interface.
> 
> I am sorry but I do not follow. We do allow opting out of the reclaim
> protection with the current semantics and it seems to be reasonably sane.

It's not a supported cgroup2 configuration, because the other
controllers do not support such opting out.

Again, you cannot prioritize memory without prioritizing IO, because a
lack of memory will immediately translate to an increase in paging.

The fundamental premise of cgroup2 is that unless you control *all*
contended resources, you are not controlling *any* of them. Years of
running containers in a large scale production environment have
reaffirmed and reinforced this point over and over. If you don't set up
coherent priority domains and control all resources simultaneously,
your workload *will* escape its containment and bring down the host or
other workloads, one way or the other.

The IO controller (as well as the CPU controller) cannot match this
concept of opting out of an assigned priority, so for real life setups
there is no supportable usecase for opting out of memory protection.

A usecase is only supported if all controllers support it.

> I also have a hard time grasping what you actually mean by the above.
> Let's say you have a hierarchy where you split the low limit unevenly
>               root (5G of memory)
>              /    \
>    (low 3G) A      D (low 1.5G)
>            / \
>  (low 1G) B   C (low 2G)
>
> B gets lower priority than C and D while C gets higher priority than
> D? Is there any problem with such a configuration from the semantic
> point of view?

No, that's completely fine.

The thing that you cannot do is create the following setup:

                root
               /    \
        container1 container2
         /    \        |
     highpri  lowpri  midpri

and expect midpri to be always higher priority than lowpri. Because
prioritization is level-by-level: container1 has a relative priority
to container2, highpri has a relative priority to lowpri.

If you want highpri to have higher priority than midpri, you need to
give container1 a higher priority than container2. But what that also
means is that if highpri isn't using its share, then it goes to
lowpri. And then lowpri > midpri.

Think about how io and cpu weights are distributed in this tree and it
makes sense.

You can currently opt lowpri out of memory protection, but then it'll
just page and steal midpri's IO bandwidth, and you have the priority
inversion elsewhere.

The only way to actually implement the priorities as the names suggest
is this:

               root
          /     |    \
         c1    c2     c3
         |      |      |
      highpri midpri lowpri

> > However, that doesn't mean this usecase isn't supported. You *can*
> > always split cgroups for separate resource policies.
> 
> What if the split-up is not possible or impractical? Let's say you want
> to control how much CPU share your container workload gets compared
> to other containers running on the system. Or let's say you want to
> treat the whole container as a single entity from the OOM perspective
> (this would be an example of the logical organization constraint) because
> you do not want to leave any part of that workload lingering behind if
> the global OOM kicks in. I am pretty sure there are many other reasons
> to run related workloads that don't really share the memory protection
> demand under a shared cgroup hierarchy.

The problem is that your "pretty sure" has been proven to be very
wrong in real life. And that's one reason why these arguments are so
frustrating: it's your intuition and gut feeling against the
experience of using this stuff hands-on in large scale production
deployments.

I'm telling you, orthogonal resource control hierarchies result in
priority inversions. Workloads interacting across different resource
domains result in priority inversions. It's the reason why there is a
single tree in cgroup2, and it's the reason why there is the concept
of all resources being combined into single priority buckets.

You cannot actually configure a system the way you claim people would
surely want to and get working resource isolation.

I really don't mind going into specifics and explaining my motivations
here. But you are drawing out the discussion by asserting hypothetical
scenarios that have been so thoroughly refuted over the last decade that
we had to version and redo the entire interface geared toward these
concepts. Meanwhile my patches have been pending for three months with
acks from people who have extensively worked with this interface on
actual container stacks. I would ask you kindly not to do that.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations
  2019-12-19 20:07 ` [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations Johannes Weiner
  2020-01-30 12:54   ` Michal Hocko
@ 2020-02-21 17:10   ` Michal Koutný
  2020-02-25 18:40     ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2020-02-21 17:10 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]

On Thu, Dec 19, 2019 at 03:07:17PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> The effective protection of any given cgroup is a somewhat complicated
> construct that depends on the ancestor's configuration, siblings'
> configurations, as well as current memory utilization in all these
> groups.
I agree with that. It makes it a bit hard to determine the equilibrium
in advance.


> + *    Consider the following example tree:
>   *
> + *        A      A/memory.low = 2G, A/memory.current = 6G
> + *       //\\
> + *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
> + *               C/memory.low = 1G  C/memory.current = 2G
> + *               D/memory.low = 0   D/memory.current = 2G
> + *               E/memory.low = 10G E/memory.current = 0
>   *
> + *    and memory pressure is applied, the following memory
> + *    distribution is expected (approximately*):
>   *
> + *      A/memory.current = 2G
> + *      B/memory.current = 1.3G
> + *      C/memory.current = 0.6G
> + *      D/memory.current = 0
> + *      E/memory.current = 0
>   *
> + *    *assuming equal allocation rate and reclaimability
I think the assumptions for this example don't hold (anymore).
Because reclaim rate depends on the usage above protection, the siblings
won't be reclaimed equally and so the low_usage proportionality will
change over time and the equilibrium distribution is IMO different (I'm
attaching an Octave script to calculate it).

As it depends on the initial usage, I don't think such a general
example can be given (for overcommit).


> @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   * for next usage. This part is intentionally racy, but it's ok,
>   * as memory.low is a best-effort mechanism.
Although it's a different issue, since this updates the docs I'm
mentioning it -- we treat memory.min the same, i.e. it's subject to the
same race; however, it's not meant to be best effort. I didn't look into
the outcomes of potential misaccounting but the comment seems to miss the
impact on memory.min protection.

> @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> [...]
> +	if (parent == root) {
> +		memcg->memory.emin = memcg->memory.min;
> +		memcg->memory.elow = memcg->memory.low;
> +		goto out;
>  	}
Shouldn't this condition be 'if (parent == root_mem_cgroup)'? (I.e. 1st
level takes direct input, but 2nd and further levels redistribute only
what they really got from parent.)


Michal


[-- Attachment #2: script --]
[-- Type: text/plain, Size: 2867 bytes --]

% run as: octave-cli script
%
% Input configurations
% -------------------
% E parent effective protection
% n nominal protection of siblings set at the given level
% c current consumption of siblings at the given level

% example from effective_protection 3.
E = 2;
n = [3 1 0 10];
c = [2 2 2 0];  % this converges to      [1.16 0.84 0 0]
% c = [6 2 2 0];  % keeps ratio          [1.5 0.5 0 0]
% c = [5 2 2 0];  % mixed ratio          [1.45 0.55 0 0]
% c = [8 2 2 0];  % mixed ratio          [1.53 0.47 0 0]

% example from effective_protection 5.
%E = 2;
%n = [1 0];
%c = [2 1];  % coming from close to equilibrium  -> [1.50 0.50]
%c = [100 100];  % coming from "infinity"        -> [1.50 0.50]
%c = [2 2];   % coming from uniformity            -> [1.33 0.67]

% example of recursion by default
%E = 2;
%n = [0 0];
%c = [2 1];  % coming from disbalance            -> [1.33 0.67]
%c = [100 100];  % coming from "infinity"        -> [1.00 1.00]
%c = [2 2];   % coming from uniformity           -> [1.00 1.00]

% example by using infinities (_without_ recursive protection)
%E = 2;
%n = [1e7 1e7];
%c = [2 1];  % coming from disbalance            -> [1.33 0.67]
%c = [100 100];  % coming from "infinity"        -> [1.00 1.00]
%c = [2 2];   % coming from uniformity           -> [1.00 1.00]

% Reclaim parameters
% ------------------

% Minimal reclaim amount (GB)
cluster = 4e-6;

% Reclaim coefficient (think as 0.5^sc->priority)
alpha = .1;

% Simulation parameters
% ---------------------
epsilon = 1e-7;
timeout = 1000;

% Simulation loop
% ---------------------
% Simulation assumes siblings consumed the initial amount of memory (w/out
% reclaim) and then the reclaim starts, all memory is reclaimable, i.e. treated
% same. It simulates only non-low reclaim and assumes all memory.min = 0.

ch = [];
eh = [];
rh = [];

for t = 1:timeout
	% low_usage
	u = min(c, n);
	siblings = sum(u);

	% effective_protection()
	protected = min(n, c);                % start with nominal
	e = protected * min(1, E / siblings); % normalize overcommit

	% recursive protection
	unclaimed = max(0, E - siblings);
	parent_overuse = sum(c) - siblings;
	if (unclaimed > 0 && parent_overuse > 0)
		overuse = max(0, c - protected);
		e += unclaimed * (overuse / parent_overuse);
	endif

	% get_scan_count()
	r = alpha * c;             % assume all memory is in a single LRU list

	% 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
	sz = max(e, c);
	r .*= (1 - (e+epsilon) ./ (sz+epsilon));

	% uncomment the next line to print debug output each iteration
	% e, c, r
	
	% nothing to reclaim, reached equilibrium
	if max(r) < epsilon
		break;
	endif

	% SWAP_CLUSTER_MAX
	r = max(r, (r > epsilon) .* cluster);
	c = max(c - r, 0);
	
	ch = [ch ; c];
	eh = [eh ; e];
	rh = [rh ; r];
endfor

t
c, e
plot([ch, eh])
pause()

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2019-12-19 20:07 ` [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
  2020-01-30 17:00   ` Michal Hocko
@ 2020-02-21 17:12   ` Michal Koutný
  2020-02-21 18:58     ` Johannes Weiner
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2020-02-21 17:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

On Thu, Dec 19, 2019 at 03:07:18PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Unfortunately, this limitation makes it impossible to protect an
> entire subtree from another without forcing the user to make explicit
> protection allocations all the way to the leaf cgroups - something
> that is highly undesirable in real life scenarios.
I see that the jobs in descendant cgroups don't know (or care) what
protection is above them and hence the implicit distribution is sensible
here.

However, the protection your case requires can already be reached thanks
to the hierarchical capping and overcommit normalization -- you can
set memory.low to "max" at all the non-caring descendants.
IIUC, that is the same as setting zeroes (after your patch) and relying
on the recursive distribution of unused protection -- or is there a
mistake in my reasoning?

So in my view, the recursive distribution doesn't bring anything new;
however, its new memory.low semantics don't allow turning the
protection off in a protected subtree (delegating the decision to
distribute protection within parent bounds is IMO a valid use case).

Regards,
Michal


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-21 17:12   ` Michal Koutný
@ 2020-02-21 18:58     ` Johannes Weiner
  2020-02-25 13:37       ` Michal Koutný
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-21 18:58 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

On Fri, Feb 21, 2020 at 06:12:56PM +0100, Michal Koutný wrote:
> On Thu, Dec 19, 2019 at 03:07:18PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Unfortunately, this limitation makes it impossible to protect an
> > entire subtree from another without forcing the user to make explicit
> > protection allocations all the way to the leaf cgroups - something
> > that is highly undesirable in real life scenarios.
> I see that the jobs in descendant cgroups don't know (or care) what
> protection is above them and hence the implicit distribution is sensible
> here.
> 
> However, the protection your case requires can already be reached thanks
> to the hierarchical capping and overcommit normalization -- you can
> set memory.low to "max" at all the non-caring descendants.
> IIUC, that is the same as setting zeroes (after your patch) and relying
> on the recursive distribution of unused protection -- or is there a
> mistake in my reasoning?

That is correct, but it comes with major problems. We did in fact try
exactly this as a workaround in our fleet, but had to revert and
develop the patch we are discussing now instead.

The reason is this: max isn't a "don't care" value. It's just a high
number with actual meaning in the configuration, and that interferes
when you try to compose it with other settings, such as limits.

Here is a configuration we actually use in practice:

                workload.slice (memory.low=20G)
                /                      \
              job (max=12G, low=10G)    job2 (max=12G, low=10G)
             /   \
           task logger

The idea is that we want to mostly protect the workload from other
stuff running in the system (low=20G), but we also want to catch a job
when it goes wild, to ensure reproducibility in testing regardless of
how loaded the host otherwise is (max=12G).

When you set task's and logger's memory.low to "max" or 10G or any
bogus number like this, a limit reclaim in job treats this as origin
protection and tries hard to avoid reclaiming anything in either of
the two cgroups. memory.events::low skyrockets even though no intended
protection was violated, and we'll have reclaim latencies (especially when
there are a few dying cgroups accumulated in the subtree).

So we had to undo this setting because of workload performance and
problems with monitoring workload health (the bogus low events).

The secondary problem with requiring explicit downward propagation is
that you may want to protect all jobs on the host from system
management software, as a very high-level host configuration. But a
random job that gets scheduled on a host, that lives in a delegated
cgroup and namespace, and creates its own nested tree of cgroups to
manage stuff - that job can't possibly *know* about the top-level host
protection that lies beyond the delegation point and outside its own
namespace, and that it needs to propagate protection against rpm
upgrades into its own leaf groups for each tasklet and component.

Again, in practice we have found this to be totally unmanageable: we
routinely first forgot and then had trouble hacking the propagation
into random jobs that create their own groups.

[ And these job subgroups don't even use their *own* memory.low
  prioritization between siblings yet - god knows how you would
  integrate that with the values that you may inherit from higher
  level ancestors. ]

And when you add new hardware configurations, you cannot just make a
top-level change in the host config, you have to update all the job
specs of workloads running in the fleet.

My patch brings memory configuration in line with other cgroup2
controllers. You can make a high-level decision to prioritize one
subtree over another, just like a top-level weight assignment in CPU
or IO, and then you can delegate the subtree to a different entity
that doesn't need to be aware of and reflect that decision all the way
down the tree in its own settings.

And of course can compose it properly with limits.

> So in my view, the recursive distribution doesn't bring anything new;
> however, its new memory.low semantics don't allow turning the
> protection off in a protected subtree (delegating the decision to
> distribute protection within parent bounds is IMO a valid use case).

I've made the case why it's not a supported usecase, and why it is a
meaningless configuration in practice due to the way other controllers
already behave.

I think at this point in the discussion, the only thing I can do is
remind you that the behavior I'm introducing is gated behind a mount
option that nobody is forced to enable if they insist on disagreeing
against all evidence to the contrary.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-21 15:43                                     ` Johannes Weiner
@ 2020-02-25 12:20                                       ` Michal Hocko
  2020-02-25 18:17                                         ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2020-02-25 12:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team


On Fri 21-02-20 10:43:59, Johannes Weiner wrote:
> On Fri, Feb 21, 2020 at 11:11:47AM +0100, Michal Hocko wrote:
[...]
> > I also have a hard time grasping what you actually mean by the above.
> > Let's say you have a hierarchy where you split the low limit unevenly
> >               root (5G of memory)
> >              /    \
> >    (low 3G) A      D (low 1.5G)
> >            / \
> >  (low 1G) B   C (low 2G)
> >
> > B gets lower priority than C and D while C gets higher priority than
> > D? Is there any problem with such a configuration from the semantic
> > point of view?
> 
> No, that's completely fine.

How is B (low $EPS) C (low 3-$EPS) where $EPS->0 so much different
from the above? You prioritize C over B, D over B in both cases under
global memory pressure.

[...]

> > > However, that doesn't mean this usecase isn't supported. You *can*
> > > always split cgroups for separate resource policies.
> > 
> > What if the split-up is not possible or impractical? Let's say you want
> > to control how much CPU share your container workload gets compared
> > to other containers running on the system. Or let's say you want to
> > treat the whole container as a single entity from the OOM perspective
> > (this would be an example of the logical organization constraint) because
> > you do not want to leave any part of that workload lingering behind if
> > the global OOM kicks in. I am pretty sure there are many other reasons
> > to run related workloads that don't really share the memory protection
> > demand under a shared cgroup hierarchy.
> 
> The problem is that your "pretty sure" has been proven to be very
> wrong in real life. And that's one reason why these arguments are so
> frustrating: it's your intuition and gut feeling against the
> experience of using this stuff hands-on in large scale production
> deployments.

I am pretty sure you have a lot of experience with the FB workloads.
And I haven't ever questioned that. All I am trying to explore here is
what the consequences of the newly proposed semantics are. I have provided
a few examples of when an opt-out from memory protection might be
practical. You seem to disagree on the relevance of those usecases and I can
live with that. Not that I am fully convinced, because there is a
difference between the very tight resource control which is your primary
usecase and much simpler deployments focusing on particular resources,
which tend to work most of the time, where occasional failures are
acceptable.

That being said, the new interface requires an explicit opt-in via a mount
option, so there is no risk of regressions. So I can live with it. Please
make sure to document explicitly that the effective low limit protection
doesn't allow opting out even when the limit is set to 0 and the
propagated protection is fully assigned to a sibling memcg.

It would also be really appreciated if we had some more specific examples
of priority inversion problems you have encountered previously, placed
somewhere in our documentation. There is essentially nothing like
that in the tree.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-21 18:58     ` Johannes Weiner
@ 2020-02-25 13:37       ` Michal Koutný
  2020-02-25 15:03         ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2020-02-25 13:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

[-- Attachment #1: Type: text/plain, Size: 3879 bytes --]

Hello.

On Fri, Feb 21, 2020 at 01:58:39PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> When you set task's and logger's memory.low to "max" or 10G or any
> bogus number like this, a limit reclaim in job treats this as origin
> protection and tries hard to avoid reclaiming anything in either of
> the two cgroups.
What do you mean by origin protection? (I'm starting to see some
misunderstanding here, c.f. my remark regarding the parent==root
condition in the other patch [1]).

> memory.events::low skyrockets even though no intended
> protection was violated, and we'll have reclaim latencies (especially when
> there are a few dying cgroups accumulated in the subtree).
Hopefully, I see where you are coming from. There would be no (false)
low notifications if the elow was calculated all the way top-down from
the real root. Would such a calculation be the way to go?

> that job can't possibly *know* about the top-level host
> protection that lies beyond the delegation point and outside its own
> namespace,
Yes, I agree.

> and that it needs to propagate protection against rpm upgrades into
> its own leaf groups for each tasklet and component.
If a job wants to use concrete protection then it sets it; if it wants
to use the protection from above, then it can express that with infinity
(after changing the effective calculation I described above).

Now, you may argue that the infinity would be nonsensical if it's not a
subordinate job. The simplest approach would likely be to introduce the
special "inherit" value (such a literal name may be misleading as it
would be also "dont-care").

> Again, in practice we have found this to be totally unmanageable: we
> routinely first forgot and then had trouble hacking the propagation
> into random jobs that create their own groups.
I've been bitten by this as well. However, the protection defaults to
off, and I find this matches the general rule that the kernel provides the
mechanism and user(space) the policy.

> And when you add new hardware configurations, you cannot just make a
> top-level change in the host config, you have to update all the job
> specs of workloads running in the fleet.
(I acknowledge the current mechanism lacks an explicit way to express
the inherit/dont-care value.)


> My patch brings memory configuration in line with other cgroup2
> controllers.
Other controllers mostly provide limit or weight controls; I'd say
protection semantics are specific only to the memory controller so
far [2]. I don't think (at least for now) they can be aligned with the weight
or limit semantics.

> I've made the case why it's not a supported usecase, and why it is a
> meaningless configuration in practice due to the way other controllers
> already behave.
I see how your reasoning works for limits (you set memory limit and you
need to control io/cpu too to maintain intended isolation). I'm confused
why having a scapegoat (or donor) sibling for protection should not be
supported or how it breaks protection for others if not combined with
io/cpu controllers. Feel free to point me to the message if I overlooked
it.

> I think at this point in the discussion, the only thing I can do is
> remind you 
I think there are different perceptions on both sides because of unclear
definitions, so I seek to clarify that.

> that the behavior I'm introducing is gated behind a mount
> option that nobody is forced to enable if they insist on disagreeing
> against all evidence to the contrary.
Sure, but I think it's better to (try reaching|reach) a consensus than
to introduce split behavior (maintenance, debugging).

Thanks,
Michal

[1] https://lore.kernel.org/lkml/20200221171024.GA23476@blackbody.suse.cz/
[2] I admit I didn't look into io.latency that much and IIRC
    cpu.uclamp.{min,max} don't have the overcommit issue.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-25 13:37       ` Michal Koutný
@ 2020-02-25 15:03         ` Johannes Weiner
  2020-02-26 13:22           ` Michal Koutný
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-25 15:03 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello Michal,

On Tue, Feb 25, 2020 at 02:37:20PM +0100, Michal Koutný wrote:
> On Fri, Feb 21, 2020 at 01:58:39PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > When you set task's and logger's memory.low to "max" or 10G or any
> > bogus number like this, a limit reclaim in job treats this as origin
> > protection and tries hard to avoid reclaiming anything in either of
> > the two cgroups.
> What do you mean by origin protection? (I'm starting to see some
> misunderstanding here, c.f. my remark regarding the parent==root
> condition in the other patch [1]).

By origin protection I mean protection values at the first level of
children in a reclaim scope. Those are taken as absolute numbers
during a reclaim cycle and propagated down the tree.

Say you have the following configuration:

                root_mem_cgroup
               /
              A (max=12G, low=10G)
             /
            B (low=max)

If global reclaim occurs, the protection for subtree A is 10G, and B
then gets a proportional share of that.

However, if limit reclaim in A occurs due to the 12G max limit, the
protection for subtree B is max.

> > memory.events::low skyrockets even though no intended
> > protection was violated, and we'll have reclaim latencies (especially when
> > there are a few dying cgroups accumulated in the subtree).
> Hopefully, I see where you are coming from. There would be no (false)
> low notifications if the elow was calculated all the way top-down from
> the real root. Would such a calculation be the way to go?

That hinges on whether an opt-out mechanism makes sense, and we
disagree on that part.

> > that job can't possibly *know* about the top-level host
> > protection that lies beyond the delegation point and outside its own
> > namespace,
> Yes, I agree.
> 
> > and that it needs to propagate protection against rpm upgrades into
> > its own leaf groups for each tasklet and component.
> If a job wants to use concrete protection then it sets it; if it wants
> to use the protection from above, then it can express that with infinity
> (after changing the effective calculation I described above).
> 
> Now, you may argue that the infinity would be nonsensical if it's not a
> subordinate job. The simplest approach would likely be to introduce the
> special "inherit" value (such a literal name may be misleading as it
> would be also "dont-care").

Again, a complication of the interface for *everybody* on the premise
that retaining an opt-out mechanism makes sense. We disagree on that.

> > Again, in practice we have found this to be totally unmanageable: we
> > routinely first forgot and then had trouble hacking the propagation
> > into random jobs that create their own groups.
> I've been bitten by this as well. However, the protection defaults to
> off, and I find this matches the general rule that the kernel provides the
> mechanism and user(space) the policy.
>
> > And when you add new hardware configurations, you cannot just make a
> > top-level change in the host config, you have to update all the job
> > specs of workloads running in the fleet.
> (I acknowledge the current mechanism lacks an explicit way to express
> the inherit/dont-care value.)
> 
> 
> > My patch brings memory configuration in line with other cgroup2
> > controllers.
> Other controllers mostly provide limit or weight controls; I'd say
> protection semantics are specific only to the memory controller so
> far [2]. I don't think (at least for now) they can be aligned with the weight
> or limit semantics.

Can you explain why you think protection is different from a weight?

Both specify a minimum amount of a resource that the cgroup can use
under contention, while allowing the cgroup to use more than that
share if there is no contention with siblings.

You configure memory in bytes instead of a relative proportion, but
that's only because bytes are a natural unit of memory whereas a
relative proportion of time is a natural unit of CPU and IO.

I'm having trouble concluding from this that the inheritance rules
must be fundamentally different.

For example, if you assign a share of CPU or IO to a subtree, that
applies to the entire subtree. Nobody has proposed being able to
opt out of shares in a subtree, let alone forcing individual cgroups
to *opt-in* to receive these shares.

I can't fathom why you think assigning pieces of memory to a subtree
must be fundamentally different.

> > I've made the case why it's not a supported usecase, and why it is a
> > meaningless configuration in practice due to the way other controllers
> > already behave.
> I see how your reasoning works for limits (you set memory limit and you
> need to control io/cpu too to maintain intended isolation). I'm confused
> why having a scapegoat (or donor) sibling for protection should not be
> supported or how it breaks protection for others if not combined with
> io/cpu controllers. Feel free to point me to the message if I overlooked
> it.

Because a lack of memory translates to paging, which consumes IO and
CPU. If you relinquish a cgroup's share of memory (whether with a
limit or with a lack of protection under pressure), you increases its
share of IO. To express a priority order between workloads, you cannot
opt out of memory protection without also opting out of the IO shares.

Say you have the following configuration:

                   A
                  / \
                 B   C
                /\
               D  E

D houses your main workload, C a secondary workload, E is not
important. You give B protection and C less protection. You opt E out
of B's memory share to give it all to D. You established a memory
order of D > C > E.

Now to the IO side. You assign B a higher weight than C, and D a
higher weight than E.

Now you apply memory pressure. What happens? D isn't reclaimed, C is
somewhat reclaimed, E is reclaimed hard. D will not page, C will page
a little bit, E will page hard *with the higher IO priority of B*.

Now C is stuck behind E. This is a priority inversion.

Yes, from a pure accounting perspective, you've managed to enforce
that E will never have more physical pages allocated than C at any
given time. But what did that accomplish? What was the practical
benefit of having made E a scapegoat?

Since I'm repeating myself on this topic, I would really like to turn
your questions around:

1. Can you please make a practical case for having scapegoats or
   donor groups to justify retaining what I consider to be an
   unimportant artifact in the memory.low semantics?

2. If you think opting out of hierarchically assigned resources is a
   fundamentally important usecase, can you please either make an
   argument why it should also apply to CPU and IO, or alternatively
   explain in detail why they are meaningfully different?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-25 12:20                                       ` Michal Hocko
@ 2020-02-25 18:17                                         ` Johannes Weiner
  2020-02-26 17:56                                           ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-25 18:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue, Feb 25, 2020 at 01:20:28PM +0100, Michal Hocko wrote:
> 
> On Fri 21-02-20 10:43:59, Johannes Weiner wrote:
> > On Fri, Feb 21, 2020 at 11:11:47AM +0100, Michal Hocko wrote:
> [...]
> > > I also have a hard time grasping what you actually mean by the above.
> > > Let's say you have a hierarchy where you split the low limit unevenly
> > >               root (5G of memory)
> > >              /    \
> > >    (low 3G) A      D (low 1.5G)
> > >            / \
> > >  (low 1G) B   C (low 2G)
> > >
> > > B gets lower priority than C and D while C gets higher priority than
> > > D? Is there any problem with such a configuration from the semantic
> > > point of view?
> > 
> > No, that's completely fine.
> 
> How is B (low $EPS) C (low 3-$EPS) where $EPS->0 so much different
> from the above? You prioritize C over B, D over B in both cases under
> global memory pressure.

You snipped the explanation for the caveat / the priority inversion
that followed; it would be good to reply to that instead.

> > > > However, that doesn't mean this usecase isn't supported. You *can*
> > > > always split cgroups for separate resource policies.
> > > 
> > > What if the split-up is not possible or impractical? Let's say you want
> > > to control how much CPU share your container workload gets compared
> > > to other containers running on the system. Or let's say you want to
> > > treat the whole container as a single entity from the OOM perspective
> > > (this would be an example of the logical organization constraint) because
> > > you do not want to leave any part of that workload lingering behind if
> > > the global OOM kicks in. I am pretty sure there are many other reasons
> > > to run related workloads that don't really share the memory protection
> > > demand under a shared cgroup hierarchy.
> > 
> > The problem is that your "pretty sure" has been proven to be very
> > wrong in real life. And that's one reason why these arguments are so
> > frustrating: it's your intuition and gut feeling against the
> > experience of using this stuff hands-on in large scale production
> > deployments.
> 
> I am pretty sure you have a lot of experience with the FB workloads.
> And I haven't ever questioned that. All I am trying to explore here is
> what the consequences of the newly proposed semantics are. I have provided
> a few examples of when an opt-out from memory protection might be
> practical. You seem to disagree on the relevance of those usecases and I can
> live with that.

I didn't dismiss them as irrelevant, I repeatedly gave extensive
explanations based on real world examples for why they cannot work.

Look at the example I gave to Michal K. about the low-priority "donor"
cgroup that gives up memory to the rest of the tree. Not only is that
workload not contained, the low-pri memory setting itself makes life
actively worse for higher priority cgroups due to increasing paging.

You have consistently dismissed or not engaged with this argument of
priority inversions through other resources.

> Not that I am fully convinced, because there is a
> difference between the very tight resource control which is your primary
> usecase and much simpler deployments focusing on particular resources,
> which tend to work most of the time, where occasional failures are
> acceptable.

It's been my experience that "works most of the time" combined with
occasional failure doesn't exist. Failure is immediate once resources
become contended (and you don't need cgroups without contention). And
I have explained why that is the case.

You keep claiming that FB somehow has special requirements that other
users don't have. What is this claim based on? All we're trying to do
is isolate general purpose workloads from each other and/or apply
relative priorities between them.

What would simpler deployments look like?

If I run a simple kernel build job on my system right now, setting a
strict memory limit on it will make performance of the rest of the
system worse than if I didn't set one, due to the IO flood from
paging. (There is no difference between setting a strict memory.max on
the compile job or a very high memory.low protection on the rest of
the system, the end result is that the workload will page trying to
fit into the small amount of space left for it.)

> That being said, the new interface requires an explicit opt-in via a mount
> option, so there is no risk of regressions. So I can live with it. Please
> make sure to document explicitly that the effective low limit protection
> doesn't allow opting out even when the limit is set to 0 and the
> propagated protection is fully assigned to a sibling memcg.

I can mention this in the changelog, no problem.

> It would also be really appreciated if we had some more specific examples
> of priority inversion problems you have encountered previously, placed
> somewhere in our documentation. There is essentially nothing like
> that in the tree.

Of course, I wouldn't mind doing that in a separate patch. How about a
section in cgroup-v2.rst, at "Issues with v1 and Rationales for v2"?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations
  2020-02-21 17:10   ` Michal Koutný
@ 2020-02-25 18:40     ` Johannes Weiner
  2020-02-26 16:46       ` Michal Koutný
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-25 18:40 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello Michal,

On Fri, Feb 21, 2020 at 06:10:24PM +0100, Michal Koutný wrote:
> On Thu, Dec 19, 2019 at 03:07:17PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > + *    Consider the following example tree:
> >   *
> > + *        A      A/memory.low = 2G, A/memory.current = 6G
> > + *       //\\
> > + *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
> > + *               C/memory.low = 1G  C/memory.current = 2G
> > + *               D/memory.low = 0   D/memory.current = 2G
> > + *               E/memory.low = 10G E/memory.current = 0
> >   *
> > + *    and memory pressure is applied, the following memory
> > + *    distribution is expected (approximately*):
> >   *
> > + *      A/memory.current = 2G
> > + *      B/memory.current = 1.3G
> > + *      C/memory.current = 0.6G
> > + *      D/memory.current = 0
> > + *      E/memory.current = 0
> >   *
> > + *    *assuming equal allocation rate and reclaimability
> I think the assumptions for this example don't hold (anymore).
> Because reclaim rate depends on the usage above protection, the siblings
> won't be reclaimed equally and so the low_usage proportionality will
> change over time and the equilibrium distribution is IMO different (I'm
> attaching an Octave script to calculate it).

Hm, this example doesn't change with my patch because there is no
"floating" protection that gets distributed among the siblings.

In my testing with the above parameters, the equilibrium still comes
out to roughly this distribution.

> As it depends on the initial usage, I don't think such a general
> example can be given (for overcommit).

It's just to illustrate the pressure weight, not to reflect each
factor that can influence the equilibrium.

I think it still has value to gain understanding of how it works, no?

> > @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
> >   * for next usage. This part is intentionally racy, but it's ok,
> >   * as memory.low is a best-effort mechanism.
> Although it's a different issue, since this updates the docs I'm
> mentioning it -- we treat memory.min the same, i.e. it's subject to the
> same race; however, it's not meant to be best effort. I didn't look into
> the outcomes of potential misaccounting but the comment seems to miss the
> impact on memory.min protection.

Yeah I think we can delete that bit.

> > @@ -6292,52 +6333,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > [...]
> > +	if (parent == root) {
> > +		memcg->memory.emin = memcg->memory.min;
> > +		memcg->memory.elow = memcg->memory.low;
> > +		goto out;
> >  	}
> Shouldn't this condition be 'if (parent == root_mem_cgroup)'? (I.e. 1st
> level takes direct input, but 2nd and further levels redistribute only
> what they really got from parent.)

I believe we cleared this up in the parallel thread, but just in case:
reclaim can happen due to a memory.max set lower in the
tree. memory.low propagation is always relative to the reclaim
scope, not the system-wide root cgroup.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-25 15:03         ` Johannes Weiner
@ 2020-02-26 13:22           ` Michal Koutný
  2020-02-26 15:05             ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2020-02-26 13:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello
and thanks for continuing the debate.

On Tue, Feb 25, 2020 at 10:03:04AM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> By origin protection I mean protection [...]
> If global reclaim occurs, [...]
> However, if limit reclaim in A occurs due to the 12G max limit [...]
My previous thinking was too bound to the absolute/global POV. Hence,
the effectiveness of memory.low is relative to the reclaim origin:
- full value -- protection against siblings (i.e. limit in parent).
- reduced value (share) -- protection against (great)-uncles (i.e. limit in
  (great)-grandparent).

(And depending on the absolute depth, it means respective protection
against global reclaim too.)

I see this didn't change since the original implementation w/out
effective protections. So it was just me being confused that protection
can be overcommitted locally (but not globally/at a higher level, so it's
consistent).

> That hinges on whether an opt-out mechanism makes sense, and we
> disagree on that part.
After my correction above, the calculation I had proposed would reduce
protection unnecessarily for reclaims triggered by nearby limits.

> > The simplest approach would likely be to introduce the special "inherit"
> > value (such a literal name may be misleading as it would be also
> > "dont-care").
> 
> Again, a complication of the interface for *everybody* 
Not if the special value is the new default (alas, that would still need
the mount option).


> Can you explain why you think protection is different from a weight?
- weights are dimensionless, they represent no real resource
- sum of sibling weights is meaningless (and independent from parent
  weight)
- to me this protection is closer to limits (actually I like your simile
  that they're very lazily enforced limits)

> Both specify a minimum amount of a resource that the cgroup can use
> under contention, while allowing the cgroup to use more than that
> share if there is no contention with siblings.
>
> You configure memory in bytes instead of a relative proportion, but
> that's only because bytes are a natural unit of memory whereas a
> relative proportion of time is a natural unit of CPU and IO.
Weights specify a ratio (between siblings), not an amount. A single weight
is meaningless (the meaningful proportion would be the fraction of
cpu.max, i.e. relative to the absolute resource).

With weights, non-competing siblings drop out of the denominator;
with protection, non-competing siblings (in the sense of not consuming
their allowance) may add resources back to the pool (given by the parent).

> For example, if you assign a share of CPU or IO to a subtree, that
> applies to the entire subtree. Nobody has proposed being able to
> opt out of shares in a subtree, let alone forcing individual cgroups
> to *opt-in* to receive these shares.
The former is because it makes no sense to deny all CPU/IO, and the latter
is a consequence of that too.

> Now you apply memory pressure, what happens?. D isn't reclaimed, C is
> somewhat reclaimed, E is reclaimed hard. D will not page, C will page
> a little bit, E will page hard *with the higher IO priority of B*.
> 
> Now C is stuck behind E. This is a priority inversion.
This is how I understand the weights to work.

    A				
    `- B		io.weight=200
       `- D		io.weight=100 (e.g.)
       `- E		io.weight=100 (e.g.)
    `- C		io.weight=50

Whatever weights I assign to D and E, when only E and C compete, E will
have the higher weight (200 to 50; weights are work-conserving).
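
Spelling out that arithmetic (assuming standard top-down,
work-conserving weight resolution, with D idle):

#include <stdio.h>

int main(void)
{
	/* Only B's subtree (via E) and C compete under A; D is idle,
	 * so E inherits all of B's share. */
	int wB = 200, wC = 50;

	printf("B's side, i.e. E: %d%%\n", 100 * wB / (wB + wC)); /* 80% */
	printf("C:                %d%%\n", 100 * wC / (wB + wC)); /* 20% */
	return 0;
}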

I don't think this inversion is wrong because E's work is still on
behalf of B.

Or did you mean that if protections were transformed (via effective
calculation) to have ratios only in the same range as io.weights
(1e-4..1e4 instead of 0..inf), then it'd prevent the inversion? (By
setting D,E weights in same ratios as D,E protections.)

> 1. Can you please make a practical use case for having scapegoats or
>    donor groups to justify retaining what I consider to be an
>    unimportant artifact in the memory.low semantics?
    A.low=10G
    `- B.low=X   u=6G
    `- C.low=X   u=4G
    `- D.low=0G  u=5G

B,C   run the workload which should be protected
D     runs a job that doesn't need any protection
u     denotes usage
(I made the example with more than one important sibling to illustrate
the usefulness of some implicit distribution X.)

When outer reclaim comes, reclaiming from B,C would be detrimental to
their performance, while the impact on D is unimportant. (And so is the
induced IO load on the rest, outside of A.)

It's not possible to move D to A's level, since A is all that a given
user can control.
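
If I model the proposed recursive rule -- unclaimed parental protection
spread in proportion to usage, which is how I read the series -- the
scapegoat D gets protected anyway:

#include <stdio.h>

int main(void)
{
	/* A.low = 10G over B, C, D with X left unset (0): all of it
	 * is unclaimed and spreads in proportion to usage. */
	const char *name[] = { "B", "C", "D" };
	double usage[] = { 6, 4, 5 };            /* in G */
	double parent_elow = 10, parent_usage = 6 + 4 + 5;

	for (int i = 0; i < 3; i++)
		printf("%s: elow = %.2fG\n", name[i],
		       parent_elow * usage[i] / parent_usage);
	return 0;  /* B: 4.00G, C: 2.67G, D: 3.33G -- D can't opt out */
}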

> 2. If you think opting out of hierarchically assigned resources is a
>    fundamentally important usecase, can you please either make an
>    argument why it should also apply to CPU and IO, or alternatively
>    explain in detail why they are meaningfully different?
I'd say that protected memory is a disposable resource in contrast with
CPU/IO. If you don't have the latter, you don't progress; if you lack the
former, you are refaulting but can make progress. Even more, you should
be able to give up memory.min.

Michal


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-26 13:22           ` Michal Koutný
@ 2020-02-26 15:05             ` Johannes Weiner
  2020-02-27 13:35               ` Michal Koutný
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Weiner @ 2020-02-26 15:05 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

Hello,

On Wed, Feb 26, 2020 at 02:22:37PM +0100, Michal Koutný wrote:
> On Tue, Feb 25, 2020 at 10:03:04AM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Can you explain why you think protection is different from a weight?
> - weights are dimension-less, they represent no real resource

They still ultimately translate to real resources. The concrete value
depends on what the parent's weight translates to, and it depends on
sibling configurations and their current consumption. (All of this is
already true for memory protection as well, btw). But eventually, a
weight specification translates to actual time on a CPU, bandwidth on
an IO device etc.

> - sum of sibling weights is meaningless (and independent from parent
>   weight)

Technically true for overcommitted memory.low values as well.

> - to me this protection is closer to limits (actually I like your simile
>   that they're very lazily enforced limits)

But weights are also lazily enforced limits. Without competition, you
can get 100% regardless of your weight; under contention, you get
throttled/limited back to an assigned share, however that's specified.

Once you peel away the superficial layer of how resources, or shares
of resources, are being referred to (time, bytes, relative shares),
weights/guarantees/protections are all the same thing: they are lazily
enforced* partitioning rules of a resource under contention.

I don't see a fundamental difference between them. And that in turn
makes it hard for me to accept that hierarchical inheritance rules
should be different.

* We also refer to this as work-conserving

> > Now you apply memory pressure, what happens? D isn't reclaimed, C is
> > somewhat reclaimed, E is reclaimed hard. D will not page, C will page
> > a little bit, E will page hard *with the higher IO priority of B*.
> > 
> > Now C is stuck behind E. This is a priority inversion.
> This is how I understand the weights to work.
> 
>     A				
>     `- B		io.weight=200
>        `- D		io.weight=100 (e.g.)
>        `- E		io.weight=100 (e.g.)
>     `- C		io.weight=50
> 
> Whatever weights I assign to D and E, when only E and C compete, E will
> have the higher weight (200 to 50; weights are work-conserving).

Yes, exactly. I'm saying the same should be true for memory.

> I don't think this inversion is wrong because E's work is still on
> behalf of B.

"Wrong" isn't the right term. Is it what you wanted to express in your
configuration?

What's the point of designating E a memory donor group that needs to
relinquish memory to C under pressure, but when it actually gives up
that memory it beats C in competition over a different resource?

You are talking about a mathematical truth on a per-controller
basis. What I'm saying is that I don't see how this is useful for real
workloads, their relative priorities, and the performance expectations
users have from these priorities.

With a priority inversion like this, there is no actual performance
isolation or containerization going on here - which is the whole point
of cgroups and resource control.

> Or did you mean that if protections were transformed (via effective
> calculation) to have ratios only in the same range as io.weights
> (1e-4..1e4 instead of 0..inf), then it'd prevent the inversion? (By
> setting D,E weights in same ratios as D,E protections.)

No, the inversion would be prevented if E could consume all resources
assigned to B that aren't consumed by D.

This is true for IO and CPU, but before my patch not for memory.

> > 1. Can you please make a practical use case for having scapegoats or
> >    donor groups to justify retaining what I consider to be an
> >    unimportant artifact in the memory.low semantics?
>     A.low=10G
>     `- B.low=X   u=6G
>     `- C.low=X   u=4G
>     `- D.low=0G  u=5G
> 
> B,C   run the workload which should be protected
> D     runs a job that doesn't need any protection
> u     denotes usage
> (I made the example with more than one important sibling to illustrate
> the usefulness of some implicit distribution X.)
> 
> When outer reclaim comes, reclaiming from B,C would be detrimental to
> their performance, while the impact on D is unimportant. (And so is the
> induced IO load on the rest, outside of A.)

Okay, but this is a different usecase than we were talking about.

My objection is to opting out of protection against cousins (thus
overriding parental resource assignment), not against siblings.

Expressing priorities between siblings like this is fine. And I
absolutely see practical value in your specific example.

> It's not possible to move D to A's level, since A is all that a given
> user can control.

Correct, but you can change the tree to this:

     A.low=10G
     `- A1.low=10G
        `- B.low=0G
        `- C.low=0G
     `- D.low=0G

to express

A1 > D
 B = C

That priority order can be matched by CPU and IO controls as well:

     A.weight=100
     `- A1.weight=100
        `- B.weight=100
        `- C.weight=100
     `- D.weight=1

My objection is purely about opting out of resources relative to (and
assuming a lower memory priority than) an outside cousin that may have
a lower priority on other resources.

That is, I would like to see an argument for this setup:

     A				
     `- B		io.weight=200          memory.low=10G
        `- D		io.weight=100 (e.g.)   memory.low=10G
        `- E		io.weight=100 (e.g.)   memory.low=0
     `- C		io.weight=50           memory.low=5G

Where E has no memory protection against C, but E has IO priority over
C. That's the configuration that cannot be expressed with a recursive
memory.low, but since it involves priority inversions it's not useful
to actually isolate and containerize workloads.

That's why I'm saying it's an artifact, not an actual feature.

> > 2. If you think opting out of hierarchically assigned resources is a
> >    fundamentally important usecase, can you please either make an
> >    argument why it should also apply to CPU and IO, or alternatively
> >    explain in detail why they are meaningfully different?
> I'd say that protected memory is a disposable resource in contrast with
> CPU/IO. If you don't have the latter, you don't progress; if you lack the
> former, you are refaulting but can make progress. Even more, you should
> be able to give up memory.min.

Eh, I'm not buying that. You cannot run without memory either. If
somebody reclaims a page between you faulting it in and you resuming
to userspace, there is no forward progress.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations
  2020-02-25 18:40     ` Johannes Weiner
@ 2020-02-26 16:46       ` Michal Koutný
  2020-02-26 19:40         ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2020-02-26 16:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team


[-- Attachment #1.1: Type: text/plain, Size: 2642 bytes --]

On Tue, Feb 25, 2020 at 01:40:14PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Hm, this example doesn't change with my patch because there is no
> "floating" protection that gets distributed among the siblings.
Maybe it had changed even earlier and the example became obsolete.

> In my testing with the above parameters, the equilibrium still comes
> out to roughly this distribution.
I'm attaching my test (10 times smaller) and I'm getting these results:

> /sys/fs/cgroup/test.slice/memory.current:838750208
> /sys/fs/cgroup/test.slice/pressure.service/memory.current:616972288
> /sys/fs/cgroup/test.slice/test-A.slice/memory.current:221782016
> /sys/fs/cgroup/test.slice/test-A.slice/B.service/memory.current:123428864
> /sys/fs/cgroup/test.slice/test-A.slice/C.service/memory.current:93495296
> /sys/fs/cgroup/test.slice/test-A.slice/D.service/memory.current:4702208
> /sys/fs/cgroup/test.slice/test-A.slice/E.service/memory.current:155648

(I'm running that on 5.6.0-rc2 + first two patches of your series.)

That's IMO closer to my simulation (1.16:0.84)
than to the example's prediction (1.3:0.6).

> It's just to illustrate the pressure weight, not to reflect each
> factor that can influence the equilibrium.
But it's good to have some idea about the equilibrium when configuring
the values. 

> I think it still has value to gain understanding of how it works, no?
Alas, the example confused me so that I had to write the simulation to
get a grasp of it :-)

And even when running actual code now, I'd say the values in the
original example are only one equilibrium of many, and definitely not
reachable from the stated initial conditions.

> > > @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
> > >   * for next usage. This part is intentionally racy, but it's ok,
> > >   * as memory.low is a best-effort mechanism.
> > Although it's a different issue, since this updates the docs I'm
> > mentioning it -- we treat memory.min the same, i.e. it's subject to the
> > same race; however, it's not meant to be best effort. I didn't look into
> > the outcomes of potential misaccounting, but the comment seems to miss
> > the impact on memory.min protection.
> 
> Yeah I think we can delete that bit.
Erm, which part?
Make the racy behavior undocumented, or that it applies to both
memory.low and memory.min?

> I believe we cleared this up in the parallel thread, but just in case:
> reclaim can happen due to a memory.max set lower in the
> tree. memory.low propagation is always relative to the reclaim
> scope, not the system-wide root cgroup.
Clear now.

Michal

[-- Attachment #1.2: run.sh --]
[-- Type: application/x-sh, Size: 1293 bytes --]

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-25 18:17                                         ` Johannes Weiner
@ 2020-02-26 17:56                                           ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2020-02-26 17:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Andrew Morton, Roman Gushchin, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue 25-02-20 13:17:55, Johannes Weiner wrote:
> On Tue, Feb 25, 2020 at 01:20:28PM +0100, Michal Hocko wrote:
> > 
> > On Fri 21-02-20 10:43:59, Johannes Weiner wrote:
> > > On Fri, Feb 21, 2020 at 11:11:47AM +0100, Michal Hocko wrote:
> > [...]
> > > > I also have a hard time grasping what you actually mean by the above.
> > > > Let's say you have a hierarchy where you split out the low limit unevenly
> > > >               root (5G of memory)
> > > >              /    \
> > > >    (low 3G) A      D (low 1.5G)
> > > >            / \
> > > >  (low 1G) B   C (low 2G)
> > > >
> > > > B gets lower priority than C and D while C gets higher priority than
> > > > D? Is there any problem with such a configuration from the semantic
> > > > point of view?
> > > 
> > > No, that's completely fine.
> > 
> > How is B (low $EPS), C (low 3-$EPS) where $EPS->0 so much different
> > from the above? You prioritize C over B, and D over B, in both cases
> > under global memory pressure.
> 
> You snipped the explanation for the caveat / the priority inversion
> that followed; it would be good to reply to that instead.

OK, I have misread your response then. You were saying that the above
example is perfectly fine while it actually matches your low, mid, high
prio example, so I was confused.

[...]

I am skipping a large part of your response, mostly because I believe
that will make it easier to move this forward.

> It would also be really appreciated if we had some more specific examples
> of priority inversion problems you have encountered previously and placed
> them somewhere in our documentation. There is essentially nothing like
> that in the tree.
> 
> Of course, I wouldn't mind doing that in a separate patch. How about a
> section in cgroup-v2.rst, at "Issues with v1 and Rationales for v2"?

Yes, makes sense. A subsection about examples/best practices/lessons
learned where we can add more sounds like a very good thing in general.
Regardless of what tends to work for some deployments, it is always
good to give examples of what hasn't worked for others and was
non-trivial to find out. This can save countless hours for other
people.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations
  2020-02-26 16:46       ` Michal Koutný
@ 2020-02-26 19:40         ` Johannes Weiner
  0 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2020-02-26 19:40 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

On Wed, Feb 26, 2020 at 05:46:32PM +0100, Michal Koutný wrote:
> On Tue, Feb 25, 2020 at 01:40:14PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Hm, this example doesn't change with my patch because there is no
> > "floating" protection that gets distributed among the siblings.
> Maybe it had changed even earlier and the example became obsolete.
> 
> > In my testing with the above parameters, the equilibrium still comes
> > out to roughly this distribution.
> I'm attaching my test (10 times smaller) and I'm getting these results:
> 
> > /sys/fs/cgroup/test.slice/memory.current:838750208
> > /sys/fs/cgroup/test.slice/pressure.service/memory.current:616972288
> > /sys/fs/cgroup/test.slice/test-A.slice/memory.current:221782016
> > /sys/fs/cgroup/test.slice/test-A.slice/B.service/memory.current:123428864
> > /sys/fs/cgroup/test.slice/test-A.slice/C.service/memory.current:93495296
> > /sys/fs/cgroup/test.slice/test-A.slice/D.service/memory.current:4702208
> > /sys/fs/cgroup/test.slice/test-A.slice/E.service/memory.current:155648
> 
> (I'm running that on 5.6.0-rc2 + first two patches of your series.)
> 
> That's IMO closer to my simulation (1.16:0.84)
> than to the example's prediction (1.3:0.6).

I think you're correct about the moving points of equilibrium. I'm
going to experiment more with your script. I had written it off as
noise from LRU rotations, reclaim concurrency etc., but your script
shows that these points do shift around as the input parameters
change. This is useful, thanks.

But AFAICS, my patches here do not change or introduce this behavior
so it's a longer-standing issue.

> > It's just to illustrate the pressure weight, not to reflect each
> > factor that can influence the equilibrium.
> But it's good to have some idea about the equilibrium when configuring
> the values. 

Agreed.

> > > > @@ -6272,12 +6262,63 @@ struct cgroup_subsys memory_cgrp_subsys = {
> > > >   * for next usage. This part is intentionally racy, but it's ok,
> > > >   * as memory.low is a best-effort mechanism.
> > > Although it's a different issue, since this updates the docs I'm
> > > mentioning it -- we treat memory.min the same, i.e. it's subject to the
> > > same race; however, it's not meant to be best effort. I didn't look into
> > > the outcomes of potential misaccounting, but the comment seems to miss
> > > the impact on memory.min protection.
> > 
> > Yeah I think we can delete that bit.
> Erm, which part?
> Make the racy behavior undocumented, or that it applies to both
> memory.low and memory.min?

I'm honestly not sure what it's trying to say. Reclaim and charging
can happen concurrently, so the protection enforcement, whether it's
min or low, has always been racy. Even OOM itself is racy.

Caching the parental value while iterating over a group of siblings
shouldn't fundamentally alter that outcome. We do enough priority
iterations and reclaim loops from the allocator that we shouldn't see
premature OOMs or apply totally incorrect balances because of that.

So IMO the statement that "this is racy, but low is best-effort
anyway, so it's okay" is misleading. I think more accurate would be to
say that reclaim is fundamentally racy, so a bit of additional noise
here doesn't matter.

Either way, I don't find this paragraph all that useful. If you think
it's informative, could you please let me know which important aspect
it communicates? Otherwise, I'm inclined to delete it.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-26 15:05             ` Johannes Weiner
@ 2020-02-27 13:35               ` Michal Koutný
  2020-02-27 15:06                 ` Johannes Weiner
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Koutný @ 2020-02-27 13:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

[-- Attachment #1: Type: text/plain, Size: 5202 bytes --]

TL;DR I see merit in the recursive propagation if it's requested
explicitly (i.e. retaining the meaning of 0). The protection/weight
semantics should be refined.

On Wed, Feb 26, 2020 at 10:05:48AM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> They still ultimately translate to real resources. The concrete value
> depends on what the parent's weight translates to, and it depends on
> sibling configurations and their current consumption. (All of this is
> already true for memory protection as well, btw). But eventually, a
> weight specification translates to actual time on a CPU, bandwidth on
> an IO device etc.
> 
> > - sum of sibling weights is meaningless (and independent from parent
> >   weight)
> 
> Technically true for overcommitted memory.low values as well.
Yes, but only for overcommitted values. For pure weights it doesn't
matter if you set 1:10, 10:100 or 100:1000; for the protection, this
ratio-only behavior holds only when approaching infinity, and since the
sum is compared against the parent's value, the protection behaves
differently.

[If there had to be some pure memory weights, those would for instance
express the relative affinity of a group's pages to physical memory.]
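
A toy contrast (assuming usage >= low everywhere, so low_usage == low):

#include <stdio.h>

int main(void)
{
	/* Weights: only the ratio matters, a common scale is a no-op. */
	for (int s = 1; s <= 100; s *= 10) {
		int wB = 1 * s, wC = 10 * s;
		printf("weights %4d:%-5d -> B gets %d%%\n",
		       wB, wC, 100 * wB / (wB + wC));   /* always 9% */
	}

	/* Protections: the sum is compared against the parent's value,
	 * so the same 1:5 ratio behaves differently at another scale. */
	long parent_elow = 10;                     /* in G */
	long lowB = 4, lowC = 20;   /* ratio 1:5, sum 24G > 10G */
	printf("low=4G -> elow=%ldG (scaled down)\n",
	       parent_elow * lowB / (lowB + lowC));         /* 1G */
	lowB = 1; lowC = 5;         /* same 1:5 ratio, sum 6G <= 10G */
	printf("low=1G -> elow=%ldG (kept in full)\n", lowB);
	return 0;
}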

> I don't see a fundamental difference between them. And that in turn
> makes it hard for me to accept that hierarchical inheritance rules
> should be different.
I'll try coming up with some better examples for the difference that I
perceive.

> "Wrong" isn't the right term. Is it what you wanted to express in your
> configuration?
I want to express an absolute amount of memory (ideally representing
the working set size) under protection.

IIUC, you want to express general relative priorities of B vs C when
some outer metric has to be maintained given you reach both limits of
memory and IO.

> You are talking about a mathematical truth on a per-controller
> basis. What I'm saying is that I don't see how this is useful for real
> workloads, their relative priorities, and the performance expectations
> users have from these priorities.
 
> With a priority inversion like this, there is no actual performance
> isolation or containerization going on here - which is the whole point
> of cgroups and resource control.
I acknowledge that by pressing too much along one dimension (memory) you
induce expansion in another dimension (IO), and that may become noticeable
in siblings (expansion over saturation [1]). But that's expected when only
weights are in use. If you wanted to hide the effect of workload B on C,
B would need a real limit.

[I beg to disagree that containerization is the whole point of cgroups;
it's a large part of it, hence the isolation needn't necessarily be
bi-directional.]

> My objection is to opting out of protection against cousins (thus
> overriding parental resource assignment), not against siblings.
Just to sync up the terminology - I'm calling this protection against
uncles (the composition/structure under them is irrelevant).
And the limitation comes from the grandparent or higher (or global).

...and the overridden parental resource assignment is the expansion in
the non-memory dimension (IO/CPU).

> Correct, but you can change the tree to this:
> 
>      A.low=10G
>      `- A1.low=10G
>         `- B.low=0G
>         `- C.low=0G
>      `- D.low=0G
> 
> to express
> 
> A1 > D
>  B = C
That sort of works (if I give up the scapegoat). Although it bothers me
that I have to copy the value from A to A1; I could have done that with
the previous hierarchy and simply set B.low=C.low=10G.

> That is, I would like to see an argument for this setup:
> 
>      A				
>      `- B		io.weight=200          memory.low=10G
>         `- D		io.weight=100 (e.g.)   memory.low=10G
>         `- E		io.weight=100 (e.g.)   memory.low=0
>      `- C		io.weight=50           memory.low=5G
> 
> Where E has no memory protection against C, but E has IO priority over
> C. That's the configuration that cannot be expressed with a recursive
> memory.low, but since it involves priority inversions it's not useful
> to actually isolate and containerize workloads.
But there can be no cousin (uncle), or more precisely it's the global
rest that we don't mind affecting.

> > I'd say that protected memory is a disposable resource in contrast with
> > CPU/IO. If you don't have the latter, you don't progress; if you lack the
> > former, you are refaulting but can make progress. Even more, you should
> > be able to give up memory.min.
> 
> Eh, I'm not buying that. You cannot run without memory either. If
> somebody reclaims a page between you faulting it in and you resuming
> to userspace, there is no forward progress.
I made a hasty argument (misinterpreting the constant outer reclaim
pressure). So that wasn't the fundamental difference.

The second part -- memory.min is subject to the same calculation as
memory.low. Do you find the scapegoat preventing OOM in a grandparent
(or higher) subtree also a misfeature/artifact?

Thanks,
Michal

[1] Here I'm taking your/Tejun's assumption that in stressful
situations it always boils down to IO, although I don't have any
quantitative arguments for that.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection
  2020-02-27 13:35               ` Michal Koutný
@ 2020-02-27 15:06                 ` Johannes Weiner
  0 siblings, 0 replies; 52+ messages in thread
From: Johannes Weiner @ 2020-02-27 15:06 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Tejun Heo, linux-mm,
	cgroups, linux-kernel, kernel-team

On Thu, Feb 27, 2020 at 02:35:44PM +0100, Michal Koutný wrote:
> On Wed, Feb 26, 2020 at 10:05:48AM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > I don't see a fundamental difference between them. And that in turn
> > makes it hard for me to accept that hierarchical inheritance rules
> > should be different.
> I'll try coming up with some better examples for the difference that I
> perceive.
> 
> > "Wrong" isn't the right term. Is it what you wanted to express in your
> > configuration?
> I want to express an absolute amount of memory (ideally representing
> the working set size) under protection.
> 
> IIUC, you want to express general relative priorities of B vs C when
> some outer metric has to be maintained given you reach both limits of
> memory and IO.

It's been our experience that it's basically impossible to control for
memory without having it result in IO contention.

You acknowledge below that this effect may be noticeable in some
situations. It's been our experience, however, that this effect is so
pronounced over a wide variety of workloads and host configurations
that exclusive memory control is not a practical application for
anything but niche cases - if they exist at all.

> > You are talking about a mathematical truth on a per-controller
> > basis. What I'm saying is that I don't see how this is useful for real
> > workloads, their relative priorities, and the performance expectations
> > users have from these priorities.
>  
> > With a priority inversion like this, there is no actual performance
> > isolation or containerization going on here - which is the whole point
> > of cgroups and resource control.
> I acknowledge that by pressing too much along one dimension (memory) you
> induce expansion in another dimension (IO), and that may become noticeable
> in siblings (expansion over saturation [1]). But that's expected when only
> weights are in use. If you wanted to hide the effect of workload B on C,
> B would need a real limit.
>
> [I beg to disagree that containerization is the whole point of cgroups;
> it's a large part of it, hence the isolation needn't necessarily be
> bi-directional.]

I said "isolation or containerization", and it really isn't a stretch
to see how the intended isolation can break down in this example.

You could set an IO limit on the scapegoat to keep it from inheriting
the higher IO priority from its parent.

But you could also just set a memory limit on the scapegoat to keep
it from inheriting the higher memory allowance from the parent.

Between all this, I really don't see an argument here to make the
memory hierarchy semantics different from the other controllers.

> > My objection is to opting out of protection against cousins (thus
> > overriding parental resource assignment), not against siblings.
> Just to sync up the terminology - I'm calling this protection against
> uncles (the composition/structure under them is irrelevant).
> And the limitation comes from the grandparent or higher (or global).

Yes, either way works.

> ...and the overridden parental resource assignment is the expansion in
> the non-memory dimension (IO/CPU).
> 
> > Correct, but you can change the tree to this:
> > 
> >      A.low=10G
> >      `- A1.low=10G
> >         `- B.low=0G
> >         `- C.low=0G
> >      `- D.low=0G
> > 
> > to express
> > 
> > A1 > D
> >  B = C
> That sort of works (if I give up the scapegoat). Although it bothers me
> that I have to copy the value from A to A1; I could have done that with
> the previous hierarchy and simply set B.low=C.low=10G.

D is still the scapegoat for B and C...?

> > That is, I would like to see an argument for this setup:
> > 
> >      A				
> >      `- B		io.weight=200          memory.low=10G
> >         `- D		io.weight=100 (e.g.)   memory.low=10G
> >         `- E		io.weight=100 (e.g.)   memory.low=0
> >      `- C		io.weight=50           memory.low=5G
> > 
> > Where E has no memory protection against C, but E has IO priority over
> > C. That's the configuration that cannot be expressed with a recursive
> > memory.low, but since it involves priority inversions it's not useful
> > to actually isolate and containerize workloads.
> But there can be no cousin (uncle), or more precisely it's the global
> rest that we don't mind affecting.

Okay, hold on.

You wouldn't care about starving the rest of the system of IO and
CPU. But the objection to my patch is that you want to give memory
back to avoid undue burden on the rest of the system?

Can we please stop talking about such contrived hypotheticals and
discuss real computer systems that real people actually care about?

> > > I'd say that protected memory is a disposable resource in contrast with
> > > CPU/IO. If you don't have the latter, you don't progress; if you lack the
> > > former, you are refaulting but can make progress. Even more, you should
> > > be able to give up memory.min.
> > 
> > Eh, I'm not buying that. You cannot run without memory either. If
> > somebody reclaims a page between you faulting it in and you resuming
> > to userspace, there is no forward progress.
> I made a hasty argument (misinterpreting the constant outer reclaim
> pressure). So that wasn't the fundamental difference.
> 
> The second part -- memory.min is subject to the same calculation as
> memory.low. Do you find the scapegoat preventing OOM in a grandparent
> (or higher) subtree also a misfeature/artifact?

What about CPU and IO?

If you knew exactly that the scapegoat doesn't need the memory, you
could set a memory limit on it - just like you could set a limit on
CPU and IO cycles to "give back" resources from inside a tree.

If you don't know exactly how much of the scapegoat's memory is and
isn't needed, the additional paging risk from getting it wrong would
be to the detriment of both your workload and the rest of the system -
your attempt to be good to the rest of the system suddenly turns into
a negative-sum game.

I fundamentally do not understand the practical application of the
configuration you are arguing tooth and nail needs to be supported.

If this is a dealbreaker, surely in a month of discussion and 30+
emails, it should have been possible to come up with *one* example of
a real workload and host configuration for which the ability to
dissent from the hierarchical memory allocation (but oddly, not other
resources) is the *only* way to express working resource isolation.

As it stands, I have provided examples of real workloads and host
configs that can't be expressed with the current semantics. As such, I
would like to move ahead with my changes. They are gated behind a
mount option, so pose no risk to the elusive setups you envision. We
can always implement the inheritance scheme you propose once we have
concrete examples of real life scenarios that aren't otherwise doable,
but there is certainly not enough evidence to make me implement it now
as a condition for merging my patches.


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2020-02-27 15:19 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-19 20:07 [PATCH v2 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
2019-12-19 20:07 ` [PATCH v2 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
2020-01-30 11:49   ` Michal Hocko
2020-02-03 21:21     ` Johannes Weiner
2020-02-03 21:38       ` Roman Gushchin
2019-12-19 20:07 ` [PATCH v2 2/3] mm: memcontrol: clean up and document effective low/min calculations Johannes Weiner
2020-01-30 12:54   ` Michal Hocko
2020-02-21 17:10   ` Michal Koutný
2020-02-25 18:40     ` Johannes Weiner
2020-02-26 16:46       ` Michal Koutný
2020-02-26 19:40         ` Johannes Weiner
2019-12-19 20:07 ` [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
2020-01-30 17:00   ` Michal Hocko
2020-02-03 21:52     ` Johannes Weiner
2020-02-10 15:21       ` Johannes Weiner
2020-02-11 16:47       ` Michal Hocko
2020-02-12 17:08         ` Johannes Weiner
2020-02-13  7:40           ` Michal Hocko
2020-02-13 13:23             ` Johannes Weiner
2020-02-13 15:46               ` Michal Hocko
2020-02-13 17:41                 ` Johannes Weiner
2020-02-13 17:58                   ` Johannes Weiner
2020-02-14  7:59                     ` Michal Hocko
2020-02-13 13:53             ` Tejun Heo
2020-02-13 15:47               ` Michal Hocko
2020-02-13 15:52                 ` Tejun Heo
2020-02-13 16:36                   ` Michal Hocko
2020-02-13 16:57                     ` Tejun Heo
2020-02-14  7:15                       ` Michal Hocko
2020-02-14 13:57                         ` Tejun Heo
2020-02-14 15:13                           ` Michal Hocko
2020-02-14 15:40                             ` Tejun Heo
2020-02-14 16:53                             ` Johannes Weiner
2020-02-14 17:17                               ` Tejun Heo
2020-02-17  8:41                               ` Michal Hocko
2020-02-18 19:52                                 ` Johannes Weiner
2020-02-21 10:11                                   ` Michal Hocko
2020-02-21 15:43                                     ` Johannes Weiner
2020-02-25 12:20                                       ` Michal Hocko
2020-02-25 18:17                                         ` Johannes Weiner
2020-02-26 17:56                                           ` Michal Hocko
2020-02-21 17:12   ` Michal Koutný
2020-02-21 18:58     ` Johannes Weiner
2020-02-25 13:37       ` Michal Koutný
2020-02-25 15:03         ` Johannes Weiner
2020-02-26 13:22           ` Michal Koutný
2020-02-26 15:05             ` Johannes Weiner
2020-02-27 13:35               ` Michal Koutný
2020-02-27 15:06                 ` Johannes Weiner
2019-12-19 20:22 ` [PATCH v2 0/3] mm: memcontrol: recursive memory protection Tejun Heo
2019-12-20  4:06 ` Roman Gushchin
2019-12-20  4:29 ` Chris Down

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).