linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] mm: memcontrol: recursive memory protection
@ 2019-12-13 19:21 Johannes Weiner
  2019-12-13 19:21 ` [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Johannes Weiner @ 2019-12-13 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

The current memory.low (and memory.min) semantics require protection
to be assigned to a cgroup in an uninterrupted chain from the
top-level cgroup all the way to the leaf.

In practice, we want to protect entire cgroup subtrees from each other
(system management software vs. workload), but we would like the VM to
balance memory optimally *within* each subtree, without having to make
explicit weight allocations among individual components. The current
semantics make that impossible.

This patch series extends memory.low/min such that the knobs apply
recursively to the entire subtree. Users can still assign explicit
protection to subgroups, but if they don't, the protection set by the
parent cgroup will be distributed dynamically such that children
compete freely - as if no memory control were enabled inside the
subtree - but enjoy protection from neighboring trees.
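
As a rough illustration (the full example and semantics are in patch
#3), consider a subtree

		A            A: low = 2G
               / \          A1: low = 1G
              A1 A2         A2: low = 0G

under outside memory pressure: A1 keeps its explicitly assigned 1G,
and A's remaining, unclaimed 1G is split between A1 and A2 in
proportion to usage - with equal demand, roughly 1.5G for A1 and 0.5G
for A2.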

Patch #1 fixes an existing bug that can give a cgroup tree more
protection than it should receive as per ancestor configuration.

Patch #2 simplifies and documents the existing code to make it easier
to reason about the changes in the next patch.

Patch #3 finally implements recursive memory protection semantics.

Because of a risk of regressing legacy setups, the new semantics are
hidden behind a cgroup2 mount option, 'memory_recursiveprot'.

More details in patch #3.

 Documentation/admin-guide/cgroup-v2.rst |  11 ++
 include/linux/cgroup-defs.h             |   5 +
 kernel/cgroup/cgroup.c                  |  17 ++-
 mm/memcontrol.c                         | 241 +++++++++++++++++++-----------
 mm/page_counter.c                       |  12 +-
 5 files changed, 190 insertions(+), 96 deletions(-)




* [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution
  2019-12-13 19:21 [PATCH 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
@ 2019-12-13 19:21 ` Johannes Weiner
  2019-12-13 20:40   ` Roman Gushchin
  2019-12-13 19:21 ` [PATCH 2/3] mm: memcontrol: clean up and document effective low/min calculations Johannes Weiner
  2019-12-13 19:21 ` [PATCH 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
  2 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2019-12-13 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to each sibling's utilized protection:

	low_usage = min(low, usage)
	elow = parent_elow * (low_usage / siblings_low_usage)

However, siblings_low_usage is not the sum of all low_usages. It sums
up the usages of *only those cgroups that are within their memory.low*.
That means that low_usage can be *bigger* than siblings_low_usage, and
consequently the total protection afforded to the children can be
bigger than what the ancestor grants the subtree.

Consider three groups where two are in excess of their protection:

  A/memory.low = 10G
  A/A1/memory.low = 10G, memory.current = 20G
  A/A2/memory.low = 10G, memory.current = 20G
  A/A3/memory.low = 10G, memory.current =  8G

  siblings_low_usage = 8G (only A3 contributes)
  A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G

The 12.5G are then capped to A1's own memory.low setting, i.e. 10G. The
same is true for A2. And A3 would also receive 10G. The combined
protection of A1, A2 and A3 is 30G, when A limits the tree to 10G.

What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
they eventually drop below their protection setting, they would be
counted in siblings_low_usage again and the error would right itself.

When reclaim is applied in a binary fashion - cgroup is reclaimed when
it's above its protection, otherwise it's skipped - this could
actually work out just fine - although it's not quite clear to me why
we'd introduce this error in the first place. However, since
1bc63fb1272b ("mm, memcg: make scan aggression always exclude
protection"), reclaim pressure is scaled to how much a cgroup is above
its protection. As a result this calculation error unduly skews
pressure away from A1 and A2 toward the rest of the system.

Fix this by making siblings_low_usage the sum of all protected
memory among siblings, including those that are in excess of their
protection.
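
For illustration, with this change the overcommitted example above
works out to approximately:

  siblings_low_usage = 28G
  A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
  A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
  A3/elow = parent_elow(10G) * low_usage(8G)  / siblings_low_usage(28G) = 2.8G

i.e. the combined protection no longer exceeds the 10G that A grants
the subtree.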

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c   |  4 +---
 mm/page_counter.c | 12 ++----------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..874a0b00f89b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6236,9 +6236,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * elow = min( memory.low, parent->elow * ------------------ ),
  *                                        siblings_low_usage
  *
- *             | memory.current, if memory.current < memory.low
- * low_usage = |
- *	       | 0, otherwise.
+ * low_usage = min(memory.low, memory.current)
  *
  *
  * Such definition of the effective memory.low provides the expected
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
 		return;
 
 	if (c->min || atomic_long_read(&c->min_usage)) {
-		if (usage <= c->min)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->min);
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
 	}
 
 	if (c->low || atomic_long_read(&c->low_usage)) {
-		if (usage <= c->low)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->low);
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.24.0



* [PATCH 2/3] mm: memcontrol: clean up and document effective low/min calculations
  2019-12-13 19:21 [PATCH 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
  2019-12-13 19:21 ` [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
@ 2019-12-13 19:21 ` Johannes Weiner
  2019-12-13 19:21 ` [PATCH 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
  2 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2019-12-13 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

The effective protection of any given cgroup is a somewhat complicated
construct that depends on the ancestor's configuration, siblings'
configurations, as well as current memory utilization in all these
groups. It's done this way to satisfy hierarchical delegation
requirements while also making the configuration semantics flexible
and expressive in complex real life scenarios.

Unfortunately, all the rules and requirements are sparsely documented,
and the code is a little too clever in merging different scenarios
into a single min() expression. This makes it hard to reason about the
implementation and avoid breaking semantics when making changes to it.

This patch documents each semantic rule individually and splits out
the handling of the overcommit case from the regular case.
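
Condensed, the rule the new helper captures looks roughly like this
(a sketch only; the full, commented version is in the diff below):

	static unsigned long effective_protection_sketch(unsigned long usage,
							 unsigned long setting,
							 unsigned long parent_effective,
							 unsigned long siblings_protected)
	{
		/* a cgroup can only protect what it declares AND uses */
		unsigned long protected = min(usage, setting);

		/* overcommitted: share the parent's budget proportionally */
		if (siblings_protected > parent_effective)
			return protected * parent_effective / siblings_protected;

		/* undercommitted: whatever is claimed and used is protected */
		return protected;
	}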

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 191 ++++++++++++++++++++++++++----------------------
 1 file changed, 105 insertions(+), 86 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 874a0b00f89b..ac9a3a170bec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6204,65 +6204,55 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.early_init = 0,
 };
 
-/**
- * mem_cgroup_protected - check if memory consumption is in the normal range
- * @root: the top ancestor of the sub-tree being checked
- * @memcg: the memory cgroup to check
- *
- * WARNING: This function is not stateless! It can only be used as part
- *          of a top-down tree iteration, not for isolated queries.
- *
- * Returns one of the following:
- *   MEMCG_PROT_NONE: cgroup memory is not protected
- *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
- *     an unprotected supply of reclaimable memory from other cgroups.
- *   MEMCG_PROT_MIN: cgroup memory is protected
- *
- * @root is exclusive; it is never protected when looked at directly
+/*
+ * This function calculates an individual cgroup's effective
+ * protection which is derived from its own memory.min/low, its
+ * parent's and siblings' settings, as well as the actual memory
+ * distribution in the tree.
  *
- * To provide a proper hierarchical behavior, effective memory.min/low values
- * are used. Below is the description of how effective memory.low is calculated.
- * Effective memory.min values is calculated in the same way.
+ * The following rules apply to the effective protection values:
  *
- * Effective memory.low is always equal or less than the original memory.low.
- * If there is no memory.low overcommittment (which is always true for
- * top-level memory cgroups), these two values are equal.
- * Otherwise, it's a part of parent's effective memory.low,
- * calculated as a cgroup's memory.low usage divided by sum of sibling's
- * memory.low usages, where memory.low usage is the size of actually
- * protected memory.
+ * 1. At the first level of reclaim, effective protection is equal to
+ *    the declared protection in memory.min and memory.low.
  *
- *                                             low_usage
- * elow = min( memory.low, parent->elow * ------------------ ),
- *                                        siblings_low_usage
+ * 2. To enable safe delegation of the protection configuration, at
+ *    subsequent levels the effective protection is capped to the
+ *    parent's effective protection.
  *
- * low_usage = min(memory.low, memory.current)
+ * 3. To make complex and dynamic subtrees easier to configure, the
+ *    user is allowed to overcommit the declared protection at a given
+ *    level. If that is the case, the parent's effective protection is
+ *    distributed to the children in proportion to how much protection
+ *    they have declared and how much of it they are utilizing.
  *
+ *    This makes distribution proportional, but also work-conserving:
+ *    if one cgroup claims much more protection than it uses memory,
+ *    the unused remainder is available to its siblings.
  *
- * Such definition of the effective memory.low provides the expected
- * hierarchical behavior: parent's memory.low value is limiting
- * children, unprotected memory is reclaimed first and cgroups,
- * which are not using their guarantee do not affect actual memory
- * distribution.
+ *    Consider the following example tree:
  *
- * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
+ *        A      A/memory.low = 2G, A/memory.current = 6G
+ *       //\\
+ *      BC  DE   B/memory.low = 3G  B/memory.current = 2G
+ *               C/memory.low = 1G  C/memory.current = 2G
+ *               D/memory.low = 0   D/memory.current = 2G
+ *               E/memory.low = 10G E/memory.current = 0
  *
- *     A      A/memory.low = 2G, A/memory.current = 6G
- *    //\\
- *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
- *            C/memory.low = 1G  C/memory.current = 2G
- *            D/memory.low = 0   D/memory.current = 2G
- *            E/memory.low = 10G E/memory.current = 0
+ *    and memory pressure is applied, the following memory
+ *    distribution is expected (approximately*):
  *
- * and the memory pressure is applied, the following memory distribution
- * is expected (approximately):
+ *      A/memory.current = 2G
+ *      B/memory.current = 1.3G
+ *      C/memory.current = 0.6G
+ *      D/memory.current = 0
+ *      E/memory.current = 0
  *
- *     A/memory.current = 2G
+ *    *assuming equal allocation rate and reclaimability
  *
- *     B/memory.current = 1.3G
- *     C/memory.current = 0.6G
- *     D/memory.current = 0
- *     E/memory.current = 0
+ * 4. Conversely, when the declared protection is undercommitted at a
+ *    given level, the distribution of the larger parental protection
+ *    budget is NOT proportional. A cgroup's protection from a sibling
+ *    is capped to its own memory.min/low setting.
  *
  * These calculations require constant tracking of the actual low usages
  * (see propagate_protected_usage()), as well as recursive calculation of
@@ -6272,12 +6262,64 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * for next usage. This part is intentionally racy, but it's ok,
  * as memory.low is a best-effort mechanism.
  */
+static unsigned long effective_protection(unsigned long usage,
+					  unsigned long setting,
+					  unsigned long parent_effective,
+					  unsigned long siblings_protected)
+{
+	unsigned long protected;
+
+	protected = min(usage, setting);
+	/*
+	 * If all cgroups at this level combined claim and use more
+	 * protection then what the parent affords them, distribute
+	 * shares in proportion to utilization.
+	 *
+	 * We are using actual utilization rather than the statically
+	 * claimed protection in order to be work-conserving: claimed
+	 * but unused protection is available to siblings that would
+	 * otherwise get a smaller chunk than what they claimed.
+	 */
+	if (siblings_protected > parent_effective)
+		return protected * parent_effective / siblings_protected;
+
+	/*
+	 * Ok, utilized protection of all children is within what the
+	 * parent affords them, so we know whatever this child claims
+	 * and utilizes is effectively protected.
+	 *
+	 * If there is unprotected usage beyond this value, reclaim
+	 * will apply pressure in proportion to that amount.
+	 *
+	 * If there is unutilized protection, the cgroup will be fully
+	 * shielded from reclaim, but we do return a smaller value for
+	 * protection than what the group could enjoy in theory. This
+	 * is okay. With the overcommit distribution above, effective
+	 * protection is always dependent on how memory is actually
+	 * consumed among the siblings anyway.
+	 */
+	return protected;
+}
+
+/**
+ * mem_cgroup_protected - check if memory consumption is in the normal range
+ * @root: the top ancestor of the sub-tree being checked
+ * @memcg: the memory cgroup to check
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ *          of a top-down tree iteration, not for isolated queries.
+ *
+ * Returns one of the following:
+ *   MEMCG_PROT_NONE: cgroup memory is not protected
+ *   MEMCG_PROT_LOW: cgroup memory is protected as long there is
+ *     an unprotected supply of reclaimable memory from other cgroups.
+ *   MEMCG_PROT_MIN: cgroup memory is protected
+ */
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *parent;
-	unsigned long emin, parent_emin;
-	unsigned long elow, parent_elow;
+	unsigned long emin, elow;
 	unsigned long usage;
 
 	if (mem_cgroup_disabled())
@@ -6292,52 +6334,29 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	if (!usage)
 		return MEMCG_PROT_NONE;
 
-	emin = memcg->memory.min;
-	elow = memcg->memory.low;
-
 	parent = parent_mem_cgroup(memcg);
 	/* No parent means a non-hierarchical mode on v1 memcg */
 	if (!parent)
 		return MEMCG_PROT_NONE;
 
-	if (parent == root)
-		goto exit;
-
-	parent_emin = READ_ONCE(parent->memory.emin);
-	emin = min(emin, parent_emin);
-	if (emin && parent_emin) {
-		unsigned long min_usage, siblings_min_usage;
-
-		min_usage = min(usage, memcg->memory.min);
-		siblings_min_usage = atomic_long_read(
-			&parent->memory.children_min_usage);
-
-		if (min_usage && siblings_min_usage)
-			emin = min(emin, parent_emin * min_usage /
-				   siblings_min_usage);
+	if (parent == root) {
+		memcg->memory.emin = memcg->memory.min;
+		memcg->memory.elow = memcg->memory.low;
+		goto out;
 	}
 
-	parent_elow = READ_ONCE(parent->memory.elow);
-	elow = min(elow, parent_elow);
-	if (elow && parent_elow) {
-		unsigned long low_usage, siblings_low_usage;
-
-		low_usage = min(usage, memcg->memory.low);
-		siblings_low_usage = atomic_long_read(
-			&parent->memory.children_low_usage);
+	memcg->memory.emin = effective_protection(usage,
+			memcg->memory.min, READ_ONCE(parent->memory.emin),
+			atomic_long_read(&parent->memory.children_min_usage));
 
-		if (low_usage && siblings_low_usage)
-			elow = min(elow, parent_elow * low_usage /
-				   siblings_low_usage);
-	}
-
-exit:
-	memcg->memory.emin = emin;
-	memcg->memory.elow = elow;
+	memcg->memory.elow = effective_protection(usage,
+			memcg->memory.low, READ_ONCE(parent->memory.elow),
+			atomic_long_read(&parent->memory.children_low_usage));
 
-	if (usage <= emin)
+out:
+	if (usage <= memcg->memory.emin)
 		return MEMCG_PROT_MIN;
-	else if (usage <= elow)
+	else if (usage <= memcg->memory.elow)
 		return MEMCG_PROT_LOW;
 	else
 		return MEMCG_PROT_NONE;
-- 
2.24.0



* [PATCH 3/3] mm: memcontrol: recursive memory.low protection
  2019-12-13 19:21 [PATCH 0/3] mm: memcontrol: recursive memory protection Johannes Weiner
  2019-12-13 19:21 ` [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
  2019-12-13 19:21 ` [PATCH 2/3] mm: memcontrol: clean up and document effective low/min calculations Johannes Weiner
@ 2019-12-13 19:21 ` Johannes Weiner
  2019-12-13 20:05   ` Johannes Weiner
  2 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2019-12-13 19:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

Right now, the effective protection of any given cgroup is capped by
its own explicit memory.low setting, regardless of what the parent
says. The reasons for this are mostly historical and ease of
implementation: to make delegation of memory.low safe, effective
protection is the min() of all memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an
entire subtree from another without forcing the user to make explicit
protection allocations all the way to the leaf cgroups - something
that is highly undesirable in real life scenarios.

Consider memory in a data center node. At the cgroup top level, we
have a distinction between system management software and the actual
workload the system is executing. Both branches are further subdivided
into individual services, job components etc.

We want to protect the workload as a whole from the system management
software, but we don't want to protect individual workload components
from each other! Their memory demand can vary over time, and we want
the VM to simply cache the hottest data within the workload subtree.
Yet, the current memory.low limitations force us to hard-allocate
protection to each workload cgroup in order to get any protection from
system management software. This is basically useless in practice.

This patch adds the concept of recursive protection to the memory.low
configurable, while retaining the ability to assign fixed protection in
leaf groups as well.

That means that if protection is explicitly allocated among siblings,
those configured weights are being followed during page reclaim just
like they are now.

However, if the memory.low set at a higher level is not fully claimed
by the children in that subtree, that "floating" protection is applied
to each cgroup in the tree in proportion to its size. Since reclaim
pressure is applied in proportion to size as well, each child in that
tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands
optimally within the subtree. But collectively those cgroups enjoy a
boost over the cgroups in neighboring trees.

This allows us to recursively protect one subtree (workload) from
another (system management), but let subgroups compete freely among
each other without having to assign fixed weights to each leaf.

This floating protection composes with fixed protection. Consider the
following example tree:

		A            A: low = 2G
               / \          A1: low = 1G
              A1 A2         A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is
split evenly among A1 and A2. Assuming equal memory demand in both,
memory usage will converge on A1 using 1.5G and A2 using 0.5G.
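
A sketch of the arithmetic behind that convergence, assuming A's
effective low is the full 2G, nothing is charged to A directly
(parent_usage = 2G) and only A1's claim counts toward
siblings_low_usage (1G):

  A1: floating = (2G - 1G) * (1.5G - 1G) / (2G - 1G) = 0.5G, elow = 1G + 0.5G = 1.5G
  A2: floating = (2G - 1G) * (0.5G - 0G) / (2G - 1G) = 0.5G, elow = 0G + 0.5G = 0.5G

Both children sit exactly at their effective protection, so that
split is stable.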

There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly
high "bypass" values that are meaningfully allocated down the
tree. Such setups would rely on unclaimed protection to be discarded,
and distributing it would change the intended behavior. Be safe and
hide the new behavior behind a mount option, 'memory_recursiveprot'.
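
To opt in, the option can be passed at mount time or, like the other
cgroup2 mount options, changed from the init namespace via remount,
e.g. (hypothetical invocation, assuming cgroup2 is mounted at
/sys/fs/cgroup):

  $ mount -o remount,memory_recursiveprot none /sys/fs/cgroup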

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 11 +++++
 include/linux/cgroup-defs.h             |  5 ++
 kernel/cgroup/cgroup.c                  | 17 ++++++-
 mm/memcontrol.c                         | 62 +++++++++++++++++++++++--
 4 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0636bcb60b5a..e569d83621da 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -186,6 +186,17 @@ cgroup v2 currently supports the following mount options.
         modified through remount from the init namespace. The mount
         option is ignored on non-init namespace mounts.
 
+  memory_recursiveprot
+
+        Recursively apply memory.min and memory.low protection to
+        entire subtrees, without requiring explicit downward
+        propagation into leaf cgroups.  This allows protecting entire
+        subtrees from one another, while retaining free competition
+        within those subtrees.  This should have been the default
+        behavior but is a mount-option to avoid regressing setups
+        relying on the original semantics (e.g. specifying bogusly
+        high 'bypass' protection values at higher tree levels).
+
 
 Organizing Processes and Threads
 --------------------------------
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 63097cb243cb..e1fafed22db1 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -94,6 +94,11 @@ enum {
 	 * Enable legacy local memory.events.
 	 */
 	CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5),
+
+	/*
+	 * Enable recursive subtree protection
+	 */
+	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 735af8f15f95..a2f8d2ab8dec 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1813,12 +1813,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 enum cgroup2_param {
 	Opt_nsdelegate,
 	Opt_memory_localevents,
+	Opt_memory_recursiveprot,
 	nr__cgroup2_params
 };
 
 static const struct fs_parameter_spec cgroup2_param_specs[] = {
 	fsparam_flag("nsdelegate",		Opt_nsdelegate),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
+	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
 	{}
 };
 
@@ -1844,6 +1846,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
 	case Opt_memory_localevents:
 		ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		return 0;
+	case Opt_memory_recursiveprot:
+		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+		return 0;
 	}
 	return -EINVAL;
 }
@@ -1860,6 +1865,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
+
+		if (root_flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
+			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 	}
 }
 
@@ -1869,6 +1879,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 		seq_puts(seq, ",nsdelegate");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
 		seq_puts(seq, ",memory_localevents");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
+		seq_puts(seq, ",memory_recursiveprot");
 	return 0;
 }
 
@@ -6364,7 +6376,10 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n");
+	return snprintf(buf, PAGE_SIZE,
+			"nsdelegate\n"
+			"memory_localevents\n"
+			"memory_recursiveprot\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac9a3a170bec..2e352cd6c38d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6254,6 +6254,32 @@ struct cgroup_subsys memory_cgrp_subsys = {
  *    budget is NOT proportional. A cgroup's protection from a sibling
  *    is capped to its own memory.min/low setting.
  *
+ * 5. However, to allow protecting recursive subtrees from each other
+ *    without having to declare each individual cgroup's fixed share
+ *    of the ancestor's claim to protection, any unutilized -
+ *    "floating" - protection from up the tree is distributed in
+ *    proportion to each cgroup's *usage*. This makes the protection
+ *    neutral wrt sibling cgroups and lets them compete freely over
+ *    the shared parental protection budget, but it protects the
+ *    subtree as a whole from neighboring subtrees.
+ *
+ *    Consider the following example tree:
+ *
+ *        A            A: low = 2G
+ *       / \           B: low = 1G
+ *      B   C          C: low = 0G
+ *
+ *    As memory pressure is applied, the following memory distribution
+ *    is expected (approximately):
+ *
+ *      A/memory.current = 2G
+ *      B/memory.current = 1.5G
+ *      C/memory.current = 0.5G
+ *
+ * Note that 4. and 5. are not in conflict: 4. is about protecting
+ * against immediate siblings whereas 5. is about protecting against
+ * neighboring subtrees.
+ *
  * These calculations require constant tracking of the actual low usages
  * (see propagate_protected_usage()), as well as recursive calculation of
  * effective memory.low values. But as we do call mem_cgroup_protected()
@@ -6263,11 +6289,13 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * as memory.low is a best-effort mechanism.
  */
 static unsigned long effective_protection(unsigned long usage,
+					  unsigned long parent_usage,
 					  unsigned long setting,
 					  unsigned long parent_effective,
 					  unsigned long siblings_protected)
 {
 	unsigned long protected;
+	unsigned long ep;
 
 	protected = min(usage, setting);
 	/*
@@ -6298,7 +6326,31 @@ static unsigned long effective_protection(unsigned long usage,
 	 * protection is always dependent on how memory is actually
 	 * consumed among the siblings anyway.
 	 */
-	return protected;
+	ep = protected;
+
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) {
+		unsigned long unclaimed;
+		/*
+		 * If the children aren't claiming (all of) the
+		 * protection afforded to them by the parent,
+		 * distribute the remainder in proportion to the
+		 * (unprotected) size of each cgroup. That way,
+		 * cgroups that aren't explicitly prioritized wrt each
+		 * other compete freely over the allowance, but they
+		 * are collectively protected from neighboring trees.
+		 *
+		 * We're using unprotected size for the weight so that
+		 * if some cgroups DO claim explicit protection, we
+		 * don't protect the same bytes twice.
+		 */
+		unclaimed = parent_effective - siblings_protected;
+		unclaimed *= usage - protected;
+		unclaimed /= parent_usage - siblings_protected;
+
+		ep += unclaimed;
+	}
+
+	return ep;
 }
 
 /**
@@ -6318,9 +6370,9 @@ static unsigned long effective_protection(unsigned long usage,
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg)
 {
+	unsigned long usage, parent_usage;
 	struct mem_cgroup *parent;
 	unsigned long emin, elow;
-	unsigned long usage;
 
 	if (mem_cgroup_disabled())
 		return MEMCG_PROT_NONE;
@@ -6345,11 +6397,13 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 		goto out;
 	}
 
-	memcg->memory.emin = effective_protection(usage,
+	parent_usage = page_counter_read(&parent->memory);
+
+	memcg->memory.emin = effective_protection(usage, parent_usage,
 			memcg->memory.min, READ_ONCE(parent->memory.emin),
 			atomic_long_read(&parent->memory.children_min_usage));
 
-	memcg->memory.elow = effective_protection(usage,
+	memcg->memory.elow = effective_protection(usage, parent_usage,
 			memcg->memory.low, READ_ONCE(parent->memory.elow),
 			atomic_long_read(&parent->memory.children_low_usage));
 
-- 
2.24.0



* Re: [PATCH 3/3] mm: memcontrol: recursive memory.low protection
  2019-12-13 19:21 ` [PATCH 3/3] mm: memcontrol: recursive memory.low protection Johannes Weiner
@ 2019-12-13 20:05   ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2019-12-13 20:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Fri, Dec 13, 2019 at 02:21:58PM -0500, Johannes Weiner wrote:
> +	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) {
> +		unsigned long unclaimed;
> +		/*
> +		 * If the children aren't claiming (all of) the
> +		 * protection afforded to them by the parent,
> +		 * distribute the remainder in proportion to the
> +		 * (unprotected) size of each cgroup. That way,
> +		 * cgroups that aren't explicitly prioritized wrt each
> +		 * other compete freely over the allowance, but they
> +		 * are collectively protected from neighboring trees.
> +		 *
> +		 * We're using unprotected size for the weight so that
> +		 * if some cgroups DO claim explicit protection, we
> +		 * don't protect the same bytes twice.
> +		 */
> +		unclaimed = parent_effective - siblings_protected;
> +		unclaimed *= usage - protected;
> +		unclaimed /= parent_usage - siblings_protected;

Brainfart I noticed just after sending it out - naturally. If there is
unclaimed protection in the parent, but the children use exactly how
much they claim, this will div0. We have to check for usage that isn't
explicitly protected in the child to which to apply the float. Fixlet
below. Doesn't change the overall logic, though.
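
A minimal sketch of the failing case, with hypothetical numbers:
A/memory.low = 2G with children B and C, B/memory.low = 1G and B
using exactly 1G, C/memory.low = 0.5G and C using exactly 0.5G. Then
siblings_protected = 1.5G and (assuming nothing is charged to A
directly) parent_usage = 1.5G, so 0.5G of A's protection is
unclaimed, but the divisor parent_usage - siblings_protected is 0.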

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2e352cd6c38d..8d7e9490740b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6328,21 +6328,24 @@ static unsigned long effective_protection(unsigned long usage,
 	 */
 	ep = protected;
 
-	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) {
+	/*
+	 * If the children aren't claiming (all of) the protection
+	 * afforded to them by the parent, distribute the remainder in
+	 * proportion to the (unprotected) size of each cgroup. That
+	 * way, cgroups that aren't explicitly prioritized wrt each
+	 * other compete freely over the allowance, but they are
+	 * collectively protected from neighboring trees.
+	 *
+	 * We're using unprotected size for the weight so that if some
+	 * cgroups DO claim explicit protection, we don't protect the
+	 * same bytes twice.
+	 */
+	if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT))
+		return ep;
+
+	if (usage > protected && parent_effective > siblings_protected) {
 		unsigned long unclaimed;
-		/*
-		 * If the children aren't claiming (all of) the
-		 * protection afforded to them by the parent,
-		 * distribute the remainder in proportion to the
-		 * (unprotected) size of each cgroup. That way,
-		 * cgroups that aren't explicitly prioritized wrt each
-		 * other compete freely over the allowance, but they
-		 * are collectively protected from neighboring trees.
-		 *
-		 * We're using unprotected size for the weight so that
-		 * if some cgroups DO claim explicit protection, we
-		 * don't protect the same bytes twice.
-		 */
+
 		unclaimed = parent_effective - siblings_protected;
 		unclaimed *= usage - protected;
 		unclaimed /= parent_usage - siblings_protected;


* Re: [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution
  2019-12-13 19:21 ` [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution Johannes Weiner
@ 2019-12-13 20:40   ` Roman Gushchin
  2019-12-16 18:25     ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Roman Gushchin @ 2019-12-13 20:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Kernel Team

On Fri, Dec 13, 2019 at 02:21:56PM -0500, Johannes Weiner wrote:
> When memory.low is overcommitted - i.e. the children claim more
> protection than their shared ancestor grants them - the allowance is
> distributed in proportion to each siblings's utilized protection:
> 
> 	low_usage = min(low, usage)
> 	elow = parent_elow * (low_usage / siblings_low_usage)
> 
> However, siblings_low_usage is not the sum of all low_usages. It sums
> up the usages of *only those cgroups that are within their memory.low*
> That means that low_usage can be *bigger* than siblings_low_usage, and
> consequently the total protection afforded to the children can be
> bigger than what the ancestor grants the subtree.
> 
> Consider three groups where two are in excess of their protection:
> 
>   A/memory.low = 10G
>   A/A1/memory.low = 10G, A/memory.current = 20G
>   A/A2/memory.low = 10G, B/memory.current = 20G
>   A/A3/memory.low = 10G, C/memory.current =  8G
> 
>   siblings_low_usage = 8G (only A3 contributes)
>   A1/elow = parent_elow(10G) * low_usage(20G) / siblings_low_usage(8G) = 25G
> 
> The 25G are then capped to A1's own memory.low setting, i.e. 10G. The
> same is true for A2. And A3 would also receive 10G. The combined
> protection of A1, A2 and A3 is 30G, when A limits the tree to 10G.
> 
> What does this mean in practice? A1 and A2 would still be in excess of
> their 10G allowance and would be reclaimed, whereas A3 would not. As
> they eventually drop below their protection setting, they would be
> counted in siblings_low_usage again and the error would right itself.
> 
> When reclaim is applied in a binary fashion - cgroup is reclaimed when
> it's above its protection, otherwise it's skipped - this could work
> actually work out just fine - although it's not quite clear to me why
> we'd introduce this error in the first place.

This complication is not simply an error; it protects cgroups under
their low limits if there is unprotected memory.

So, here is an example:

      A      A/memory.low = 2G, A/memory.current = 4G
     / \
    B   C    B/memory.low = 3G  B/memory.current = 2G
             C/memory.low = 1G  C/memory.current = 2G

as now:

B/elow = 2G * 2G / 2G = 2G == B/memory.current
C/elow = 2G * 1G / 2G = 1G < C/memory.current

with this fix:

B/elow = 2G * 2G / 3G = 4/3 G < B/memory.current
C/elow = 2G * 1G / 3G = 2/3 G < C/memory.current

So in other words, currently B won't be scanned at all, because
there is 1G of unprotected memory in C. With your patch both B and C
will be scanned.

> However, since
> 1bc63fb1272b ("mm, memcg: make scan aggression always exclude
> protection"), reclaim pressure is scaled to how much a cgroup is above
> its protection. As a result this calculation error unduly skews
> pressure away from A1 and A2 toward the rest of the system.

It could be that with 1bc63fb1272b the target memory distribution
will be fine. However the patch will change the memory pressure in B and C
(in the example above). Maybe it's ok, but at least it should be discussed
and documented.

> 
> Fix this by by making siblings_low_usage the sum of all protected
> memory among siblings, including those that are in excess of their
> protection.

So I'm afraid the fix needs to be more complex. I need to think a bit more about
it.

Thanks!


* Re: [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution
  2019-12-13 20:40   ` Roman Gushchin
@ 2019-12-16 18:25     ` Johannes Weiner
  2019-12-16 19:11       ` Roman Gushchin
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Weiner @ 2019-12-16 18:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Kernel Team

On Fri, Dec 13, 2019 at 08:40:31PM +0000, Roman Gushchin wrote:
> On Fri, Dec 13, 2019 at 02:21:56PM -0500, Johannes Weiner wrote:
> > When memory.low is overcommitted - i.e. the children claim more
> > protection than their shared ancestor grants them - the allowance is
> > distributed in proportion to each siblings's utilized protection:
> > 
> > 	low_usage = min(low, usage)
> > 	elow = parent_elow * (low_usage / siblings_low_usage)
> > 
> > However, siblings_low_usage is not the sum of all low_usages. It sums
> > up the usages of *only those cgroups that are within their memory.low*
> > That means that low_usage can be *bigger* than siblings_low_usage, and
> > consequently the total protection afforded to the children can be
> > bigger than what the ancestor grants the subtree.
> > 
> > Consider three groups where two are in excess of their protection:
> > 
> >   A/memory.low = 10G
> >   A/A1/memory.low = 10G, A/memory.current = 20G
> >   A/A2/memory.low = 10G, B/memory.current = 20G
> >   A/A3/memory.low = 10G, C/memory.current =  8G
> > 
> >   siblings_low_usage = 8G (only A3 contributes)
> >   A1/elow = parent_elow(10G) * low_usage(20G) / siblings_low_usage(8G) = 25G
> > 
> > The 25G are then capped to A1's own memory.low setting, i.e. 10G. The
> > same is true for A2. And A3 would also receive 10G. The combined
> > protection of A1, A2 and A3 is 30G, when A limits the tree to 10G.
> > 
> > What does this mean in practice? A1 and A2 would still be in excess of
> > their 10G allowance and would be reclaimed, whereas A3 would not. As
> > they eventually drop below their protection setting, they would be
> > counted in siblings_low_usage again and the error would right itself.
> > 
> > When reclaim is applied in a binary fashion - cgroup is reclaimed when
> > it's above its protection, otherwise it's skipped - this could work
> > actually work out just fine - although it's not quite clear to me why
> > we'd introduce this error in the first place.
> 
> This complication is not simple an error, it protects cgroups under
> their low limits if there is unprotected memory.
> 
> So, here is an example:
> 
>       A      A/memory.low = 2G, A/memory.current = 4G
>      / \
>     B   C    B/memory.low = 3G  B/memory.current = 2G
>              C/memory.low = 1G  C/memory.current = 2G
> 
> as now:
> 
> B/elow = 2G * 2G / 2G = 2G == B/memory.current
> C/elow = 2G * 1G / 2G = 1G < C/memory.current
> 
> with this fix:
> 
> B/elow = 2G * 2G / 3G = 4/3 G < B/memory.current
> C/elow = 2G * 1G / 3G = 2/3 G < C/memory.current
> 
> So in other words, currently B won't be scanned at all, because
> there is 1G of unprotected memory in C. With your patch both B and C
> will be scanned.

Looking at the B and C numbers alone: C is bigger than what it claims
for protection and B is smaller than what it claims for protection.

However, A doesn't provide 4G to its children. It provides 2G to be
distributed between the two. So how can B claim 3G and be exempted
from reclaim?

But more importantly, it isn't exempted in either case! The end result is the
same in both implementations. Because as soon as C is reclaimed down
to below 1G, A is still in excess of its memory.low (because it's
overcommitted!), and they both will be reclaimed proportionally.

From the example in the current code:

 * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
 *
 *     A      A/memory.low = 2G, A/memory.current = 6G
 *    //\\
 *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
 *            C/memory.low = 1G  C/memory.current = 2G
 *            D/memory.low = 0   D/memory.current = 2G
 *            E/memory.low = 10G E/memory.current = 0
 *
 * and the memory pressure is applied, the following memory distribution
 * is expected (approximately):
 *
 *     A/memory.current = 2G
 *
 *     B/memory.current = 1.3G
 *     C/memory.current = 0.6G
 *     D/memory.current = 0
 *     E/memory.current = 0

Even though B starts out within whatever it claims to be its
protection, A is overcommitted and so B and C converge on their
proportional share of the parent's allowance.

So to go back to the example chosen above:

>       A      A/memory.low = 2G, A/memory.current = 4G
>      / \
>     B   C    B/memory.low = 3G  B/memory.current = 2G
>              C/memory.low = 1G  C/memory.current = 2G

With either implementation we'd expect the distribution to be about
1.5G and 0.5G for B and C, respectively.

And they'd have to be, too. Otherwise the semantics would be
completely unpredictable to anyone trying to configure this.

So I think mixing proportional distribution with absolute thresholds
like this makes the implementation unnecessarily hard to reason
about. It's also clearly buggy as pointed out in the changelog.

> > However, since
> > 1bc63fb1272b ("mm, memcg: make scan aggression always exclude
> > protection"), reclaim pressure is scaled to how much a cgroup is above
> > its protection. As a result this calculation error unduly skews
> > pressure away from A1 and A2 toward the rest of the system.
> 
> It could be that with 1bc63fb1272b the target memory distribution
> will be fine. However the patch will change the memory pressure in B and C
> (in the example above). Maybe it's ok, but at least it should be discussed
> and documented.

I'll try to improve the changelog based on this, thanks for filling in
the original motivation. But I do think it's a change we want to make.


* Re: [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution
  2019-12-16 18:25     ` Johannes Weiner
@ 2019-12-16 19:11       ` Roman Gushchin
  0 siblings, 0 replies; 9+ messages in thread
From: Roman Gushchin @ 2019-12-16 19:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Kernel Team

On Mon, Dec 16, 2019 at 01:25:18PM -0500, Johannes Weiner wrote:
> On Fri, Dec 13, 2019 at 08:40:31PM +0000, Roman Gushchin wrote:
> > On Fri, Dec 13, 2019 at 02:21:56PM -0500, Johannes Weiner wrote:
> > > When memory.low is overcommitted - i.e. the children claim more
> > > protection than their shared ancestor grants them - the allowance is
> > > distributed in proportion to each siblings's utilized protection:
> > > 
> > > 	low_usage = min(low, usage)
> > > 	elow = parent_elow * (low_usage / siblings_low_usage)
> > > 
> > > However, siblings_low_usage is not the sum of all low_usages. It sums
> > > up the usages of *only those cgroups that are within their memory.low*
> > > That means that low_usage can be *bigger* than siblings_low_usage, and
> > > consequently the total protection afforded to the children can be
> > > bigger than what the ancestor grants the subtree.
> > > 
> > > Consider three groups where two are in excess of their protection:
> > > 
> > >   A/memory.low = 10G
> > >   A/A1/memory.low = 10G, A/memory.current = 20G
> > >   A/A2/memory.low = 10G, B/memory.current = 20G
> > >   A/A3/memory.low = 10G, C/memory.current =  8G
> > > 
> > >   siblings_low_usage = 8G (only A3 contributes)
> > >   A1/elow = parent_elow(10G) * low_usage(20G) / siblings_low_usage(8G) = 25G
> > > 
> > > The 25G are then capped to A1's own memory.low setting, i.e. 10G. The
> > > same is true for A2. And A3 would also receive 10G. The combined
> > > protection of A1, A2 and A3 is 30G, when A limits the tree to 10G.
> > > 
> > > What does this mean in practice? A1 and A2 would still be in excess of
> > > their 10G allowance and would be reclaimed, whereas A3 would not. As
> > > they eventually drop below their protection setting, they would be
> > > counted in siblings_low_usage again and the error would right itself.
> > > 
> > > When reclaim is applied in a binary fashion - cgroup is reclaimed when
> > > it's above its protection, otherwise it's skipped - this could work
> > > actually work out just fine - although it's not quite clear to me why
> > > we'd introduce this error in the first place.
> > 
> > This complication is not simple an error, it protects cgroups under
> > their low limits if there is unprotected memory.
> > 
> > So, here is an example:
> > 
> >       A      A/memory.low = 2G, A/memory.current = 4G
> >      / \
> >     B   C    B/memory.low = 3G  B/memory.current = 2G
> >              C/memory.low = 1G  C/memory.current = 2G
> > 
> > as now:
> > 
> > B/elow = 2G * 2G / 2G = 2G == B/memory.current
> > C/elow = 2G * 1G / 2G = 1G < C/memory.current
> > 
> > with this fix:
> > 
> > B/elow = 2G * 2G / 3G = 4/3 G < B/memory.current
> > C/elow = 2G * 1G / 3G = 2/3 G < C/memory.current
> > 
> > So in other words, currently B won't be scanned at all, because
> > there is 1G of unprotected memory in C. With your patch both B and C
> > will be scanned.
> 
> Looking at the B and C numbers alone: C is bigger than what it claims
> for protection and B is smaller than what it claims for protection.
> 
> However, A doesn't provide 4G to its children. It provides 2G to be
> distributed between the two. So how can B claim 3G and be exempted
> from reclaim?

First, what if the memory pressure comes from memory.high/max set on A?

Second, it's up to the semantics we define. Looking at it from the other side:
there is clearly 1G of memory in C which is not protected no matter what.
B wants its memory to be fully protected, but it's limited by the competition
on the parent level. Now we try to satisfy B's requirements as far as we can.
Should we treat B and C equally from scratch?

I think both approaches are acceptable, but if we're switching from one option
to another, let's make it clear.

> 
> But more importantly, it isn't in either case! The end result is the
> same in both implementations. Because as soon as C is reclaimed down
> to below 1G, A is still in excess of its memory.low (because it's
> overcommitted!), and they both will be reclaimed proportionally.

I do not disagree: the introduction of proportional reclaim
made this complication (partially?) obsolete. But originally it was
required to make the target distribution correct.

> 
> From the example in the current code:
> 
>  * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
>  *
>  *     A      A/memory.low = 2G, A/memory.current = 6G
>  *    //\\
>  *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
>  *            C/memory.low = 1G  C/memory.current = 2G
>  *            D/memory.low = 0   D/memory.current = 2G
>  *            E/memory.low = 10G E/memory.current = 0
>  *
>  * and the memory pressure is applied, the following memory distribution
>  * is expected (approximately):
>  *
>  *     A/memory.current = 2G
>  *
>  *     B/memory.current = 1.3G
>  *     C/memory.current = 0.6G
>  *     D/memory.current = 0
>  *     E/memory.current = 0
> 
> Even though B starts out within whatever it claims to be its
> protection, A is overcommitted and so B and C converge on their
> proportional share of the parent's allowance.
> 
> So to go back to the example chosen above:
> 
> >       A      A/memory.low = 2G, A/memory.current = 4G
> >      / \
> >     B   C    B/memory.low = 3G  B/memory.current = 2G
> >              C/memory.low = 1G  C/memory.current = 2G
> 
> With either implementation we'd expect the distribution to be about
> 1.5G and 0.5G for B and C, respectively.
> 
> And they'd have to be, too. Otherwise the semantics would be
> completely unpredictable to anyone trying to configure this.
> 
> So I think mixing proportional distribution with absolute thresholds
> like this makes the implementation unnecessarily hard to reason
> about. It's also clearly buggy as pointed out in the changelog.
> 
> > > However, since
> > > 1bc63fb1272b ("mm, memcg: make scan aggression always exclude
> > > protection"), reclaim pressure is scaled to how much a cgroup is above
> > > its protection. As a result this calculation error unduly skews
> > > pressure away from A1 and A2 toward the rest of the system.
> > 
> > It could be that with 1bc63fb1272b the target memory distribution
> > will be fine. However the patch will change the memory pressure in B and C
> > (in the example above). Maybe it's ok, but at least it should be discussed
> > and documented.
> 
> I'll try to improve the changelog based on this, thanks for filling in
> the original motivation. But I do think it's a change we want to make.

Absolutely, I'm not against the change. I just want to make sure we will
put all details into the changelog.

Thanks!


* [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution
  2020-02-27 19:56 [PATCH 0/3] " Johannes Weiner
@ 2020-02-27 19:56 ` Johannes Weiner
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Weiner @ 2020-02-27 19:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Gushchin, Michal Hocko, Tejun Heo, Chris Down,
	Michal Koutný,
	linux-mm, cgroups, linux-kernel, kernel-team

When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to how much each sibling uses their own
declared protection:

	low_usage = min(memory.low, memory.current)
	elow = parent_elow * (low_usage / siblings_low_usage)

However, siblings_low_usage is not the sum of all low_usages. It sums
up the usages of *only those cgroups that are within their memory.low*.
That means that low_usage can be *bigger* than siblings_low_usage, and
consequently the total protection afforded to the children can be
bigger than what the ancestor grants the subtree.

Consider three groups where two are in excess of their protection:

  A/memory.low = 10G
  A/A1/memory.low = 10G, memory.current = 20G
  A/A2/memory.low = 10G, memory.current = 20G
  A/A3/memory.low = 10G, memory.current =  8G
  siblings_low_usage = 8G (only A3 contributes)

  A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
  A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
  A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G

  (the 12.5G are capped to the explicit memory.low setting of 10G)

With that, the sum of all awarded protection below A is 30G, when A
only grants 10G for the entire subtree.

What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
they eventually drop below their protection setting, they would be
counted in siblings_low_usage again and the error would right itself.

When reclaim was applied in a binary fashion (cgroup is reclaimed when
it's above its protection, otherwise it's skipped) this would actually
work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
aggression always exclude protection"), reclaim pressure is scaled to
how much a cgroup is above its protection. As a result this
calculation error unduly skews pressure away from A1 and A2 toward the
rest of the system.

But why did we do it like this in the first place?

The reasoning behind exempting groups in excess from
siblings_low_usage was to go after them first during reclaim in an
overcommitted subtree:

  A/memory.low = 2G, memory.current = 4G
  A/A1/memory.low = 3G, memory.current = 2G
  A/A2/memory.low = 1G, memory.current = 2G

  siblings_low_usage = 2G (only A1 contributes)
  A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G

While the children combined are overcommitting A and are technically
both at fault, A2 is actively declaring unprotected memory and we
would like to reclaim that first.

However, while this sounds like a noble goal on the face of it, it
doesn't make much difference in actual memory distribution: Because A
is overcommitted, reclaim will not stop once A2 gets pushed back to
within its allowance; we'll have to reclaim A1 either way. The end
result is still that protection is distributed proportionally, with A1
getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.

[ If A weren't overcommitted, it wouldn't make a difference since each
  cgroup would just get the protection it declares:

  A/memory.low = 2G, memory.current = 3G
  A/A1/memory.low = 1G, memory.current = 1G
  A/A2/memory.low = 1G, memory.current = 2G

  With the current calculation:

  siblings_low_usage = 1G (only A1 contributes)
  A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G

  Including excess groups in siblings_low_usage:

  siblings_low_usage = 2G
  A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
  A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ]

Simplify the calculation and fix the proportional reclaim bug by
including excess cgroups in siblings_low_usage.

After this patch, the effective memory.low distribution from the
example above would be as follows:

  A/memory.low = 10G
  A/A1/memory.low = 10G, memory.current = 20G
  A/A2/memory.low = 10G, memory.current = 20G
  A/A3/memory.low = 10G, memory.current =  8G
  siblings_low_usage = 28G

  A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
  A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G
  A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G

Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c   |  4 +---
 mm/page_counter.c | 12 ++----------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..874a0b00f89b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6236,9 +6236,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * elow = min( memory.low, parent->elow * ------------------ ),
  *                                        siblings_low_usage
  *
- *             | memory.current, if memory.current < memory.low
- * low_usage = |
- *	       | 0, otherwise.
+ * low_usage = min(memory.low, memory.current)
  *
  *
  * Such definition of the effective memory.low provides the expected
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
 		return;
 
 	if (c->min || atomic_long_read(&c->min_usage)) {
-		if (usage <= c->min)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->min);
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
 	}
 
 	if (c->low || atomic_long_read(&c->low_usage)) {
-		if (usage <= c->low)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->low);
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.24.1

