* [RFC] mm: memory.low hierarchical behavior
From: Roman Gushchin @ 2018-03-20 22:33 UTC
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Tejun Heo, kernel-team, cgroups, linux-kernel

This patch aims to address an issue in the current memory.low semantics
that makes it hard to use in a hierarchy where some leaf memory
cgroups are more valuable than others.

For example, there are memcgs A, A/B, A/C, A/D and A/E:

  A      A/memory.low = 2G, A/memory.current = 6G
 //\\
BC  DE   B/memory.low = 3G  B/memory.usage = 2G
         C/memory.low = 1G  C/memory.usage = 2G
         D/memory.low = 0   D/memory.usage = 2G
         E/memory.low = 10G E/memory.usage = 0

If we apply memory pressure, B, C and D are reclaimed at
the same pace while A's usage exceeds 2G.
This is obviously wrong, as B's usage is fully below B's memory.low,
and C has 1G of protection as well.
Also, A is pushed to a size below its own 2G memory.low, which is
wrong as well.

A simple bash script (provided below) can be used to reproduce
the problem. Current results are:
  A:    1430097920
  A/B:  711929856
  A/C:  717426688
  A/D:  741376
  A/E:  0

To address the issue, the concept of effective memory.low is introduced.
Effective memory.low is always less than or equal to the original
memory.low. When there is no memory.low overcommitment (and also for
top-level cgroups), these two values are equal.
Otherwise it's a part of the parent's effective memory.low, calculated
as the cgroup's memory.low usage divided by the sum of the siblings'
memory.low usages (by memory.low usage I mean the size of actually
protected memory: memory.current if memory.current < memory.low,
0 otherwise).
It's necessary to track the actual usage, because otherwise an empty
cgroup with memory.low set (A/E in my example) would affect the actual
memory distribution, which makes no sense.

Effective memory.low is always capped by the memory.low set by the
user. That means it's not possible to obtain a larger guarantee than
the memory.low set by the user, even if the corresponding part of the
parent's guarantee is larger. This matches the existing semantics.
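
For illustration, here is a minimal userspace model of this
calculation (names like min_ul(), low_usage() and effective_low() are
illustrative, not the kernel's; it works in bytes, while the kernel
works in pages, and it ignores the racy caching described below):

  #include <stdio.h>

  static unsigned long min_ul(unsigned long a, unsigned long b)
  {
          return a < b ? a : b;
  }

  /* memory.low usage: only fully protected memory counts */
  static unsigned long low_usage(unsigned long usage, unsigned long low)
  {
          return usage < low ? usage : 0;
  }

  static unsigned long effective_low(unsigned long low,
                                     unsigned long parent_elow,
                                     unsigned long my_low_usage,
                                     unsigned long siblings_low_usage)
  {
          /* capped by own memory.low and by the parent's effective value */
          unsigned long elow = min_ul(low, parent_elow);

          /* scaled by this cgroup's share of the siblings' protected memory */
          if (elow && my_low_usage && siblings_low_usage)
                  elow = min_ul(elow, parent_elow * my_low_usage /
                                      siblings_low_usage);
          return elow;
  }

  int main(void)
  {
          const unsigned long G = 1UL << 30;

          /* the example above: only B's usage is fully below its
           * memory.low, so only B contributes to A's children sum */
          unsigned long siblings = low_usage(2 * G, 3 * G) +  /* B: 2G */
                                   low_usage(2 * G, 1 * G) +  /* C: 0 */
                                   low_usage(2 * G, 0) +      /* D: 0 */
                                   low_usage(0, 10 * G);      /* E: 0 */

          /* prints 2: B inherits the whole parental guarantee */
          printf("B/elow = %luG\n", effective_low(3 * G, 2 * G,
                 low_usage(2 * G, 3 * G), siblings) / G);

          /* prints 1, but C's usage (2G) exceeds it, so C's memory is
           * reclaimed first */
          printf("C/elow = %luG\n", effective_low(1 * G, 2 * G,
                 low_usage(2 * G, 1 * G), siblings) / G);

          return 0;
  }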

Calculating effective memory.low can be done in the reclaim path, as
we conveniently traverse the cgroup tree from top to bottom there and
check memory.low on each level anyway. So it's a perfect place to
calculate the effective memory.low and save it for use by children
cgroups.

This also eliminates the need to traverse the cgroup tree from bottom
to top each time to check whether the parent's guarantee is exceeded.

Setting/resetting the effective memory.low is intentionally racy, but
that's fine: it shouldn't lead to any significant differences in the
actual memory distribution.

With this patch applied, the results match the expectations:
  A:    2146160640
  A/B:  1427795968
  A/C:  717705216
  A/D:  659456
  A/E:  0

Test script:
  #!/bin/bash

  CGPATH="/sys/fs/cgroup"

  truncate /file1 --size 2G
  truncate /file2 --size 2G
  truncate /file3 --size 2G
  truncate /file4 --size 50G

  mkdir "${CGPATH}/A"
  echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
  mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"

  echo 2G > "${CGPATH}/A/memory.low"
  echo 3G > "${CGPATH}/A/B/memory.low"
  echo 1G > "${CGPATH}/A/C/memory.low"
  echo 0 > "${CGPATH}/A/D/memory.low"
  echo 10G > "${CGPATH}/A/E/memory.low"

  echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
  echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
  echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
  echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4

  echo "A:   " `cat "${CGPATH}/A/memory.current"`
  echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
  echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
  echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
  echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`

  rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
  rmdir "${CGPATH}/A"
  rm /file1 /file2 /file3 /file4

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/memcontrol.h |  4 +++
 mm/memcontrol.c            | 64 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4525b4404a9e..a95a2e9938b2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -180,8 +180,12 @@ struct mem_cgroup {
 
 	/* Normal memory consumption range */
 	unsigned long low;
+	unsigned long e_low;
 	unsigned long high;
 
+	atomic_long_t low_usage;
+	atomic_long_t s_low_usage;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3801ac1fcfbc..5af3199451f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1672,6 +1672,36 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
+static void memcg_update_low(struct mem_cgroup *memcg)
+{
+	unsigned long usage, low_usage, prev_low_usage;
+	struct mem_cgroup *parent;
+	long delta;
+
+	do {
+		parent = parent_mem_cgroup(memcg);
+		if (!parent || mem_cgroup_is_root(parent))
+			break;
+
+		if (!memcg->low && !atomic_long_read(&memcg->low_usage))
+			break;
+
+		usage = page_counter_read(&memcg->memory);
+		if (usage < memcg->low)
+			low_usage = usage;
+		else
+			low_usage = 0;
+
+		prev_low_usage = atomic_long_xchg(&memcg->low_usage, low_usage);
+		delta = low_usage - prev_low_usage;
+		if (delta == 0)
+			break;
+
+		atomic_long_add(delta, &parent->s_low_usage);
+
+	} while ((memcg = parent));
+}
+
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
@@ -1726,6 +1756,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
+		memcg_update_low(old);
 		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
@@ -2017,11 +2048,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
 	css_get_many(&memcg->css, nr_pages);
+	memcg_update_low(memcg);
 
 	return 0;
 
 done_restock:
 	css_get_many(&memcg->css, batch);
+	memcg_update_low(memcg);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
@@ -2059,6 +2092,7 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
+	memcg_update_low(memcg);
 	css_put_many(&memcg->css, nr_pages);
 }
 
@@ -2396,6 +2430,7 @@ void memcg_kmem_uncharge(struct page *page, int order)
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
 
+	memcg_update_low(memcg);
 	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* !CONFIG_SLOB */
@@ -5642,6 +5677,9 @@ struct cgroup_subsys memory_cgrp_subsys = {
  */
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 {
+	unsigned long usage, low_usage, s_low_usage, p_e_low, e_low;
+	struct mem_cgroup *parent;
+
 	if (mem_cgroup_disabled())
 		return false;
 
@@ -5650,12 +5688,29 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 	if (memcg == root)
 		return false;
 
-	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-		if (page_counter_read(&memcg->memory) >= memcg->low)
-			return false;
+	e_low = memcg->low;
+	usage = page_counter_read(&memcg->memory);
+
+	parent = parent_mem_cgroup(memcg);
+	if (mem_cgroup_is_root(parent))
+		goto exit;
+
+	p_e_low = parent->e_low;
+	e_low = min(e_low, p_e_low);
+
+	if (e_low && p_e_low) {
+		low_usage = min(usage, memcg->low);
+		s_low_usage = atomic_long_read(&parent->s_low_usage);
+		if (!low_usage || !s_low_usage)
+			goto exit;
+
+		e_low = min(e_low, p_e_low * low_usage / s_low_usage);
 	}
 
-	return true;
+exit:
+	memcg->e_low = e_low;
+
+	return usage < e_low;
 }
 
 /**
@@ -5829,6 +5884,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
 			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
 		memcg_oom_recover(ug->memcg);
+		memcg_update_low(ug->memcg);
 	}
 
 	local_irq_save(flags);
-- 
2.14.3
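
A note on the delta propagation in memcg_update_low() above: each
level publishes its own low usage with an atomic exchange and forwards
only the signed difference to the parent's sum, so concurrent updaters
never double-count. Below is a userspace model of just this part, with
C11 atomics standing in for atomic_long_t; the early-exit checks of
the kernel version are simplified away:

  #include <stdatomic.h>

  struct cg {
          struct cg *parent;
          unsigned long low;
          unsigned long usage;       /* stands in for page_counter_read() */
          atomic_long low_usage;
          atomic_long s_low_usage;   /* sum over the children */
  };

  static void update_low(struct cg *cg)
  {
          while (cg->parent) {
                  long low_usage, prev, delta;

                  /* protected only while fully below memory.low */
                  low_usage = cg->usage < cg->low ? cg->usage : 0;

                  prev = atomic_exchange(&cg->low_usage, low_usage);
                  delta = low_usage - prev;
                  if (!delta)
                          break;     /* ancestors are already up to date */

                  /* only the delta moves up, so the parent's sum stays
                   * consistent under concurrent child updates */
                  atomic_fetch_add(&cg->parent->s_low_usage, delta);
                  cg = cg->parent;
          }
  }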


* Re: [RFC] mm: memory.low hierarchical behavior
From: Johannes Weiner @ 2018-03-21 18:23 UTC
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, kernel-team, cgroups, linux-kernel

Hi Roman,

On Tue, Mar 20, 2018 at 10:33:53PM +0000, Roman Gushchin wrote:
> This patch aims to address an issue in the current memory.low semantics
> that makes it hard to use in a hierarchy where some leaf memory
> cgroups are more valuable than others.
> 
> For example, there are memcgs A, A/B, A/C, A/D and A/E:
> 
>   A      A/memory.low = 2G, A/memory.current = 6G
>  //\\
> BC  DE   B/memory.low = 3G  B/memory.usage = 2G
>          C/memory.low = 1G  C/memory.usage = 2G
>          D/memory.low = 0   D/memory.usage = 2G
>          E/memory.low = 10G E/memory.usage = 0
> 
> If we apply memory pressure, B, C and D are reclaimed at
> the same pace while A's usage exceeds 2G.
> This is obviously wrong, as B's usage is fully below B's memory.low,
> and C has 1G of protection as well.
> Also, A is pushed to a size below its own 2G memory.low, which is
> wrong as well.
> 
> A simple bash script (provided below) can be used to reproduce
> the problem. Current results are:
>   A:    1430097920
>   A/B:  711929856
>   A/C:  717426688
>   A/D:  741376
>   A/E:  0

Yes, this is a problem. And the behavior with your patch looks much
preferable to the status quo.

> To address the issue, the concept of effective memory.low is introduced.
> Effective memory.low is always less than or equal to the original
> memory.low. When there is no memory.low overcommitment (and also for
> top-level cgroups), these two values are equal.
> Otherwise it's a part of the parent's effective memory.low, calculated
> as the cgroup's memory.low usage divided by the sum of the siblings'
> memory.low usages (by memory.low usage I mean the size of actually
> protected memory: memory.current if memory.current < memory.low,
> 0 otherwise).

This hurts my brain.

Why is memory.current == memory.low (which should fully protect
memory.current) a low usage of 0?

Why is memory.current > memory.low not a low usage of memory.low?

I.e. shouldn't this be low_usage = min(memory.current, memory.low)?

> It's necessary to track the actual usage, because otherwise an empty
> cgroup with memory.low set (A/E in my example) would affect the actual
> memory distribution, which makes no sense.

Yep, that makes sense.

> Effective memory.low is always capped by the memory.low set by the
> user. That means it's not possible to obtain a larger guarantee than
> the memory.low set by the user, even if the corresponding part of the
> parent's guarantee is larger. This matches the existing semantics.

That's a complicated sentence for an intuitive concept: yes, we
wouldn't expect a group's protected usage to exceed its own memory.low
setting just because the parent's is higher. I'd drop this part.

> Calculating effective memory.low can be done in the reclaim path, as
> we conveniently traverse the cgroup tree from top to bottom there and
> check memory.low on each level anyway. So it's a perfect place to
> calculate the effective memory.low and save it for use by children
> cgroups.
>
> This also eliminates the need to traverse the cgroup tree from bottom
> to top each time to check whether the parent's guarantee is exceeded.
>
> Setting/resetting the effective memory.low is intentionally racy, but
> that's fine: it shouldn't lead to any significant differences in the
> actual memory distribution.
> 
> With this patch applied, the results match the expectations:
>   A:    2146160640
>   A/B:  1427795968
>   A/C:  717705216
>   A/D:  659456
>   A/E:  0

Very cool results.

Below some comments on the implementation.

> @@ -180,8 +180,12 @@ struct mem_cgroup {
>  
>  	/* Normal memory consumption range */
>  	unsigned long low;
> +	unsigned long e_low;
>  	unsigned long high;

I wouldn't mix those. low and high are what the user configures, e_low
is an internal state we calculate and maintain for dynamic enforcing.

Keep e_low with the low_usage variables?

Also please drop the underscore. elow is easier on the eyes.

> +	atomic_long_t low_usage;
> +	atomic_long_t s_low_usage;

children_low_usage would be clearer, I think.

> @@ -1672,6 +1672,36 @@ void unlock_page_memcg(struct page *page)
>  }
>  EXPORT_SYMBOL(unlock_page_memcg);
>  
> +static void memcg_update_low(struct mem_cgroup *memcg)
> +{
> +	unsigned long usage, low_usage, prev_low_usage;
> +	struct mem_cgroup *parent;
> +	long delta;
> +
> +	do {
> +		parent = parent_mem_cgroup(memcg);
> +		if (!parent || mem_cgroup_is_root(parent))
> +			break;
> +
> +		if (!memcg->low && !atomic_long_read(&memcg->low_usage))
> +			break;
> +
> +		usage = page_counter_read(&memcg->memory);
> +		if (usage < memcg->low)
> +			low_usage = usage;
> +		else
> +			low_usage = 0;
> +
> +		prev_low_usage = atomic_long_xchg(&memcg->low_usage, low_usage);
> +		delta = low_usage - prev_low_usage;
> +		if (delta == 0)
> +			break;
> +
> +		atomic_long_add(delta, &parent->s_low_usage);
> +
> +	} while ((memcg = parent));
> +}

This code could use some comments ;)

Something that explains that we're tracking the combined usage of the
children and what we're using that information for.

The conceptual descriptions you have in the changelog should be in the
code somewhere, to give a high-level overview of how we're enforcing
the low settings hierarchically.

> @@ -1726,6 +1756,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>  		page_counter_uncharge(&old->memory, stock->nr_pages);
>  		if (do_memsw_account())
>  			page_counter_uncharge(&old->memsw, stock->nr_pages);
> +		memcg_update_low(old);
>  		css_put_many(&old->css, stock->nr_pages);
>  		stock->nr_pages = 0;

The function is called every time the page counter changes and walks
up the hierarchy exactly the same way. That is a good sign that the low
usage tracking should really be part of the page counter code itself.

I think you also have to call it when memory.low changes, as that may
increase or decrease low usage just as much as when usage changes.

> @@ -5650,12 +5688,29 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
>  	if (memcg == root)
>  		return false;
>  
> -	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
> -		if (page_counter_read(&memcg->memory) >= memcg->low)
> -			return false;
> +	e_low = memcg->low;
> +	usage = page_counter_read(&memcg->memory);
> +
> +	parent = parent_mem_cgroup(memcg);
> +	if (mem_cgroup_is_root(parent))
> +		goto exit;

Shouldn't this test parent == root?

>
> +	p_e_low = parent->e_low;
> +	e_low = min(e_low, p_e_low);

These names need help! ;) elow and parent_elow?

> +	if (e_low && p_e_low) {
> +		low_usage = min(usage, memcg->low);
> +		s_low_usage = atomic_long_read(&parent->s_low_usage);

How about

		siblings_low_usage = atomic_long_read(&parent->children_low_usage);

It's a bit long, but much clearer. To keep the line short, you can
invert the branch with a goto exit and save one indentation level.

> +		if (!low_usage || !s_low_usage)
> +			goto exit;
> +
> +		e_low = min(e_low, p_e_low * low_usage / s_low_usage);
>  	}
>  
> -	return true;
> +exit:
> +	memcg->e_low = e_low;

The function has the name of a simple predicate tester, but because we
calculate effective low on the go, it only works when used as part of a
top-down hierarchy walk. I cannot think of a good way of making this
less surprising, but please at least document this prominently.

Thanks!


* Re: [RFC] mm: memory.low hierarchical behavior
From: Roman Gushchin @ 2018-03-21 19:08 UTC
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, kernel-team, cgroups, linux-kernel

Hi Johannes!

Thank you for the review!

I've answered the most important questions below.
As for all the other stylistic/naming/cosmetic issues, I've no
objections; I'll address them in v2.

> Hi Roman,
> 
> On Tue, Mar 20, 2018 at 10:33:53PM +0000, Roman Gushchin wrote:
> > This patch aims to address an issue in the current memory.low semantics
> > that makes it hard to use in a hierarchy where some leaf memory
> > cgroups are more valuable than others.
> > 
> > For example, there are memcgs A, A/B, A/C, A/D and A/E:
> > 
> >   A      A/memory.low = 2G, A/memory.current = 6G
> >  //\\
> > BC  DE   B/memory.low = 3G  B/memory.usage = 2G
> >          C/memory.low = 1G  C/memory.usage = 2G
> >          D/memory.low = 0   D/memory.usage = 2G
> >          E/memory.low = 10G E/memory.usage = 0
> > 
> > If we apply memory pressure, B, C and D are reclaimed at
> > the same pace while A's usage exceeds 2G.
> > This is obviously wrong, as B's usage is fully below B's memory.low,
> > and C has 1G of protection as well.
> > Also, A is pushed to a size below its own 2G memory.low, which is
> > wrong as well.
> > 
> > A simple bash script (provided below) can be used to reproduce
> > the problem. Current results are:
> >   A:    1430097920
> >   A/B:  711929856
> >   A/C:  717426688
> >   A/D:  741376
> >   A/E:  0
> 
> Yes, this is a problem. And the behavior with your patch looks much
> preferable to the status quo.
> 
> > To address the issue, the concept of effective memory.low is introduced.
> > Effective memory.low is always less than or equal to the original
> > memory.low. When there is no memory.low overcommitment (and also for
> > top-level cgroups), these two values are equal.
> > Otherwise it's a part of the parent's effective memory.low, calculated
> > as the cgroup's memory.low usage divided by the sum of the siblings'
> > memory.low usages (by memory.low usage I mean the size of actually
> > protected memory: memory.current if memory.current < memory.low,
> > 0 otherwise).
> 
> This hurts my brain.
> 
> Why is memory.current == memory.low (which should fully protect
> memory.current) a low usage of 0?
> 
> Why is memory.current > memory.low not a low usage of memory.low?
> 
> I.e. shouldn't this be low_usage = min(memory.current, memory.low)?

This is really the non-trivial part.

Let's look at an example:
memcg A   (memory.current = 4G, memory.low = 2G)
memcg A/B (memory.current = 2G, memory.low = 2G)
memcg A/C (memory.current = 2G, memory.low = 1G)

If we calculate effective memory.low using your definition
before any reclaim, we end up with the following:
A/B  2G * 2G / (2G + 1G) = 4/3G
A/C  2G * 1G / (2G + 1G) = 2/3G

Looks good, but both cgroups' usages are above their effective limits.
When memory pressure is applied, both are reclaimed at the same pace.
While both B and C are getting smaller and smaller, their weights
and effective low limits are getting closer and closer, but
still below their usages. This ends up with both cgroups having
a size of 1G, which is obviously wrong.

Fundamentally the problem is that memory.low doesn't define
the reclaim speed, just yes or no. So, if there are children cgroups,
some of which are below their memory.low, and some above (as in the example),
it's crucially important to reclaim unprotected memory first.

This is exactly what my code does: as soon as memory.current is larger
than memory.low, we don't treat the cgroup's memory as protected at all,
so it doesn't affect the effective limits of sibling cgroups.
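
The divergence is easy to check with a toy model of this reclaim loop
(a hypothetical simulation, not kernel code: it recomputes the
effective limits each round and reclaims a fixed amount from every
cgroup above its limit):

  #include <stdio.h>

  static double min_d(double a, double b) { return a < b ? a : b; }

  int main(void)
  {
          double b = 2048, c = 2048;              /* usages, in MB */
          const double b_low = 2048, c_low = 1024;
          const double parent_elow = 2048;

          for (int round = 0; round < 4096; round++) {
                  /* low_usage = min(usage, low), as proposed above */
                  double siblings = min_d(b, b_low) + min_d(c, c_low);
                  double b_elow = min_d(b_low, parent_elow *
                                               min_d(b, b_low) / siblings);
                  double c_elow = min_d(c_low, parent_elow *
                                               min_d(c, c_low) / siblings);

                  if (b > b_elow)
                          b -= 1;
                  if (c > c_elow)
                          c -= 1;
          }

          /* prints "B = 1024MB, C = 1024MB": both converge to 1G,
           * even though B's memory.low is twice C's */
          printf("B = %.0fMB, C = %.0fMB\n", b, c);
          return 0;
  }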

> 
> > It's necessary to track the actual usage, because otherwise an empty
> > cgroup with memory.low set (A/E in my example) would affect the actual
> > memory distribution, which makes no sense.
> 
> Yep, that makes sense.
> 
> > Effective memory.low is always capped by the memory.low set by the
> > user. That means it's not possible to obtain a larger guarantee than
> > the memory.low set by the user, even if the corresponding part of the
> > parent's guarantee is larger. This matches the existing semantics.
> 
> That's a complicated sentence for an intuitive concept: yes, we
> wouldn't expect a group's protected usage to exceed its own memory.low
> setting just because the parent's is higher. I'd drop this part.
> 
> > Calculating effective memory.low can be done in the reclaim path, as
> > we conveniently traverse the cgroup tree from top to bottom there and
> > check memory.low on each level anyway. So it's a perfect place to
> > calculate the effective memory.low and save it for use by children
> > cgroups.
> >
> > This also eliminates the need to traverse the cgroup tree from bottom
> > to top each time to check whether the parent's guarantee is exceeded.
> >
> > Setting/resetting the effective memory.low is intentionally racy, but
> > that's fine: it shouldn't lead to any significant differences in the
> > actual memory distribution.
> > 
> > With this patch applied, the results match the expectations:
> >   A:    2146160640
> >   A/B:  1427795968
> >   A/C:  717705216
> >   A/D:  659456
> >   A/E:  0
> 
> Very cool results.
> 
> Below some comments on the implementation.
> 
> > +static void memcg_update_low(struct mem_cgroup *memcg)
> > +{
> > +	unsigned long usage, low_usage, prev_low_usage;
> > +	struct mem_cgroup *parent;
> > +	long delta;
> > +
> > +	do {
> > +		parent = parent_mem_cgroup(memcg);
> > +		if (!parent || mem_cgroup_is_root(parent))
> > +			break;
> > +
> > +		if (!memcg->low && !atomic_long_read(&memcg->low_usage))
> > +			break;
> > +
> > +		usage = page_counter_read(&memcg->memory);
> > +		if (usage < memcg->low)
> > +			low_usage = usage;
> > +		else
> > +			low_usage = 0;
> > +
> > +		prev_low_usage = atomic_long_xchg(&memcg->low_usage, low_usage);
> > +		delta = low_usage - prev_low_usage;
> > +		if (delta == 0)
> > +			break;
> > +
> > +		atomic_long_add(delta, &parent->s_low_usage);
> > +
> > +	} while ((memcg = parent));
> > +}
> 
> This code could use some comments ;)
> 
> Something that explains that we're tracking the combined usage of the
> children and what we're using that information for.
> 
> The conceptual descriptions you have in the changelog should be in the
> code somewhere, to give a high-level overview of how we're enforcing
> the low settings hierarchically.
> 
> > @@ -1726,6 +1756,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
> >  		page_counter_uncharge(&old->memory, stock->nr_pages);
> >  		if (do_memsw_account())
> >  			page_counter_uncharge(&old->memsw, stock->nr_pages);
> > +		memcg_update_low(old);
> >  		css_put_many(&old->css, stock->nr_pages);
> >  		stock->nr_pages = 0;
> 
> The function is called every time the page counter changes and walks
> up the hierarchy exactly the same way. That is a good sign that the low
> usage tracking should really be part of the page counter code itself.

I thought about it, but the problem is that page counters are used for
accounting swap, kmem, tcpmem (for v1), where low limit calculations are
not applicable. I've no idea how to add them nicely and without excessive
overhead.
Also, the good news is that it's possible to avoid any tracking until
a user actually overcommits memory.low guarantees. I plan to implement
this optimization in a separate patch.

> 
> I think you also have to call it when memory.low changes, as that may
> increase or decrease low usage just as much as when usage changes.

Yes, you're right. There will likely not be much difference in practice,
but you're totally correct. Will fix this.

Thank you!


* [PATCH v2] mm: memory.low hierarchical behavior
From: Roman Gushchin @ 2018-03-23 16:37 UTC
  To: linux-mm
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Tejun Heo, kernel-team, cgroups, linux-kernel

This patch aims to address an issue in the current memory.low semantics
that makes it hard to use in a hierarchy where some leaf memory
cgroups are more valuable than others.

For example, there are memcgs A, A/B, A/C, A/D and A/E:

  A      A/memory.low = 2G, A/memory.current = 6G
 //\\
BC  DE   B/memory.low = 3G  B/memory.current = 2G
         C/memory.low = 1G  C/memory.current = 2G
         D/memory.low = 0   D/memory.current = 2G
         E/memory.low = 10G E/memory.current = 0

If we apply memory pressure, B, C and D are reclaimed at
the same pace while A's usage exceeds 2G.
This is obviously wrong, as B's usage is fully below B's memory.low,
and C has 1G of protection as well.
Also, A is pushed to a size below its own 2G memory.low, which is
wrong as well.

A simple bash script (provided below) can be used to reproduce
the problem. Current results are:
  A:    1430097920
  A/B:  711929856
  A/C:  717426688
  A/D:  741376
  A/E:  0

To address the issue, the concept of effective memory.low is introduced.
Effective memory.low is always less than or equal to the original
memory.low. When there is no memory.low overcommitment (and also for
top-level cgroups), these two values are equal.
Otherwise it's a part of the parent's effective memory.low, calculated
as the cgroup's memory.low usage divided by the sum of the siblings'
memory.low usages (by memory.low usage I mean the size of actually
protected memory: memory.current if memory.current < memory.low,
0 otherwise).
It's necessary to track the actual usage, because otherwise an empty
cgroup with memory.low set (A/E in my example) would affect the actual
memory distribution, which makes no sense.

Calculating effective memory.low can be done in the reclaim path, as
we conveniently traverse the cgroup tree from top to bottom there and
check memory.low on each level anyway. So it's a perfect place to
calculate the effective memory.low and save it for use by children
cgroups.

This also eliminates the need to traverse the cgroup tree from bottom
to top each time to check whether the parent's guarantee is exceeded.

Setting/resetting the effective memory.low is intentionally racy, but
that's fine: it shouldn't lead to any significant differences in the
actual memory distribution.

With this patch applied, the results match the expectations:
  A:    2143293440
  A/B:  1424736256
  A/C:  717766656
  A/D:  790528
  A/E:  0

Test script:
  #!/bin/bash

  CGPATH="/sys/fs/cgroup"

  truncate /file1 --size 2G
  truncate /file2 --size 2G
  truncate /file3 --size 2G
  truncate /file4 --size 50G

  mkdir "${CGPATH}/A"
  echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
  mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"

  echo 2G > "${CGPATH}/A/memory.low"
  echo 3G > "${CGPATH}/A/B/memory.low"
  echo 1G > "${CGPATH}/A/C/memory.low"
  echo 0 > "${CGPATH}/A/D/memory.low"
  echo 10G > "${CGPATH}/A/E/memory.low"

  echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
  echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
  echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
  echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4

  echo "A:   " `cat "${CGPATH}/A/memory.current"`
  echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
  echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
  echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
  echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`

  rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
  rmdir "${CGPATH}/A"
  rm /file1 /file2 /file3 /file4

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/memcontrol.h |   8 +++
 mm/memcontrol.c            | 140 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 123 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44422e1d3def..59873cb99093 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -182,6 +182,14 @@ struct mem_cgroup {
 	unsigned long low;
 	unsigned long high;
 
+	/*
+	 * Effective memory.low and memory.low usage tracking.
+	 * Please, refer to mem_cgroup_low() for more details.
+	 */
+	unsigned long elow;
+	atomic_long_t low_usage;
+	atomic_long_t children_low_usage;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 636f3dc7b53a..24afbf12b9a6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1663,6 +1663,36 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
+static void memcg_update_low(struct mem_cgroup *memcg)
+{
+	unsigned long usage, low_usage, prev_low_usage;
+	struct mem_cgroup *parent;
+	long delta;
+
+	do {
+		parent = parent_mem_cgroup(memcg);
+		if (!parent || mem_cgroup_is_root(parent))
+			break;
+
+		if (!memcg->low && !atomic_long_read(&memcg->low_usage))
+			break;
+
+		usage = page_counter_read(&memcg->memory);
+		if (usage < memcg->low)
+			low_usage = usage;
+		else
+			low_usage = 0;
+
+		prev_low_usage = atomic_long_xchg(&memcg->low_usage, low_usage);
+		delta = low_usage - prev_low_usage;
+		if (delta == 0)
+			break;
+
+		atomic_long_add(delta, &parent->children_low_usage);
+
+	} while ((memcg = parent));
+}
+
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
@@ -1717,6 +1747,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
+		memcg_update_low(old);
 		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
@@ -2008,11 +2039,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
 	css_get_many(&memcg->css, nr_pages);
+	memcg_update_low(memcg);
 
 	return 0;
 
 done_restock:
 	css_get_many(&memcg->css, batch);
+	memcg_update_low(memcg);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 
@@ -2050,6 +2083,7 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
+	memcg_update_low(memcg);
 	css_put_many(&memcg->css, nr_pages);
 }
 
@@ -2396,6 +2430,7 @@ void memcg_kmem_uncharge(struct page *page, int order)
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
 
+	memcg_update_low(memcg);
 	css_put_many(&memcg->css, nr_pages);
 }
 #endif /* !CONFIG_SLOB */
@@ -4500,6 +4535,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	spin_unlock(&memcg->event_list_lock);
 
 	memcg->low = 0;
+	memcg_update_low(memcg);
 
 	memcg_offline_kmem(memcg);
 	wb_memcg_offline(memcg);
@@ -4554,6 +4590,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_limit(&memcg->kmem, PAGE_COUNTER_MAX);
 	page_counter_limit(&memcg->tcpmem, PAGE_COUNTER_MAX);
 	memcg->low = 0;
+	memcg_update_low(memcg);
 	memcg->high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	memcg_wb_domain_size_changed(memcg);
@@ -5316,6 +5353,7 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
 		return err;
 
 	memcg->low = low;
+	memcg_update_low(memcg);
 
 	return nbytes;
 }
@@ -5612,36 +5650,69 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
- * Returns %true if memory consumption of @memcg, and that of all
- * ancestors up to (but not including) @root, is below the normal range.
+ * Returns %true if memory consumption of @memcg is below the normal range.
+ *
+ * @root is exclusive; it is never low when looked at directly
  *
- * @root is exclusive; it is never low when looked at directly and isn't
- * checked when traversing the hierarchy.
+ * To provide a proper hierarchical behavior, the effective memory.low
+ * value is used.
  *
- * Excluding @root enables using memory.low to prioritize memory usage
- * between cgroups within a subtree of the hierarchy that is limited by
- * memory.high or memory.max.
+ * Effective memory.low is always less than or equal to the original
+ * memory.low. If there is no memory.low overcommitment (which is
+ * always true for top-level memory cgroups), these two values are equal.
+ * Otherwise, it's a part of the parent's effective memory.low,
+ * calculated as a cgroup's memory.low usage divided by the sum of the
+ * siblings' memory.low usages, where memory.low usage is the size of
+ * actually protected memory.
  *
- * For example, given cgroup A with children B and C:
+ *                                             low_usage
+ * elow = min( memory.low, parent->elow * ------------------ ),
+ *                                        siblings_low_usage
  *
- *    A
- *   / \
- *  B   C
+ *             | memory.current, if memory.current < memory.low
+ * low_usage = |
+ *             | 0, otherwise.
  *
- * and
  *
- *  1. A/memory.current > A/memory.high
- *  2. A/B/memory.current < A/B/memory.low
- *  3. A/C/memory.current >= A/C/memory.low
+ * Such a definition of the effective memory.low provides the expected
+ * hierarchical behavior: the parent's memory.low value limits its
+ * children, unprotected memory is reclaimed first, and cgroups which
+ * are not using their guarantee do not affect the actual memory
+ * distribution.
  *
- * As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we
- * should reclaim from 'C' until 'A' is no longer high or until we can
- * no longer reclaim from 'C'.  If 'A', i.e. @root, isn't excluded by
- * mem_cgroup_low when reclaming from 'A', then 'B' won't be considered
- * low and we will reclaim indiscriminately from both 'B' and 'C'.
+ * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
+ *
+ *     A      A/memory.low = 2G, A/memory.current = 6G
+ *    //\\
+ *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
+ *            C/memory.low = 1G  C/memory.current = 2G
+ *            D/memory.low = 0   D/memory.current = 2G
+ *            E/memory.low = 10G E/memory.current = 0
+ *
+ * and the memory pressure is applied, the following memory distribution
+ * is expected (approximately):
+ *
+ *     A/memory.current = 2G
+ *
+ *     B/memory.current = 1.3G
+ *     C/memory.current = 0.6G
+ *     D/memory.current = 0
+ *     E/memory.current = 0
+ *
+ * These calculations require constant tracking of the actual low usages
+ * (see memcg_update_low()), as well as recursive calculation of
+ * effective memory.low values. But as we do call mem_cgroup_low()
+ * for each memory cgroup top-down from the reclaim path,
+ * it's possible to optimize this part and save the calculated elow
+ * for the next usage. This part is intentionally racy, but it's ok,
+ * as memory.low is a best-effort mechanism.
  */
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 {
+	unsigned long usage, low_usage, siblings_low_usage;
+	unsigned long elow, parent_elow;
+	struct mem_cgroup *parent;
+
 	if (mem_cgroup_disabled())
 		return false;
 
@@ -5650,12 +5721,30 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 	if (memcg == root)
 		return false;
 
-	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-		if (page_counter_read(&memcg->memory) >= memcg->low)
-			return false;
-	}
+	elow = memcg->low;
+	usage = page_counter_read(&memcg->memory);
 
-	return true;
+	parent = parent_mem_cgroup(memcg);
+	if (parent == root)
+		goto exit;
+
+	parent_elow = parent->elow;
+	elow = min(elow, parent_elow);
+
+	if (!elow || !parent_elow)
+		goto exit;
+
+	low_usage = min(usage, memcg->low);
+	siblings_low_usage = atomic_long_read(&parent->children_low_usage);
+	if (!low_usage || !siblings_low_usage)
+		goto exit;
+
+	elow = min(elow, parent_elow * low_usage / siblings_low_usage);
+
+exit:
+	memcg->elow = elow;
+
+	return usage < elow;
 }
 
 /**
@@ -5829,6 +5918,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
 			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
 		memcg_oom_recover(ug->memcg);
+		memcg_update_low(ug->memcg);
 	}
 
 	local_irq_save(flags);
-- 
2.14.3
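
The comment above mem_cgroup_low() describes a top-down contract that
is easy to miss in the diff: each call caches memcg->elow, and later
calls for the children read that cached value back. A compact
userspace sketch of just this reader side (illustrative names, no
atomics or cgroup iteration):

  struct cg {
          struct cg *parent;
          unsigned long low;
          unsigned long usage;
          unsigned long elow;               /* cached by the last walk */
          unsigned long children_low_usage;
  };

  static unsigned long min_ul(unsigned long a, unsigned long b)
  {
          return a < b ? a : b;
  }

  /* only valid as part of a top-down walk: cg->parent->elow must have
   * been computed by an earlier call of the same walk */
  static int cg_low(struct cg *root, struct cg *cg)
  {
          unsigned long elow = cg->low;

          if (cg == root)
                  return 0;

          if (cg->parent != root) {
                  unsigned long low_usage = min_ul(cg->usage, cg->low);

                  elow = min_ul(elow, cg->parent->elow);
                  if (elow && low_usage && cg->parent->children_low_usage)
                          elow = min_ul(elow, cg->parent->elow * low_usage /
                                              cg->parent->children_low_usage);
          }

          cg->elow = elow;        /* saved for this cgroup's children */
          return cg->usage < elow;
  }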


* Re: [RFC] mm: memory.low hierarchical behavior
From: Johannes Weiner @ 2018-04-04 17:07 UTC
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, kernel-team, cgroups, linux-kernel

On Wed, Mar 21, 2018 at 07:08:06PM +0000, Roman Gushchin wrote:
> > On Tue, Mar 20, 2018 at 10:33:53PM +0000, Roman Gushchin wrote:
> > > This patch aims to address an issue in the current memory.low semantics
> > > that makes it hard to use in a hierarchy where some leaf memory
> > > cgroups are more valuable than others.
> > > 
> > > For example, there are memcgs A, A/B, A/C, A/D and A/E:
> > > 
> > >   A      A/memory.low = 2G, A/memory.current = 6G
> > >  //\\
> > > BC  DE   B/memory.low = 3G  B/memory.usage = 2G
> > >          C/memory.low = 1G  C/memory.usage = 2G
> > >          D/memory.low = 0   D/memory.usage = 2G
> > >          E/memory.low = 10G E/memory.usage = 0
> > > 
> > > If we apply memory pressure, B, C and D are reclaimed at
> > > the same pace while A's usage exceeds 2G.
> > > This is obviously wrong, as B's usage is fully below B's memory.low,
> > > and C has 1G of protection as well.
> > > Also, A is pushed to a size below its own 2G memory.low, which is
> > > wrong as well.
> > > 
> > > A simple bash script (provided below) can be used to reproduce
> > > the problem. Current results are:
> > >   A:    1430097920
> > >   A/B:  711929856
> > >   A/C:  717426688
> > >   A/D:  741376
> > >   A/E:  0
> > 
> > Yes, this is a problem. And the behavior with your patch looks much
> > preferable to the status quo.
> > 
> > > To address the issue, the concept of effective memory.low is introduced.
> > > Effective memory.low is always less than or equal to the original
> > > memory.low. When there is no memory.low overcommitment (and also for
> > > top-level cgroups), these two values are equal.
> > > Otherwise it's a part of the parent's effective memory.low, calculated
> > > as the cgroup's memory.low usage divided by the sum of the siblings'
> > > memory.low usages (by memory.low usage I mean the size of actually
> > > protected memory: memory.current if memory.current < memory.low,
> > > 0 otherwise).
> > 
> > This hurts my brain.
> > 
> > Why is memory.current == memory.low (which should fully protect
> > memory.current) a low usage of 0?
> > 
> > Why is memory.current > memory.low not a low usage of memory.low?
> > 
> > I.e. shouldn't this be low_usage = min(memory.current, memory.low)?
> 
> This is really the non-trivial part.
> 
> Let's look at an example:
> memcg A   (memory.current = 4G, memory.low = 2G)
> memcg A/B (memory.current = 2G, memory.low = 2G)
> memcg A/C (memory.current = 2G, memory.low = 1G)
> 
> > If we calculate effective memory.low using your definition
> > before any reclaim, we end up with the following:
> A/B  2G * 2G / (2G + 1G) = 4/3G
> A/C  2G * 1G / (2G + 1G) = 2/3G
> 
> > Looks good, but both cgroups' usages are above their effective limits.
> > When memory pressure is applied, both are reclaimed at the same pace.
> > While both B and C are getting smaller and smaller, their weights
> > and effective low limits are getting closer and closer, but
> > still below their usages. This ends up with both cgroups having
> > a size of 1G, which is obviously wrong.
> 
> Fundamentally the problem is that memory.low doesn't define
> the reclaim speed, just yes or no. So, if there are children cgroups,
> some of which are below their memory.low, and some above (as in the example),
> it's crucially important to reclaim unprotected memory first.
> 
> This is exactly what my code does: as soon as memory.current is larger
> > than memory.low, we don't treat the cgroup's memory as protected at all,
> > so it doesn't affect the effective limits of sibling cgroups.

Okay, that explanation makes sense to me. Once you're in excess, your
memory is generally unprotected wrt your siblings until you're reined
in again.

It should still be usage <= low rather than usage < low, right? Since
you're protected up to and including what that number says.

> > > @@ -1726,6 +1756,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
> > >  		page_counter_uncharge(&old->memory, stock->nr_pages);
> > >  		if (do_memsw_account())
> > >  			page_counter_uncharge(&old->memsw, stock->nr_pages);
> > > +		memcg_update_low(old);
> > >  		css_put_many(&old->css, stock->nr_pages);
> > >  		stock->nr_pages = 0;
> > 
> > The function is called every time the page counter changes and walks
> > up the hierarchy exactly the same way. That is a good sign that the low
> > usage tracking should really be part of the page counter code itself.
> 
> I thought about it, but the problem is that page counters are used for
> accounting swap, kmem, tcpmem (for v1), where low limit calculations are
> > not applicable. I've no idea how to add them nicely and without excessive
> > overhead.
> > Also, the good news is that it's possible to avoid any tracking until
> a user actually overcommits memory.low guarantees. I plan to implement
> this optimization in a separate patch.

Hm, I'm not too worried about swap (not a sensitive path) or the other
users (per-cpu batched). It just adds a branch. How about the below?

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index c15ab80ad32d..95bdbca86751 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -9,8 +9,13 @@
 struct page_counter {
 	atomic_long_t count;
 	unsigned long limit;
+	unsigned long protected;
 	struct page_counter *parent;
 
+	/* Hierarchical, proportional protection */
+	atomic_long_t protected_count;
+	atomic_long_t children_protected_count;
+
 	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
@@ -42,6 +47,7 @@ bool page_counter_try_charge(struct page_counter *counter,
 			     struct page_counter **fail);
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_limit(struct page_counter *counter, unsigned long limit);
+void page_counter_protect(struct page_counter *counter, unsigned long protected);
 int page_counter_memparse(const char *buf, const char *max,
 			  unsigned long *nr_pages);
 
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 2a8df3ad60a4..e6f7665d13e3 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,6 +13,29 @@
 #include <linux/bug.h>
 #include <asm/page.h>
 
+static void propagate_protected(struct page_counter *c, unsigned long count)
+{
+	unsigned long protected_count;
+	unsigned long delta;
+	unsigned long old;
+
+	if (!c->parent)
+		return;
+
+	if (!c->protected && !atomic_long_read(&c->protected_count))
+		return;
+
+	if (count <= c->protected)
+		protected_count = count;
+	else
+		protected_count = 0;
+
+	old = atomic_long_xchg(&c->protected_count, protected_count);
+	delta = protected_count - old;
+	if (delta)
+		atomic_long_add(delta, &c->parent->children_protected_count);
+}
+
 /**
  * page_counter_cancel - take pages out of the local counter
  * @counter: counter
@@ -23,6 +46,7 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
 	long new;
 
 	new = atomic_long_sub_return(nr_pages, &counter->count);
+	propagate_protected(counter, new);
 	/* More uncharges than charges? */
 	WARN_ON_ONCE(new < 0);
 }
@@ -42,6 +66,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		long new;
 
 		new = atomic_long_add_return(nr_pages, &c->count);
+		propagate_protected(c, new);
 		/*
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
@@ -93,6 +118,7 @@ bool page_counter_try_charge(struct page_counter *counter,
 			*fail = c;
 			goto failed;
 		}
+		propagate_protected(c, new);
 		/*
 		 * Just like with failcnt, we can live with some
 		 * inaccuracy in the watermark.
@@ -164,6 +190,12 @@ int page_counter_limit(struct page_counter *counter, unsigned long limit)
 	}
 }
 
+void page_counter_protect(struct page_counter *counter, unsigned long protected)
+{
+	counter->protected = protected;
+	propagate_protected(counter, atomic_long_read(&counter->count));
+}
+
 /**
  * page_counter_memparse - memparse() for page counter limits
  * @buf: string to parse
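
For illustration, a hypothetical caller of this sketch (the wiring is
an assumption on my part; the follow-up below does effectively this,
with a "low" field instead of "protected"):

  /* hypothetical: route memory.low writes through the page counter,
   * so the protection tracking lives next to the count it depends on */
  static void memcg_set_protection(struct mem_cgroup *memcg,
                                   unsigned long protected)
  {
          page_counter_protect(&memcg->memory, protected);
  }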


* Re: [RFC] mm: memory.low hierarchical behavior
From: Roman Gushchin @ 2018-04-05 13:54 UTC
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, kernel-team, cgroups, linux-kernel

On Wed, Apr 04, 2018 at 01:07:00PM -0400, Johannes Weiner wrote:
> On Wed, Mar 21, 2018 at 07:08:06PM +0000, Roman Gushchin wrote:
> > > On Tue, Mar 20, 2018 at 10:33:53PM +0000, Roman Gushchin wrote:
> > > > This patch aims to address an issue in the current memory.low semantics
> > > > that makes it hard to use in a hierarchy where some leaf memory
> > > > cgroups are more valuable than others.
> > > > 
> > > > For example, there are memcgs A, A/B, A/C, A/D and A/E:
> > > > 
> > > >   A      A/memory.low = 2G, A/memory.current = 6G
> > > >  //\\
> > > > BC  DE   B/memory.low = 3G  B/memory.usage = 2G
> > > >          C/memory.low = 1G  C/memory.usage = 2G
> > > >          D/memory.low = 0   D/memory.usage = 2G
> > > >          E/memory.low = 10G E/memory.usage = 0
> > > > 
> > > > If we apply memory pressure, B, C and D are reclaimed at
> > > > the same pace while A's usage exceeds 2G.
> > > > This is obviously wrong, as B's usage is fully below B's memory.low,
> > > > and C has 1G of protection as well.
> > > > Also, A is pushed to a size below its own 2G memory.low, which is
> > > > wrong as well.
> > > > 
> > > > A simple bash script (provided below) can be used to reproduce
> > > > the problem. Current results are:
> > > >   A:    1430097920
> > > >   A/B:  711929856
> > > >   A/C:  717426688
> > > >   A/D:  741376
> > > >   A/E:  0
> > > 
> > > Yes, this is a problem. And the behavior with your patch looks much
> > > preferable to the status quo.
> > > 
> > > > To address the issue, the concept of effective memory.low is introduced.
> > > > Effective memory.low is always less than or equal to the original
> > > > memory.low. When there is no memory.low overcommitment (and also for
> > > > top-level cgroups), these two values are equal.
> > > > Otherwise it's a part of the parent's effective memory.low, calculated
> > > > as the cgroup's memory.low usage divided by the sum of the siblings'
> > > > memory.low usages (by memory.low usage I mean the size of actually
> > > > protected memory: memory.current if memory.current < memory.low,
> > > > 0 otherwise).
> > > 
> > > This hurts my brain.
> > > 
> > > Why is memory.current == memory.low (which should fully protect
> > > memory.current) a low usage of 0?
> > > 
> > > Why is memory.current > memory.low not a low usage of memory.low?
> > > 
> > > I.e. shouldn't this be low_usage = min(memory.current, memory.low)?
> > 
> > This is really the non-trivial part.
> > 
> > Let's look at an example:
> > memcg A   (memory.current = 4G, memory.low = 2G)
> > memcg A/B (memory.current = 2G, memory.low = 2G)
> > memcg A/C (memory.current = 2G, memory.low = 1G)
> > 
> > If we calculate effective memory.low using your definition
> > before any reclaim, we end up with the following:
> > A/B  2G * 2G / (2G + 1G) = 4/3G
> > A/C  2G * 1G / (2G + 1G) = 2/3G
> > 
> > Looks good, but both cgroups' usages are above their effective limits.
> > When memory pressure is applied, both are reclaimed at the same pace.
> > While both B and C are getting smaller and smaller, their weights
> > and effective low limits are getting closer and closer, but
> > still below their usages. This ends up with both cgroups having
> > a size of 1G, which is obviously wrong.
> > 
> > Fundamentally the problem is that memory.low doesn't define
> > the reclaim speed, just yes or no. So, if there are children cgroups,
> > some of which are below their memory.low, and some above (as in the example),
> > it's crucially important to reclaim unprotected memory first.
> > 
> > This is exactly what my code does: as soon as memory.current is larger
> > than memory.low, we don't treat the cgroup's memory as protected at all,
> > so it doesn't affect the effective limits of sibling cgroups.
> 
> Okay, that explanation makes sense to me. Once you're in excess, your
> memory is generally unprotected wrt your siblings until you're reined
> in again.
> 
> It should still be usage <= low rather than usage < low, right? Since
> you're protected up to and including what that number says.
> 
> > > > @@ -1726,6 +1756,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
> > > >  		page_counter_uncharge(&old->memory, stock->nr_pages);
> > > >  		if (do_memsw_account())
> > > >  			page_counter_uncharge(&old->memsw, stock->nr_pages);
> > > > +		memcg_update_low(old);
> > > >  		css_put_many(&old->css, stock->nr_pages);
> > > >  		stock->nr_pages = 0;
> > > 
> > > The function is called every time the page counter changes and walks
> > > up the hierarchy exactly the same way. That is a good sign that the low
> > > usage tracking should really be part of the page counter code itself.
> > 
> > I thought about it, but the problem is that page counters are used for
> > accounting swap, kmem, tcpmem (for v1), where low limit calculations are
> > not applicable. I've no idea how to add them nicely and without excessive
> > overhead.
> > Also, the good news is that it's possible to avoid any tracking until
> > a user actually overcommits memory.low guarantees. I plan to implement
> > this optimization in a separate patch.
> 
> Hm, I'm not too worried about swap (not a sensitive path) or the other
> users (per-cpu batched). It just adds a branch. How about the below?
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index c15ab80ad32d..95bdbca86751 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -9,8 +9,13 @@
>  struct page_counter {
>  	atomic_long_t count;
>  	unsigned long limit;
> +	unsigned long protected;
>  	struct page_counter *parent;
>  
> +	/* Hierarchical, proportional protection */
> +	atomic_long_t protected_count;
> +	atomic_long_t children_protected_count;
> +

I followed your approach, but without introducing the new "protected" term.
It looks cute in the usage tracking part, but a bit weird in mem_cgroup_low(),
and it's not clear how to reuse it for memory.min. I think we shouldn't
introduce a new term without strict necessity.

Also, I moved the low field from memcg to page_counter, which made
the code simpler and cleaner.

Does it look good to you?

Thanks!

--

From 70175f8370216ccf63454863977ad16c920a6e6b Mon Sep 17 00:00:00 2001
From: Roman Gushchin <guro@fb.com>
Date: Fri, 16 Mar 2018 14:20:15 +0000
Subject: [PATCH] mm: memory.low hierarchical behavior

This patch aims to address an issue in the current memory.low semantics
that makes it hard to use in a hierarchy where some leaf memory
cgroups are more valuable than others.

For example, there are memcgs A, A/B, A/C, A/D and A/E:

  A      A/memory.low = 2G, A/memory.current = 6G
 //\\
BC  DE   B/memory.low = 3G  B/memory.current = 2G
         C/memory.low = 1G  C/memory.current = 2G
         D/memory.low = 0   D/memory.current = 2G
         E/memory.low = 10G E/memory.current = 0

If we apply memory pressure, B, C and D are reclaimed at
the same pace while A's usage exceeds 2G.
This is obviously wrong, as B's usage is fully below B's memory.low,
and C has 1G of protection as well.
Also, A is pushed to a size below its own 2G memory.low, which is
wrong as well.

A simple bash script (provided below) can be used to reproduce
the problem. Current results are:
  A:    1430097920
  A/B:  711929856
  A/C:  717426688
  A/D:  741376
  A/E:  0

To address the issue, the concept of effective memory.low is introduced.
Effective memory.low is always less than or equal to the original
memory.low. When there is no memory.low overcommitment (and also for
top-level cgroups), these two values are equal.
Otherwise it's a part of the parent's effective memory.low, calculated
as the cgroup's memory.low usage divided by the sum of the siblings'
memory.low usages (by memory.low usage I mean the size of actually
protected memory: memory.current if memory.current < memory.low,
0 otherwise).
It's necessary to track the actual usage, because otherwise an empty
cgroup with memory.low set (A/E in my example) would affect the actual
memory distribution, which makes no sense. To avoid traversing
the cgroup tree twice, the page_counter code is reused.

Calculating effective memory.low can be done in the reclaim path, as
we conveniently traverse the cgroup tree from top to bottom there and
check memory.low on each level anyway. So it's a perfect place to
calculate the effective memory.low and save it for use by children
cgroups.

This also eliminates the need to traverse the cgroup tree from bottom
to top each time to check whether the parent's guarantee is exceeded.

Setting/resetting the effective memory.low is intentionally racy, but
that's fine: it shouldn't lead to any significant differences in the
actual memory distribution.

With this patch applied, the results match the expectations:
  A:    2140094464
  A/B:  1424838656
  A/C:  714326016
  A/D:  929792
  A/E:  0

Test script:
  #!/bin/bash

  CGPATH="/sys/fs/cgroup"

  truncate /file1 --size 2G
  truncate /file2 --size 2G
  truncate /file3 --size 2G
  truncate /file4 --size 50G

  mkdir "${CGPATH}/A"
  echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
  mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"

  echo 2G > "${CGPATH}/A/memory.low"
  echo 3G > "${CGPATH}/A/B/memory.low"
  echo 1G > "${CGPATH}/A/C/memory.low"
  echo 0 > "${CGPATH}/A/D/memory.low"
  echo 10G > "${CGPATH}/A/E/memory.low"

  echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
  echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
  echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
  echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4

  echo "A:   " `cat "${CGPATH}/A/memory.current"`
  echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
  echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
  echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
  echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`

  rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
  rmdir "${CGPATH}/A"
  rm /file1 /file2 /file3 /file4

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Cc: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/memcontrol.h   |   3 +-
 include/linux/page_counter.h |   7 +++
 mm/memcontrol.c              | 110 +++++++++++++++++++++++++++++++------------
 mm/page_counter.c            |  32 +++++++++++++
 4 files changed, 121 insertions(+), 31 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44422e1d3def..f1e62cf24ebf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -178,8 +178,7 @@ struct mem_cgroup {
 	struct page_counter kmem;
 	struct page_counter tcpmem;
 
-	/* Normal memory consumption range */
-	unsigned long low;
+	/* Upper bound of normal memory consumption range */
 	unsigned long high;
 
 	/* Range enforcement for interrupt charges */
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index c15ab80ad32d..e916595fb700 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -9,8 +9,14 @@
 struct page_counter {
 	atomic_long_t count;
 	unsigned long limit;
+	unsigned long low;
 	struct page_counter *parent;
 
+	/* effective memory.low and memory.low usage tracking */
+	unsigned long elow;
+	atomic_long_t low_usage;
+	atomic_long_t children_low_usage;
+
 	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
@@ -44,6 +50,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
 int page_counter_limit(struct page_counter *counter, unsigned long limit);
 int page_counter_memparse(const char *buf, const char *max,
 			  unsigned long *nr_pages);
+void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
 
 static inline void page_counter_reset_watermark(struct page_counter *counter)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 636f3dc7b53a..8f6132625974 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4499,7 +4499,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
-	memcg->low = 0;
+	page_counter_set_low(&memcg->memory, 0);
 
 	memcg_offline_kmem(memcg);
 	wb_memcg_offline(memcg);
@@ -4553,7 +4553,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_limit(&memcg->memsw, PAGE_COUNTER_MAX);
 	page_counter_limit(&memcg->kmem, PAGE_COUNTER_MAX);
 	page_counter_limit(&memcg->tcpmem, PAGE_COUNTER_MAX);
-	memcg->low = 0;
+	page_counter_set_low(&memcg->memory, 0);
 	memcg->high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	memcg_wb_domain_size_changed(memcg);
@@ -5293,7 +5293,7 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
 static int memory_low_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
-	unsigned long low = READ_ONCE(memcg->low);
+	unsigned long low = READ_ONCE(memcg->memory.low);
 
 	if (low == PAGE_COUNTER_MAX)
 		seq_puts(m, "max\n");
@@ -5315,7 +5315,7 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	memcg->low = low;
+	page_counter_set_low(&memcg->memory, low);
 
 	return nbytes;
 }
@@ -5612,36 +5612,69 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
- * Returns %true if memory consumption of @memcg, and that of all
- * ancestors up to (but not including) @root, is below the normal range.
+ * Returns %true if memory consumption of @memcg is below the normal range.
  *
- * @root is exclusive; it is never low when looked at directly and isn't
- * checked when traversing the hierarchy.
+ * @root is exclusive; it is never low when looked at directly
  *
- * Excluding @root enables using memory.low to prioritize memory usage
- * between cgroups within a subtree of the hierarchy that is limited by
- * memory.high or memory.max.
+ * To provide proper hierarchical behavior, the effective memory.low value
+ * is used.
  *
- * For example, given cgroup A with children B and C:
+ * Effective memory.low is always less than or equal to the original
+ * memory.low. If there is no memory.low overcommitment (which is always
+ * true for top-level memory cgroups), these two values are equal.
+ * Otherwise, it's a part of the parent's effective memory.low,
+ * calculated as the cgroup's memory.low usage divided by the sum of the
+ * siblings' memory.low usages, where memory.low usage is the size of
+ * actually protected memory.
  *
- *    A
- *   / \
- *  B   C
+ *                                             low_usage
+ * elow = min( memory.low, parent->elow * ------------------ ),
+ *                                        siblings_low_usage
  *
- * and
+ *             | memory.current, if memory.current < memory.low
+ * low_usage = |
+ *             | 0, otherwise.
  *
- *  1. A/memory.current > A/memory.high
- *  2. A/B/memory.current < A/B/memory.low
- *  3. A/C/memory.current >= A/C/memory.low
  *
- * As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we
- * should reclaim from 'C' until 'A' is no longer high or until we can
- * no longer reclaim from 'C'.  If 'A', i.e. @root, isn't excluded by
- * mem_cgroup_low when reclaming from 'A', then 'B' won't be considered
- * low and we will reclaim indiscriminately from both 'B' and 'C'.
+ * Such a definition of the effective memory.low provides the expected
+ * hierarchical behavior: the parent's memory.low value limits its
+ * children, unprotected memory is reclaimed first, and cgroups which
+ * are not using their guarantee do not affect the actual memory
+ * distribution.
+ *
+ * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
+ *
+ *     A      A/memory.low = 2G, A/memory.current = 6G
+ *    //\\
+ *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
+ *            C/memory.low = 1G  C/memory.current = 2G
+ *            D/memory.low = 0   D/memory.current = 2G
+ *            E/memory.low = 10G E/memory.current = 0
+ *
+ * and the memory pressure is applied, the following memory distribution
+ * is expected (approximately):
+ *
+ *     A/memory.current = 2G
+ *
+ *     B/memory.current = 1.3G
+ *     C/memory.current = 0.6G
+ *     D/memory.current = 0
+ *     E/memory.current = 0
+ *
+ * These calculations require constant tracking of the actual low usages
+ * (see propagate_protected()), as well as recursive calculation of
+ * effective memory.low values. But as we do call mem_cgroup_low()
+ * path for each memory cgroup top-down from the reclaim,
+ * it's possible to optimize this part, and save calculated elow
+ * for next usage. This part is intentionally racy, but it's ok,
+ * as memory.low is a best-effort mechanism.
  */
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 {
+	unsigned long usage, low_usage, siblings_low_usage;
+	unsigned long elow, parent_elow;
+	struct mem_cgroup *parent;
+
 	if (mem_cgroup_disabled())
 		return false;
 
@@ -5650,12 +5683,31 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 	if (memcg == root)
 		return false;
 
-	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-		if (page_counter_read(&memcg->memory) >= memcg->low)
-			return false;
-	}
+	elow = memcg->memory.low;
+	usage = page_counter_read(&memcg->memory);
 
-	return true;
+	parent = parent_mem_cgroup(memcg);
+	if (parent == root)
+		goto exit;
+
+	parent_elow = READ_ONCE(parent->memory.elow);
+	elow = min(elow, parent_elow);
+
+	if (!elow || !parent_elow)
+		goto exit;
+
+	low_usage = min(usage, memcg->memory.low);
+	siblings_low_usage = atomic_long_read(
+		&parent->memory.children_low_usage);
+	if (!low_usage || !siblings_low_usage)
+		goto exit;
+
+	elow = min(elow, parent_elow * low_usage / siblings_low_usage);
+
+exit:
+	memcg->memory.elow = elow;
+
+	return usage < elow;
 }
 
 /**
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 2a8df3ad60a4..1cba033957d4 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,6 +13,34 @@
 #include <linux/bug.h>
 #include <asm/page.h>
 
+static void propagate_low_usage(struct page_counter *c, unsigned long usage)
+{
+	unsigned long low_usage, old;
+	long delta;
+
+	if (!c->parent)
+		return;
+
+	if (!c->low && !atomic_long_read(&c->low_usage))
+		return;
+
+	if (usage <= c->low)
+		low_usage = usage;
+	else
+		low_usage = 0;
+
+	old = atomic_long_xchg(&c->low_usage, low_usage);
+	delta = low_usage - old;
+	if (delta)
+		atomic_long_add(delta, &c->parent->children_low_usage);
+}
+
+void page_counter_set_low(struct page_counter *c, unsigned long nr_pages)
+{
+	c->low = nr_pages;
+	propagate_low_usage(c, atomic_long_read(&c->count));
+}
+
 /**
  * page_counter_cancel - take pages out of the local counter
  * @counter: counter
@@ -23,6 +51,7 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
 	long new;
 
 	new = atomic_long_sub_return(nr_pages, &counter->count);
+	propagate_low_usage(counter, new);
 	/* More uncharges than charges? */
 	WARN_ON_ONCE(new < 0);
 }
@@ -42,6 +71,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		long new;
 
 		new = atomic_long_add_return(nr_pages, &c->count);
+		propagate_low_usage(c, new);
 		/*
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
@@ -85,6 +115,7 @@ bool page_counter_try_charge(struct page_counter *counter,
 		new = atomic_long_add_return(nr_pages, &c->count);
 		if (new > c->limit) {
 			atomic_long_sub(nr_pages, &c->count);
+			propagate_low_usage(c, new);
 			/*
 			 * This is racy, but we can live with some
 			 * inaccuracy in the failcnt.
@@ -93,6 +124,7 @@ bool page_counter_try_charge(struct page_counter *counter,
 			*fail = c;
 			goto failed;
 		}
+		propagate_low_usage(c, new);
 		/*
 		 * Just like with failcnt, we can live with some
 		 * inaccuracy in the watermark.
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC] mm: memory.low heirarchical behavior
  2018-04-05 13:54       ` Roman Gushchin
@ 2018-04-05 15:00         ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2018-04-05 15:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, kernel-team, cgroups, linux-kernel

On Thu, Apr 05, 2018 at 02:54:57PM +0100, Roman Gushchin wrote:
> On Wed, Apr 04, 2018 at 01:07:00PM -0400, Johannes Weiner wrote:
> > @@ -9,8 +9,13 @@
> >  struct page_counter {
> >  	atomic_long_t count;
> >  	unsigned long limit;
> > +	unsigned long protected;
> >  	struct page_counter *parent;
> >  
> > +	/* Hierarchical, proportional protection */
> > +	atomic_long_t protected_count;
> > +	atomic_long_t children_protected_count;
> > +
> 
> I followed your approach, but without introducing the new "protected" term.
> It looks cute in the usage tracking part, but a bit weird in mem_cgroup_low(),
> and it's not clear how to reuse it for memory.min. I think, we shouldn't
> introduce a new term without strict necessity.

I just found the "low_usage" term a bit confusing, but it's true, once
we have both min and low, "protected" is not descriptive enough.

> Also, I moved the low field from memcg to page_counter, what made
> the code simpler and cleaner.

Yep, makes sense.

> @@ -178,8 +178,7 @@ struct mem_cgroup {
>  	struct page_counter kmem;
>  	struct page_counter tcpmem;
>  
> -	/* Normal memory consumption range */
> -	unsigned long low;
> +	/* Upper bound of normal memory consumption range */
>  	unsigned long high;
>  
>  	/* Range enforcement for interrupt charges */
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index c15ab80ad32d..e916595fb700 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -9,8 +9,14 @@
>  struct page_counter {
>  	atomic_long_t count;
>  	unsigned long limit;
> +	unsigned long low;
>  	struct page_counter *parent;
>  
> +	/* effective memory.low and memory.low usage tracking */
> +	unsigned long elow;
> +	atomic_long_t low_usage;
> +	atomic_long_t children_low_usage;

It's still a bit messy and confusing.

1. It uses both "usage" and "count" for the same thing. Can we pick
one? I wouldn't mind changing count -> usage throughout.

2. "Limit" is its own term, but we're starting to add low and min now,
which are more cgroup-specific. The idea with page_counter (previously
res_counter) was to be generic infrastructure, but after a decade no
other user has materialized. We can just follow the cgroup naming IMO.

3. The page counter is always hierarchical and so "children_" is the
implied default. I think we should point out when it's local instead.

usage, max, low, (elow, low_usage, children_low_usage)?
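
Roughly, the struct might then look like this (just a sketch of the
naming, based on the fields already in this patch):

struct page_counter {
	atomic_long_t usage;		/* was: count */
	unsigned long max;		/* was: limit */
	unsigned long low;
	struct page_counter *parent;

	/* effective memory.low and memory.low usage tracking */
	unsigned long elow;
	atomic_long_t low_usage;
	atomic_long_t children_low_usage;

	/* legacy */
	unsigned long watermark;
	unsigned long failcnt;
};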

> @@ -5612,36 +5612,69 @@ struct cgroup_subsys memory_cgrp_subsys = {
>   * @root: the top ancestor of the sub-tree being checked
>   * @memcg: the memory cgroup to check
>   *
> - * Returns %true if memory consumption of @memcg, and that of all
> - * ancestors up to (but not including) @root, is below the normal range.
> + * Returns %true if memory consumption of @memcg is below the normal range.

Can you please add something like:

    * WARNING: This function is not stateless! It can only be used as part
    *          of a top-down tree iteration, not for isolated queries.

here?

> - * @root is exclusive; it is never low when looked at directly and isn't
> - * checked when traversing the hierarchy.
> + * @root is exclusive; it is never low when looked at directly
>   *
> - * Excluding @root enables using memory.low to prioritize memory usage
> - * between cgroups within a subtree of the hierarchy that is limited by
> - * memory.high or memory.max.
> + * To provide proper hierarchical behavior, the effective memory.low value
> + * is used.
>   *
> - * For example, given cgroup A with children B and C:
> + * Effective memory.low is always less than or equal to the original
> + * memory.low. If there is no memory.low overcommitment (which is always
> + * true for top-level memory cgroups), these two values are equal.
> + * Otherwise, it's a part of the parent's effective memory.low,
> + * calculated as the cgroup's memory.low usage divided by the sum of the
> + * siblings' memory.low usages, where memory.low usage is the size of
> + * actually protected memory.
>   *
> - *    A
> - *   / \
> - *  B   C
> + *                                             low_usage
> + * elow = min( memory.low, parent->elow * ------------------ ),
> + *                                        siblings_low_usage
>   *
> - * and
> + *             | memory.current, if memory.current < memory.low
> + * low_usage = |
> + *             | 0, otherwise.
>   *
> - *  1. A/memory.current > A/memory.high
> - *  2. A/B/memory.current < A/B/memory.low
> - *  3. A/C/memory.current >= A/C/memory.low
>   *
> - * As 'A' is high, i.e. triggers reclaim from 'A', and 'B' is low, we
> - * should reclaim from 'C' until 'A' is no longer high or until we can
> - * no longer reclaim from 'C'.  If 'A', i.e. @root, isn't excluded by
> - * mem_cgroup_low when reclaming from 'A', then 'B' won't be considered
> - * low and we will reclaim indiscriminately from both 'B' and 'C'.
> + * Such a definition of the effective memory.low provides the expected
> + * hierarchical behavior: the parent's memory.low value limits its
> + * children, unprotected memory is reclaimed first, and cgroups which
> + * are not using their guarantee do not affect the actual memory
> + * distribution.
> + *
> + * For example, if there are memcgs A, A/B, A/C, A/D and A/E:
> + *
> + *     A      A/memory.low = 2G, A/memory.current = 6G
> + *    //\\
> + *   BC  DE   B/memory.low = 3G  B/memory.current = 2G
> + *            C/memory.low = 1G  C/memory.current = 2G
> + *            D/memory.low = 0   D/memory.current = 2G
> + *            E/memory.low = 10G E/memory.current = 0
> + *
> + * and the memory pressure is applied, the following memory distribution
> + * is expected (approximately):
> + *
> + *     A/memory.current = 2G
> + *
> + *     B/memory.current = 1.3G
> + *     C/memory.current = 0.6G
> + *     D/memory.current = 0
> + *     E/memory.current = 0
> + *
> + * These calculations require constant tracking of the actual low usages
> + * (see propagate_protected()), as well as recursive calculation of

           propagate_low_usage()

> + * effective memory.low values. But as we do call mem_cgroup_low()
> + * path for each memory cgroup top-down from the reclaim,
> + * it's possible to optimize this part, and save calculated elow
> + * for next usage. This part is intentionally racy, but it's ok,
> + * as memory.low is a best-effort mechanism.
>   */
>  bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
>  {
> +	unsigned long usage, low_usage, siblings_low_usage;
> +	unsigned long elow, parent_elow;
> +	struct mem_cgroup *parent;
> +
>  	if (mem_cgroup_disabled())
>  		return false;
>  
> @@ -5650,12 +5683,31 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
>  	if (memcg == root)
>  		return false;
>  
> -	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
> -		if (page_counter_read(&memcg->memory) >= memcg->low)
> -			return false;
> -	}
> +	elow = memcg->memory.low;
> +	usage = page_counter_read(&memcg->memory);
>  
> -	return true;
> +	parent = parent_mem_cgroup(memcg);
> +	if (parent == root)
> +		goto exit;

This is odd newlining. Group the variable lookups instead and separate
them from the parent == root branch?

> +	parent_elow = READ_ONCE(parent->memory.elow);
> +	elow = min(elow, parent_elow);
> +
> +	if (!elow || !parent_elow)
> +		goto exit;

Like here ^

> +	low_usage = min(usage, memcg->memory.low);
> +	siblings_low_usage = atomic_long_read(
> +		&parent->memory.children_low_usage);
> +	if (!low_usage || !siblings_low_usage)
> +		goto exit;

Then this the same way.

> +	elow = min(elow, parent_elow * low_usage / siblings_low_usage);
> +
> +exit:
> +	memcg->memory.elow = elow;
> +
> +	return usage < elow;

These empty lines seem unnecessary, the label line is already a visual
break in the code flow.

> @@ -13,6 +13,34 @@
>  #include <linux/bug.h>
>  #include <asm/page.h>
>  
> +static void propagate_low_usage(struct page_counter *c, unsigned long usage)
> +{
> +	unsigned long low_usage, old;
> +	long delta;
> +
> +	if (!c->parent)
> +		return;
> +
> +	if (!c->low && !atomic_long_read(&c->low_usage))
> +		return;
> +
> +	if (usage <= c->low)
> +		low_usage = usage;
> +	else
> +		low_usage = 0;
> +
> +	old = atomic_long_xchg(&c->low_usage, low_usage);
> +	delta = low_usage - old;
> +	if (delta)
> +		atomic_long_add(delta, &c->parent->children_low_usage);
> +}
> +
> +void page_counter_set_low(struct page_counter *c, unsigned long nr_pages)
> +{
> +	c->low = nr_pages;
> +	propagate_low_usage(c, atomic_long_read(&c->count));

Actually I think I messed this up in my draft. When one level in the
tree changes its usage or low setting, the low usage needs to be
propagated up the tree. We do this for charge and try_charge, but
not here. page_counter_set_low() should have an ancestor walk.
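
Something like this, maybe (an untested sketch; it mirrors the walk in
page_counter_charge(), re-syncing the low usage on every level after
the limit changes):

void page_counter_set_low(struct page_counter *counter,
			  unsigned long nr_pages)
{
	struct page_counter *c;

	counter->low = nr_pages;

	/* re-propagate each level's low usage to its parent */
	for (c = counter; c; c = c->parent)
		propagate_low_usage(c, atomic_long_read(&c->count));
}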

Other than that, the patch looks great to me.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 0/2] memory.low,min reclaim
  2018-03-20 22:33 [RFC] mm: memory.low heirarchical behavior Roman Gushchin
  2018-03-21 18:23 ` Johannes Weiner
@ 2018-04-22 20:26 ` Greg Thelen
  2018-04-22 20:26   ` [RFC PATCH 1/2] memcg: fix memory.low Greg Thelen
                     ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: Greg Thelen @ 2018-04-22 20:26 UTC (permalink / raw)
  To: guro, Johannes Weiner, Andrew Morton, Michal Hocko
  Cc: Vladimir Davydov, Tejun Heo, Cgroups, kernel-team, Linux MM,
	LKML, Greg Thelen

Roman's previously posted memory.low,min patches add per memcg effective
low limit to detect overcommitment of parental limits.  But if we flip
low,min reclaim to bail if usage<{low,min} at any level, then we don't need
an effective low limit, which makes the code simpler.  When parent limits
are overcommitted memory.min will oom kill, which is more drastic but makes
the memory.low a simpler concept.  If memcg a/b wants oom kill before
reclaim, then give it to them.  It seems a bit strange for a/b/memory.low's
behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless
a/b.low+a/c.low exceed a.low).

Previous patches:
- mm: rename page_counter's count/limit into usage/max
- mm: memory.low hierarchical behavior
- mm: treat memory.low value inclusive
- mm/docs: describe memory.low refinements
- mm: introduce memory.min
- mm: move the high field from struct mem_cgroup to page_counter
 8 files changed, 405 insertions(+), 141 deletions(-)

I think there might be a simpler way (albeit one that doesn't yet
include Documentation):
- memcg: fix memory.low
- memcg: add memory.min
 3 files changed, 75 insertions(+), 6 deletions(-)

The idea of this alternate approach is for memory.low,min to avoid reclaim
if any portion of the under-consideration memcg ancestry is under its
respective limit.

Also, in either approach, I suspect we should mention the interaction with
numa constrained allocations (cpuset.mems, mempolicy, mbind).  For example,
if a numa agnostic memcg with a large memory.min happens to gobble up all of
node N1's memory, and a future task really wants node N1 memory (via
mempolicy), then we oom kill rather than reclaim or migrate memory.
Ideas:
a) oom kill the numa constrained allocator; that's what we've been doing in
   Google.  I can provide a patch if helpful.  I admit that it has
   shortcomings.
b) oom kill a memcg with memory.low protection if its TBD priority is lower
   than the allocating task.  Priority is a TBD concept.
c) consider migrating numa agnostic memory.low memory as a lighter-weight
   alternative to oom kill.

I extended Roman's nifty reclaim test:
  #!/bin/bash
  #
  # Uppercase cgroups can tolerate some reclaim (current > low).
  # Lowercase cgroups are intolerant of reclaim (current < low).
  #
  #    A     A/memory.low = 2G, A/memory.current = 6G
  #  // \\
  # bC   De  b/memory.low = 3G  b/memory.current = 2G
  #          C/memory.low = 1G  C/memory.current = 2G
  #          D/memory.low = 0   D/memory.current = 2G
  #          e/memory.low = 10G e/memory.current = 0
  #
  #    F     F/memory.low = 2G, F/memory.current = 4G
  #   / \
  #  g   H   g/memory.low = 3G  g/memory.current = 2G
  #          H/memory.low = 1G  H/memory.current = 2G
  #
  #    i     i/memory.low = 5G, i/memory.current = 4G
  #   / \
  #  j   K   j/memory.low = 3G  j/memory.current = 2G
  #          K/memory.low = 1G  K/memory.current = 2G
  #
  #    L     L/memory.low = 2G, L/memory.current = 4G, L/memory.max = 4G
  #   / \
  #  m   N   m/memory.low = 3G  m/memory.current = 2G
  #          N/memory.low = 1G  N/memory.current = 2G
  #
  # setting memory.min: warmup => after global pressure
  # A    : 6306372k => 3154336k
  # A/b  : 2102184k => 2101928k
  # A/C  : 2101936k => 1048352k
  # A/D  : 2102252k => 4056k   
  # A/e  : 0k       => 0k      
  # F    : 4204420k => 3150272k
  # F/g  : 2102188k => 2101912k
  # F/H  : 2102232k => 1048360k
  # i    : 4204652k => 4203884k
  # i/j  : 2102324k => 2101940k
  # i/K  : 2102328k => 2101944k
  # L    : 4189976k => 3147824k
  # L/m  : 2101980k => 2101956k
  # L/N  : 2087996k => 1045868k
  #
  # setting memory.min: warmup => after L/m antagonist
  # A    : 6306076k => 6305988k
  # A/b  : 2102152k => 2102128k
  # A/C  : 2101948k => 2101916k
  # A/D  : 2101976k => 2101944k
  # A/e  : 0k       => 0k      
  # F    : 4204156k => 4203832k
  # F/g  : 2102220k => 2101920k
  # F/H  : 2101936k => 2101912k
  # i    : 4204204k => 4203852k
  # i/j  : 2102256k => 2101936k
  # i/K  : 2101948k => 2101916k
  # L    : 4190012k => 3886352k
  # L/m  : 2101996k => 2837856k
  # L/N  : 2088016k => 1048496k
  #
  # setting memory.low: warmup => after global pressure
  # A    : 6306220k => 3154864k
  # A/b  : 2101964k => 2101940k
  # A/C  : 2102204k => 1047040k
  # A/D  : 2102052k => 5884k	
  # A/e  : 0k       => 0k	
  # F    : 4204192k => 3147888k
  # F/g  : 2101948k => 2101916k
  # F/H  : 2102244k => 1045972k
  # i    : 4204480k => 4204056k
  # i/j  : 2102008k => 2101976k
  # i/K  : 2102464k => 2102080k
  # L    : 4190028k => 3150192k
  # L/m  : 2102004k => 2101980k
  # L/N  : 2088024k => 1048212k
  #
  # setting memory.low: warmup => after L/m antagonist
  # A    : 6306360k => 6305960k
  # A/b  : 2101988k => 2101924k
  # A/C  : 2102192k => 2101916k
  # A/D  : 2102180k => 2102120k
  # A/e  : 0k       => 0k	
  # F    : 4203964k => 4203908k
  # F/g  : 2102016k => 2101992k
  # F/H  : 2101948k => 2101916k
  # i    : 4204408k => 4203988k
  # i/j  : 2101984k => 2101936k
  # i/K  : 2102424k => 2102052k
  # L    : 4189960k => 3886296k
  # L/m  : 2101968k => 2838704k
  # L/N  : 2087992k => 1047592k

  set -o errexit
  set -o nounset
  set -o pipefail
  
  LIM="$1"; shift
  ANTAGONIST="$1"; shift
  CGPATH=/tmp/cgroup
  
  vmtouch2() {
    rm -f "$2"
    (echo $BASHPID > "${CGPATH}/$1/cgroup.procs" && exec /tmp/mmap --loop 1 --file "$2" "$3")
  }
  
  vmtouch() {
    # twice to ensure slab caches are warmed up and all objs are charged to cgroup.
    vmtouch2 "$1" "$2" "$3"
    vmtouch2 "$1" "$2" "$3"
  }
  
  dump() {
    for i in A A/b A/C A/D A/e F F/g F/H i i/j i/K L L/m L/N; do
      printf "%-5s: %sk\n" $i $(($(cat "${CGPATH}/${i}/memory.current")/1024))
    done
  }
  
  rm -f /file_?
  if [[ -e "${CGPATH}/A" ]]; then
    rmdir ${CGPATH}/?/? ${CGPATH}/?
  fi
  echo "+memory" > "${CGPATH}/cgroup.subtree_control"
  mkdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/F/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/i/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/L/cgroup.subtree_control"
  mkdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  mkdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  mkdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  mkdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  
  echo 2G > "${CGPATH}/A/memory.${LIM}"
  echo 3G > "${CGPATH}/A/b/memory.${LIM}"
  echo 1G > "${CGPATH}/A/C/memory.${LIM}"
  echo 0 > "${CGPATH}/A/D/memory.${LIM}"
  echo 10G > "${CGPATH}/A/e/memory.${LIM}"
  
  echo 2G > "${CGPATH}/F/memory.${LIM}"
  echo 3G > "${CGPATH}/F/g/memory.${LIM}"
  echo 1G > "${CGPATH}/F/H/memory.${LIM}"
  
  echo 5G > "${CGPATH}/i/memory.${LIM}"
  echo 3G > "${CGPATH}/i/j/memory.${LIM}"
  echo 1G > "${CGPATH}/i/K/memory.${LIM}"
  
  echo 2G > "${CGPATH}/L/memory.${LIM}"
  echo 4G > "${CGPATH}/L/memory.max"
  echo 3G > "${CGPATH}/L/m/memory.${LIM}"
  echo 1G > "${CGPATH}/L/N/memory.${LIM}"
  
  vmtouch A/b /file_b 2G
  vmtouch A/C /file_C 2G
  vmtouch A/D /file_D 2G
  
  vmtouch F/g /file_g 2G
  vmtouch F/H /file_H 2G
  
  vmtouch i/j /file_j 2G
  vmtouch i/K /file_K 2G
  
  vmtouch L/m /file_m 2G
  vmtouch L/N /file_N 2G
  
  vmtouch2 "$ANTAGONIST" /file_ant 150G
  echo
  echo "after $ANTAGONIST antagonist"
  dump
  
  rmdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  rmdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  rmdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  rmdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  rmdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  rm /file_ant /file_b /file_C /file_D /file_g /file_H /file_j /file_K

Greg Thelen (2):
  memcg: fix memory.low
  memcg: add memory.min

 include/linux/memcontrol.h |  8 +++++
 mm/memcontrol.c            | 70 ++++++++++++++++++++++++++++++++++----
 mm/vmscan.c                |  3 ++
 3 files changed, 75 insertions(+), 6 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/2] memcg: fix memory.low
  2018-04-22 20:26 ` [RFC PATCH 0/2] memory.low,min reclaim Greg Thelen
@ 2018-04-22 20:26   ` Greg Thelen
  2018-04-22 20:26   ` [RFC PATCH 2/2] memcg: add memory.min Greg Thelen
  2018-04-23 10:38   ` [RFC PATCH 0/2] memory.low,min reclaim Roman Gushchin
  2 siblings, 0 replies; 13+ messages in thread
From: Greg Thelen @ 2018-04-22 20:26 UTC (permalink / raw)
  To: guro, Johannes Weiner, Andrew Morton, Michal Hocko
  Cc: Vladimir Davydov, Tejun Heo, Cgroups, kernel-team, Linux MM,
	LKML, Greg Thelen

When targeting reclaim to a memcg, protect that memcg from reclaim if
memory consumption at any level is below the respective memory.low.

Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/memcontrol.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 670e99b68aa6..9668f620203a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5341,8 +5341,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
- * Returns %true if memory consumption of @memcg, and that of all
- * ancestors up to (but not including) @root, is below the normal range.
+ * Returns %true if memory consumption of @memcg, or any of its ancestors
+ * up to (but not including) @root, is below the normal range.
  *
  * @root is exclusive; it is never low when looked at directly and isn't
  * checked when traversing the hierarchy.
@@ -5379,12 +5379,12 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
 	if (memcg == root)
 		return false;
 
+	/* If any level is under, then protect @memcg from reclaim */
 	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-		if (page_counter_read(&memcg->memory) >= memcg->low)
-			return false;
+		if (page_counter_read(&memcg->memory) <= memcg->low)
+			return true; /* protect from reclaim */
 	}
-
-	return true;
+	return false; /* not protected from reclaim */
 }
 
 /**
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 2/2] memcg: add memory.min
  2018-04-22 20:26 ` [RFC PATCH 0/2] memory.low,min reclaim Greg Thelen
  2018-04-22 20:26   ` [RFC PATCH 1/2] memcg: fix memory.low Greg Thelen
@ 2018-04-22 20:26   ` Greg Thelen
  2018-04-23 10:38   ` [RFC PATCH 0/2] memory.low,min reclaim Roman Gushchin
  2 siblings, 0 replies; 13+ messages in thread
From: Greg Thelen @ 2018-04-22 20:26 UTC (permalink / raw)
  To: guro, Johannes Weiner, Andrew Morton, Michal Hocko
  Cc: Vladimir Davydov, Tejun Heo, Cgroups, kernel-team, Linux MM,
	LKML, Greg Thelen

The new memory.min limit is similar to memory.low, except that there is no
bypassing it when reclaim is desperate.  Prefer oom kills over reclaiming
memory below memory.min.  Sharing more code with mem_cgroup_low() is
possible, but the idea is posted here for simplicity.

Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/memcontrol.h |  8 ++++++
 mm/memcontrol.c            | 58 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |  3 ++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c46016bb25eb..22bb4a88653a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -178,6 +178,7 @@ struct mem_cgroup {
 	struct page_counter tcpmem;
 
 	/* Normal memory consumption range */
+	unsigned long min;
 	unsigned long low;
 	unsigned long high;
 
@@ -281,6 +282,7 @@ static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg);
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
@@ -726,6 +728,12 @@ static inline void mem_cgroup_event(struct mem_cgroup *memcg,
 {
 }
 
+static inline bool mem_cgroup_min(struct mem_cgroup *root,
+				  struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_low(struct mem_cgroup *root,
 				  struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9668f620203a..b2aaed4003b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5058,6 +5058,36 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
 }
 
+static int memory_min_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long min = READ_ONCE(memcg->min);
+
+	if (min == PAGE_COUNTER_MAX)
+		seq_puts(m, "max\n");
+	else
+		seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long min;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &min);
+	if (err)
+		return err;
+
+	memcg->min = min;
+
+	return nbytes;
+}
+
 static int memory_low_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5288,6 +5318,12 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = memory_current_read,
 	},
+	{
+		.name = "min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_min_show,
+		.write = memory_min_write,
+	},
 	{
 		.name = "low",
 		.flags = CFTYPE_NOT_ON_ROOT,
@@ -5336,6 +5372,28 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.early_init = 0,
 };
 
+/**
+ * mem_cgroup_min returns true for a memcg below its min limit.  Such memcgs
+ * are exempt from reclaim.
+ */
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	if (!root)
+		root = root_mem_cgroup;
+
+	if (memcg == root)
+		return false;
+
+	for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
+		if (page_counter_read(&memcg->memory) <= memcg->min)
+			return true; /* protect */
+	}
+	return false; /* !protect */
+}
+
 /**
  * mem_cgroup_low - check if memory consumption is below the normal range
  * @root: the top ancestor of the sub-tree being checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd5dc3faaa57..15ae19a38ad5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,6 +2539,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			unsigned long reclaimed;
 			unsigned long scanned;
 
+			if (mem_cgroup_min(root, memcg))
+				continue;
+
 			if (mem_cgroup_low(root, memcg)) {
 				if (!sc->memcg_low_reclaim) {
 					sc->memcg_low_skipped = 1;
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/2] memory.low,min reclaim
  2018-04-22 20:26 ` [RFC PATCH 0/2] memory.low,min reclaim Greg Thelen
  2018-04-22 20:26   ` [RFC PATCH 1/2] memcg: fix memory.low Greg Thelen
  2018-04-22 20:26   ` [RFC PATCH 2/2] memcg: add memory.min Greg Thelen
@ 2018-04-23 10:38   ` Roman Gushchin
  2018-04-24  0:56     ` Greg Thelen
  2 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2018-04-23 10:38 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, Cgroups, kernel-team, Linux MM, LKML

Hi, Greg!

On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> Roman's previously posted memory.low,min patches add per memcg effective
> low limit to detect overcommitment of parental limits.  But if we flip
> low,min reclaim to bail if usage<{low,min} at any level, then we don't need
> an effective low limit, which makes the code simpler.  When parent limits
> are overcommited memory.min will oom kill, which is more drastic but makes
> the memory.low a simpler concept.  If memcg a/b wants oom kill before
> reclaim, then give it to them.  It seems a bit strange for a/b/memory.low's
> behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless
> a/b.low+a/c.low exceed a.low).

It's actually not strange: a/b and a/c are sharing a common resource:
a/memory.low.

Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
If there are sibling cgroups which are consuming memory, a cgroup can't
exceed the parent's memory.max, even if its own memory.max is greater.

> 
> I think there might be a simpler way (ableit it doesn't yet include
> Documentation):
> - memcg: fix memory.low
> - memcg: add memory.min
>  3 files changed, 75 insertions(+), 6 deletions(-)
> 
> The idea of this alternate approach is for memory.low,min to avoid reclaim
> if any portion of under-consideration memcg ancestry is under respective
> limit.

This approach has a significant downside: it breaks hierarchical constraints
for memory.low/min. There are two important outcomes:

1) Any leaf's memory.low/min value is respected, even if the parent's value
   is lower or even 0. It's not possible anymore to limit the amount of
   protected memory for a sub-tree.
   This is especially bad in case of delegation (see the sketch after
   this list).

2) If a cgroup has an ancestor with usage under its memory.low/min,
   it gains protection, even if its own memory.low/min is 0. So it becomes
   impossible to have unprotected cgroups in a protected sub-tree.
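
To make (1) concrete, here is a toy sketch (my illustration, not code
from either series; hypothetical values, and the single-ancestor case
only, where the proportional factor in the effective-low calculation
collapses to a min() over ancestors):

#include <stdbool.h>
#include <stdio.h>

struct cg {
	unsigned long low;	/* memory.low, in G */
	unsigned long usage;	/* memory.current, in G */
	struct cg *parent;
};

/* this series: protect if usage <= low on any ancestry level */
static bool protected_any_level(struct cg *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->usage <= cg->low)
			return true;
	return false;
}

/* effective memory.low, single-child simplification */
static unsigned long elow(struct cg *cg)
{
	unsigned long e = cg->low;

	if (cg->parent && elow(cg->parent) < e)
		e = elow(cg->parent);
	return e;
}

int main(void)
{
	struct cg a = { .low = 0,  .usage = 2, .parent = NULL };
	struct cg b = { .low = 10, .usage = 2, .parent = &a  };

	/* prints 1: a/b is protected although a/memory.low is 0 */
	printf("any-level protected: %d\n", protected_any_level(&b));
	/* prints 0: a's memory.low of 0 caps a/b's protection */
	printf("a/b elow: %luG\n", elow(&b));
	return 0;
}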

Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/2] memory.low,min reclaim
  2018-04-23 10:38   ` [RFC PATCH 0/2] memory.low,min reclaim Roman Gushchin
@ 2018-04-24  0:56     ` Greg Thelen
  2018-04-24 10:09       ` Roman Gushchin
  0 siblings, 1 reply; 13+ messages in thread
From: Greg Thelen @ 2018-04-24  0:56 UTC (permalink / raw)
  To: guro
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, Cgroups, kernel-team, Linux MM, LKML

On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin <guro@fb.com> wrote:

> Hi, Greg!

> On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> > Roman's previously posted memory.low,min patches add per memcg effective
> > low limit to detect overcommitment of parental limits.  But if we flip
> > low,min reclaim to bail if usage<{low,min} at any level, then we don't
> > need an effective low limit, which makes the code simpler.  When parent
> > limits are overcommitted memory.min will oom kill, which is more drastic
> > but makes the memory.low a simpler concept.  If memcg a/b wants oom kill
> > before reclaim, then give it to them.  It seems a bit strange for
> > a/b/memory.low's behaviour to depend on a/c/memory.low (i.e. a/b.low is
> > strong unless a/b.low+a/c.low exceed a.low).

> It's actually not strange: a/b and a/c are sharing a common resource:
> a/memory.low.

> Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
> If there are sibling cgroups which are consuming memory, a cgroup can't
> exceed the parent's memory.max, even if its own memory.max is greater.

> >
> > I think there might be a simpler way (albeit one that doesn't yet
> > include Documentation):
> > - memcg: fix memory.low
> > - memcg: add memory.min
> >  3 files changed, 75 insertions(+), 6 deletions(-)
> >
> > The idea of this alternate approach is for memory.low,min to avoid
> > reclaim if any portion of the under-consideration memcg ancestry is
> > under its respective limit.

> This approach has a significant downside: it breaks hierarchical
> constraints for memory.low/min. There are two important outcomes:
>
> 1) Any leaf's memory.low/min value is respected, even if the parent's value
>    is lower or even 0. It's not possible anymore to limit the amount of
>    protected memory for a sub-tree.
>    This is especially bad in case of delegation.

As someone who has been using something like memory.min for a while, I have
cases where it needs to be a strong protection.  Such jobs prefer oom kill
to reclaim.  These jobs know they need X MB of memory.  But I guess it's on
me to avoid configuring machines which overcommit memory.min at such cgroup
levels all the way to the root.

> 2) If a cgroup has an ancestor with usage under its memory.low/min,
>    it gains protection, even if its own memory.low/min is 0. So it becomes
>    impossible to have unprotected cgroups in a protected sub-tree.

Fair point.

One use case is a non trivial job which has several memory accounting
subcontainers.  Is there a way to only set memory.low at the top and have
it offer protection to the whole job?
The case I'm thinking of is:
% cd /cgroup
% echo +memory > cgroup.subtree_control
% mkdir top
% echo +memory > top/cgroup.subtree_control
% mkdir top/part1 top/part2
% echo 1GB > top/memory.min
% (echo $BASHPID > top/part1/cgroup.procs && part1)
% (echo $BASHPID > top/part2/cgroup.procs && part2)

Empirically it's been measured that the entire workload (/top) needs 1GB to
perform well.  But we don't care how the memory is distributed between
part1,part2.  Is the strategy for that to set /top.min, /top/part1.min, and
/top/part2.min each to 1GB?

What do you think about exposing emin and elow to user space?  I think that
would reduce admin/user confusion in situations where memory.min is
internally discounted.

(tangent) Delegation in v2 isn't something I've been able to fully
internalize yet.
The "no interior processes" rule challenges my notion of subdelegation.
My current model is where a system controller creates a container C with
C.min and then starts client manager process M in C.  Then M can choose
to further divide C's resources (e.g. C/S).  This doesn't seem possible
because v2 doesn't allow for interior processes.  So the system manager
would need to create C, set C.low, create C/sub_manager, create
C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start
M in C/sub_manager.  Then sub_manager can create and manage
C/sub_resources/S.

PS: Thanks for the memory.low and memory.min work.  Regardless of how we
proceed it's better than the upstream memory.soft_limit_in_bytes!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/2] memory.low,min reclaim
  2018-04-24  0:56     ` Greg Thelen
@ 2018-04-24 10:09       ` Roman Gushchin
  0 siblings, 0 replies; 13+ messages in thread
From: Roman Gushchin @ 2018-04-24 10:09 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, Cgroups, kernel-team, Linux MM, LKML

On Tue, Apr 24, 2018 at 12:56:09AM +0000, Greg Thelen wrote:
> On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin <guro@fb.com> wrote:
> 
> > Hi, Greg!
> 
> > On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> > > Roman's previously posted memory.low,min patches add per memcg effective
> > > low limit to detect overcommitment of parental limits.  But if we flip
> > > low,min reclaim to bail if usage<{low,min} at any level, then we don't
> > > need an effective low limit, which makes the code simpler.  When parent
> > > limits are overcommitted memory.min will oom kill, which is more drastic but
> > > makes the memory.low a simpler concept.  If memcg a/b wants oom kill before
> > > reclaim, then give it to them.  It seems a bit strange for a/b/memory.low's
> > > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless
> > > a/b.low+a/c.low exceed a.low).
> 
> > It's actually not strange: a/b and a/c are sharing a common resource:
> > a/memory.low.
> 
> > Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
> > If there are sibling cgroups which are consuming memory, a cgroup can't
> > exceed parent's memory.max, even if its memory.max is grater.
> 
> > >
> > > I think there might be a simpler way (albeit one that doesn't yet
> > > include Documentation):
> > > - memcg: fix memory.low
> > > - memcg: add memory.min
> > >  3 files changed, 75 insertions(+), 6 deletions(-)
> > >
> > > The idea of this alternate approach is for memory.low,min to avoid
> > > reclaim if any portion of the under-consideration memcg ancestry is
> > > under its respective limit.
> 
> > This approach has a significant downside: it breaks hierarchical
> > constraints for memory.low/min. There are two important outcomes:
> >
> > 1) Any leaf's memory.low/min value is respected, even if the parent's value
> >    is lower or even 0. It's not possible anymore to limit the amount of
> >    protected memory for a sub-tree.
> >    This is especially bad in case of delegation.
> 
> As someone who has been using something like memory.min for a while, I have
> cases where it needs to be a strong protection.  Such jobs prefer oom kill
> to reclaim.  These jobs know they need X MB of memory.  But I guess it's on
> me to avoid configuring machines which overcommit memory.min at such cgroup
> levels all the way to the root.

Absolutely.

> 
> > 2) If a cgroup has an ancestor with usage under its memory.low/min,
> >    it gains protection, even if its own memory.low/min is 0. So it becomes
> >    impossible to have unprotected cgroups in a protected sub-tree.
> 
> Fair point.
> 
> One use case is a non trivial job which has several memory accounting
> subcontainers.  Is there a way to only set memory.low at the top and have
> it offer protection to the whole job?
> The case I'm thinking of is:
> % cd /cgroup
> % echo +memory > cgroup.subtree_control
> % mkdir top
> % echo +memory > top/cgroup.subtree_control
> % mkdir top/part1 top/part2
> % echo 1GB > top/memory.min
> % (echo $BASHPID > top/part1/cgroup.procs && part1)
> % (echo $BASHPID > top/part2/cgroup.procs && part2)
> 
> Empirically it's been measured that the entire workload (/top) needs 1GB to
> perform well.  But we don't care how the memory is distributed between
> part1,part2.  Is the strategy for that to set /top.min, /top/part1.min, and
> /top/part2.min each to 1GB?

The problem is that right now we don't have an "undefined" value for
memory.min/low. The default value is 0, which means "no protection".
So there is no way for a user to express "whatever the parent cgroup wants".
It might be useful to introduce such a value, as other controllers
may benefit too. But that's a separate topic to discuss.

In your example, it's possible to achieve the requested behavior by setting
top.min to 1G and part1.min and part2.min to "max".
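
I.e., something like this, reusing the layout from your example (a
sketch; "max" is accepted here because memory_min_write() parses the
value with page_counter_memparse(buf, "max", ...)):

% echo 1G > top/memory.min
% echo max > top/part1/memory.min
% echo max > top/part2/memory.min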

> 
> What do you think about exposing emin and elow to user space?  I think that
> would reduce admin/user confusion in situations where memory.min is
> internally discounted.

They might be useful in some cases (e.g. a cgroup wants to know how much
actual protection it can get), but at the same time these values are
intentionally racy and don't have clear semantics.
So, maybe we can show them in memory.stat, but I doubt that they deserve
a separate interface file.

> 
> (tangent) Delegation in v2 isn't something I've been able to fully
> internalize yet.
> The "no interior processes" rule challenges my notion of subdelegation.
> My current model is where a system controller creates a container C with
> C.min and then starts client manager process M in C.  Then M can choose
> to further divide C's resources (e.g. C/S).  This doesn't seem possible
> because v2 doesn't allow for interior processes.  So the system manager
> would need to create C, set C.low, create C/sub_manager, create
> C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start
> M in C/sub_manager.  Then sub_manager can create and manage
> C/sub_resources/S.

And this is a good example of a case where some cgroups in the tree
should be protected to work properly (for example, C/sub_manager/memory.low = 128M),
while the actual workload might not be (C/sub_resources/memory.low = 0).

Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-04-24 10:10 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20 22:33 [RFC] mm: memory.low heirarchical behavior Roman Gushchin
2018-03-21 18:23 ` Johannes Weiner
2018-03-21 19:08   ` Roman Gushchin
2018-04-04 17:07     ` Johannes Weiner
2018-04-05 13:54       ` Roman Gushchin
2018-04-05 15:00         ` Johannes Weiner
2018-03-23 16:37   ` [PATCH v2] mm: memory.low hierarchical behavior Roman Gushchin
2018-04-22 20:26 ` [RFC PATCH 0/2] memory.low,min reclaim Greg Thelen
2018-04-22 20:26   ` [RFC PATCH 1/2] memcg: fix memory.low Greg Thelen
2018-04-22 20:26   ` [RFC PATCH 2/2] memcg: add memory.min Greg Thelen
2018-04-23 10:38   ` [RFC PATCH 0/2] memory.low,min reclaim Roman Gushchin
2018-04-24  0:56     ` Greg Thelen
2018-04-24 10:09       ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).