mm: Proportional memory.{low,min} reclaim
diff mbox series

Message ID 20190124014455.GA6396@chrisdown.name
State In Next
Headers show
Series
  • mm: Proportional memory.{low,min} reclaim
Related show

Commit Message

Chris Down Jan. 24, 2019, 1:44 a.m. UTC
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection). While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim. This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.

This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information. Imagine the following
timeline, with the numbers the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
   scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even
  without this patch because of scan control priority, over-reclaim
  protection, etc. However, as shown by the tests at the end, these
  techniques don't sufficiently throttle such an extreme change in
  input, so cliff-like behaviour isn't really averted by their existence
  alone.

Here's an example of how this plays out in practice. At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study). In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating. This isn't an exact science -- memory usage
deemed "comfortable" will vary over time due to user behaviour,
differences in composition of work, etc, etc. As such we need to
ballpark memory.low, but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have *any*
   effect (see discussion above). The group will receive the full weight
   of reclaim and won't have any priority while competing with the less
   important system software, as if we had no memory.low configured at
   all.

2. Because of this behaviour, we end up erring on the side of setting it
   too high, such that the comfort range is reliably covered. However,
   protected memory is completely unavailable to the rest of the system,
   so we might cause undue memory and IO pressure there when we *know*
   we have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the
   comfort zone, we get extreme jumps between no pressure and full
   pressure that cause unpredictable pressure spikes in the workload due
   to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too
much worry. Any undesirable behaviour, such as too much or too little
reclaim pressure on the workload or system will be proportional to how
far our estimation is off. This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.

As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance.  Having to set these thresholds so high wastes
resources and generally works against the principle of work
conservation. In addition, having proportional memory reclaim behaviour
has other benefits. Most notably, before this patch it's basically
mandatory to set memory.low to a higher than desirable value because
otherwise as soon as you exceed memory.low, all protection is lost, and
all pages are eligible to scan again. By contrast, having a gradual ramp
in reclaim pressure means that you now still get some protection when
thresholds are exceeded, which means that one can now be more
comfortable setting memory.low to lower values without worrying that all
protection will be lost. This is important because workingset size is
really hard to know exactly, especially with variable workloads, so at
least getting *some* protection if your workingset size grows larger
than you expect increases user confidence in setting memory.low without
a huge buffer on top being needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 | N/A          |
   |        17G |             867 |               3799 | 23%          |
   |        12G |            1203 |               3543 | 34%          |
   |         8G |            2534 |               3979 | 64%          |
   |         4G |            3980 |               4147 | 96%          |
   |          0 |            3799 |               3980 | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this patch)
   ramps up page scanning significantly more gradually than the control
   kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention. This script runs in a
   highly protected slice taking up the majority of available system
   memory. Watching vmstat revealed that page scanning continued
   essentially nominally between test and control, without causing
   forward reclaim progress to become arrested.

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: kernel-team@fb.com
---
 Documentation/admin-guide/cgroup-v2.rst | 20 +++++--
 include/linux/memcontrol.h              | 17 ++++++
 mm/memcontrol.c                         |  5 ++
 mm/vmscan.c                             | 76 +++++++++++++++++++++++--
 4 files changed, 106 insertions(+), 12 deletions(-)

Comments

Roman Gushchin Jan. 28, 2019, 9 p.m. UTC | #1
On Wed, Jan 23, 2019 at 08:44:55PM -0500, Chris Down wrote:
> cgroup v2 introduces two memory protection thresholds: memory.low
> (best-effort) and memory.min (hard protection). While they generally do
> what they say on the tin, there is a limitation in their implementation
> that makes them difficult to use effectively: that cliff behaviour often
> manifests when they become eligible for reclaim. This patch implements
> more intuitive and usable behaviour, where we gradually mount more
> reclaim pressure as cgroups further and further exceed their protection
> thresholds.
> 
> This cliff edge behaviour happens because we only choose whether or not
> to reclaim based on whether the memcg is within its protection limits
> (see the use of mem_cgroup_protected in shrink_node), but we don't vary
> our reclaim behaviour based on this information. Imagine the following
> timeline, with the numbers the lruvec size in this zone:
> 
> 1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
> 2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
> 3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
>    scanned. (?!)
> 
> * Of course, we won't usually scan all available pages in the zone even
>   without this patch because of scan control priority, over-reclaim
>   protection, etc. However, as shown by the tests at the end, these
>   techniques don't sufficiently throttle such an extreme change in
>   input, so cliff-like behaviour isn't really averted by their existence
>   alone.
> 
> Here's an example of how this plays out in practice. At Facebook, we are
> trying to protect various workloads from "system" software, like
> configuration management tools, metric collectors, etc (see this[0] case
> study). In order to find a suitable memory.low value, we start by
> determining the expected memory range within which the workload will be
> comfortable operating. This isn't an exact science -- memory usage
> deemed "comfortable" will vary over time due to user behaviour,
> differences in composition of work, etc, etc. As such we need to
> ballpark memory.low, but doing this is currently problematic:
> 
> 1. If we end up setting it too low for the workload, it won't have *any*
>    effect (see discussion above). The group will receive the full weight
>    of reclaim and won't have any priority while competing with the less
>    important system software, as if we had no memory.low configured at
>    all.
> 
> 2. Because of this behaviour, we end up erring on the side of setting it
>    too high, such that the comfort range is reliably covered. However,
>    protected memory is completely unavailable to the rest of the system,
>    so we might cause undue memory and IO pressure there when we *know*
>    we have some elasticity in the workload.
> 
> 3. Even if we get the value totally right, smack in the middle of the
>    comfort zone, we get extreme jumps between no pressure and full
>    pressure that cause unpredictable pressure spikes in the workload due
>    to the current binary reclaim behaviour.
> 
> With this patch, we can set it to our ballpark estimation without too
> much worry. Any undesirable behaviour, such as too much or too little
> reclaim pressure on the workload or system will be proportional to how
> far our estimation is off. This means we can set memory.low much more
> conservatively and thus waste less resources *without* the risk of the
> workload falling off a cliff if we overshoot.
> 
> As a more abstract technical description, this unintuitive behaviour
> results in having to give high-priority workloads a large protection
> buffer on top of their expected usage to function reliably, as otherwise
> we have abrupt periods of dramatically increased memory pressure which
> hamper performance.  Having to set these thresholds so high wastes
> resources and generally works against the principle of work
> conservation. In addition, having proportional memory reclaim behaviour
> has other benefits. Most notably, before this patch it's basically
> mandatory to set memory.low to a higher than desirable value because
> otherwise as soon as you exceed memory.low, all protection is lost, and
> all pages are eligible to scan again. By contrast, having a gradual ramp
> in reclaim pressure means that you now still get some protection when
> thresholds are exceeded, which means that one can now be more
> comfortable setting memory.low to lower values without worrying that all
> protection will be lost. This is important because workingset size is
> really hard to know exactly, especially with variable workloads, so at
> least getting *some* protection if your workingset size grows larger
> than you expect increases user confidence in setting memory.low without
> a huge buffer on top being needed.
> 
> Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
> assistance in thinking about how to make this work better.
> 
> In testing these changes, I intended to verify that:
> 
> 1. Changes in page scanning become gradual and proportional instead of
>    binary.
> 
>    To test this, I experimented stepping further and further down
>    memory.low protection on a workload that floats around 19G workingset
>    when under memory.low protection, watching page scan rates for the
>    workload cgroup:
> 
>    +------------+-----------------+--------------------+--------------+
>    | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
>    +------------+-----------------+--------------------+--------------+
>    |        21G |               0 |                  0 | N/A          |
>    |        17G |             867 |               3799 | 23%          |
>    |        12G |            1203 |               3543 | 34%          |
>    |         8G |            2534 |               3979 | 64%          |
>    |         4G |            3980 |               4147 | 96%          |
>    |          0 |            3799 |               3980 | 95%          |
>    +------------+-----------------+--------------------+--------------+
> 
>    As you can see, the test kernel (with a kernel containing this patch)
>    ramps up page scanning significantly more gradually than the control
>    kernel (without this patch).
> 
> 2. More gradual ramp up in reclaim aggression doesn't result in
>    premature OOMs.
> 
>    To test this, I wrote a script that slowly increments the number of
>    pages held by stress(1)'s --vm-keep mode until a production system
>    entered severe overall memory contention. This script runs in a
>    highly protected slice taking up the majority of available system
>    memory. Watching vmstat revealed that page scanning continued
>    essentially nominally between test and control, without causing
>    forward reclaim progress to become arrested.
> 
> [0]: https://urldefense.proofpoint.com/v2/url?u=https-3A__facebookmicrosites.github.io_cgroup2_docs_overview.html-23case-2Dstudy-2Dthe-2Dfbtax2-2Dproject&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=i6WobKxbeG3slzHSIOxTVtYIJw7qjCE6S0spDTKL-J4&m=Mo0govWR0-jFjgSx4DTFpIgKfHsLPb-67tLa_ANbtX0&s=6QtuD2I9uTW8eIgzRdVj1uHtwCMj4mYa6wOxkc1bTm0&e=
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Dennis Zhou <dennis@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: kernel-team@fb.com
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 20 +++++--
>  include/linux/memcontrol.h              | 17 ++++++
>  mm/memcontrol.c                         |  5 ++
>  mm/vmscan.c                             | 76 +++++++++++++++++++++++--
>  4 files changed, 106 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 7bf3f129c68b..8ed408166890 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -606,8 +606,8 @@ on an IO device and is an example of this type.
>  Protections
>  -----------
>  
> -A cgroup is protected to be allocated upto the configured amount of
> -the resource if the usages of all its ancestors are under their
> +A cgroup is protected upto the configured amount of the resource
> +as long as the usages of all its ancestors are under their
>  protected levels.  Protections can be hard guarantees or best effort
>  soft boundaries.  Protections can also be over-committed in which case
>  only upto the amount available to the parent is protected among
> @@ -1020,7 +1020,10 @@ PAGE_SIZE multiple when read back.
>  	is within its effective min boundary, the cgroup's memory
>  	won't be reclaimed under any conditions. If there is no
>  	unprotected reclaimable memory available, OOM killer
> -	is invoked.
> +	is invoked. Above the effective min boundary (or
> +	effective low boundary if it is higher), pages are reclaimed
> +	proportionally to the overage, reducing reclaim pressure for
> +	smaller overages.
>  
>         Effective min boundary is limited by memory.min values of
>  	all ancestor cgroups. If there is memory.min overcommitment
> @@ -1042,7 +1045,10 @@ PAGE_SIZE multiple when read back.
>  	Best-effort memory protection.  If the memory usage of a
>  	cgroup is within its effective low boundary, the cgroup's
>  	memory won't be reclaimed unless memory can be reclaimed
> -	from unprotected cgroups.
> +	from unprotected cgroups.  Above the effective low boundary (or
> +	effective min boundary if it is higher), pages are reclaimed
> +	proportionally to the overage, reducing reclaim pressure for
> +	smaller overages.
>  
>  	Effective low boundary is limited by memory.low values of
>  	all ancestor cgroups. If there is memory.low overcommitment
> @@ -2283,8 +2289,10 @@ system performance due to overreclaim, to the point where the feature
>  becomes self-defeating.
>  
>  The memory.low boundary on the other hand is a top-down allocated
> -reserve.  A cgroup enjoys reclaim protection when it's within its low,
> -which makes delegation of subtrees possible.
> +reserve.  A cgroup enjoys reclaim protection when it's within its
> +effective low, which makes delegation of subtrees possible. It also
> +enjoys having reclaim pressure proportional to its overage when
> +above its effective low.
>  
>  The original high boundary, the hard limit, is defined as a strict
>  limit that can not budge, even if the OOM killer has to be called.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b0eb29ea0d9c..290cfbfd60cd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -333,6 +333,11 @@ static inline bool mem_cgroup_disabled(void)
>  	return !cgroup_subsys_enabled(memory_cgrp_subsys);
>  }
>  
> +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg)
> +{
> +	return max(READ_ONCE(memcg->memory.emin), READ_ONCE(memcg->memory.elow));
> +}
> +
>  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  						struct mem_cgroup *memcg);
>  
> @@ -526,6 +531,8 @@ void mem_cgroup_handle_over_high(void);
>  
>  unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
>  
> +unsigned long mem_cgroup_size(struct mem_cgroup *memcg);
> +
>  void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,
>  				struct task_struct *p);
>  
> @@ -819,6 +826,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  {
>  }
>  
> +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg)
> +{
> +	return 0;
> +}
> +
>  static inline enum mem_cgroup_protection mem_cgroup_protected(
>  	struct mem_cgroup *root, struct mem_cgroup *memcg)
>  {
> @@ -971,6 +983,11 @@ static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg)
>  	return 0;
>  }
>  
> +static inline unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
> +{
> +	return 0;
> +}
> +
>  static inline void
>  mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 18f4aefbe0bf..1d2b2aaf124d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1377,6 +1377,11 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg)
>  	return max;
>  }
>  
> +unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
> +{
> +	return page_counter_read(&memcg->memory);
> +}
> +
>  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  				     int order)
>  {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a714c4f800e9..638c3655dc4b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2445,17 +2445,74 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	*lru_pages = 0;
>  	for_each_evictable_lru(lru) {
>  		int file = is_file_lru(lru);
> -		unsigned long size;
> +		unsigned long lruvec_size;
>  		unsigned long scan;
> +		unsigned long protection;
> +
> +		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> +		protection = mem_cgroup_protection(memcg);
> +
> +		if (protection > 0) {
> +			/*
> +			 * Scale a cgroup's reclaim pressure by proportioning its current
> +			 * usage to its memory.low or memory.min setting.
> +			 *
> +			 * This is important, as otherwise scanning aggression becomes
> +			 * extremely binary -- from nothing as we approach the memory
> +			 * protection threshold, to totally nominal as we exceed it. This
> +			 * results in requiring setting extremely liberal protection
> +			 * thresholds. It also means we simply get no protection at all if
> +			 * we set it too low, which is not ideal.
> +			 */
> +			unsigned long cgroup_size = mem_cgroup_size(memcg);
> +			unsigned long baseline = 0;
> +
> +			/*
> +			 * During the reclaim first pass, we only consider cgroups in
> +			 * excess of their protection setting, but if that doesn't produce
> +			 * free pages, we come back for a second pass where we reclaim from
> +			 * all groups.
> +			 *
> +			 * To maintain fairness in both cases, the first pass targets
> +			 * groups in proportion to their overage, and the second pass
> +			 * targets groups in proportion to their protection utilization.
> +			 *
> +			 * So on the first pass, a group whose size is 130% of its
> +			 * protection will be targeted at 30% of its size. On the second
> +			 * pass, a group whose size is at 40% of its protection will be
> +			 * targeted at 40% of its size.
> +			 */
> +			if (!sc->memcg_low_reclaim)
> +				baseline = lruvec_size;
> +			scan = lruvec_size * cgroup_size / protection - baseline;

Hm, it looks a bit suspicious to me.

Let's say memory.low = 3G, memory.min = 1G and memory.current = 2G.
cgroup_size / protection == 1, so scan doesn't depend on memory.min at all.

So, we need to look directly at memory.emin in memcg_low_reclaim case, and
ignore memory.(e)low.

> +
> +			/*
> +			 * Don't allow the scan target to exceed the lruvec size, which
> +			 * otherwise could happen if we have >200% overage in the normal
> +			 * case, or >100% overage when sc->memcg_low_reclaim is set.
> +			 *
> +			 * This is important because other cgroups without memory.low have
> +			 * their scan target initially set to their lruvec size, so
> +			 * allowing values >100% of the lruvec size here could result in
> +			 * penalising cgroups with memory.low set even *more* than their
> +			 * peers in some cases in the case of large overages.
> +			 *
> +			 * Also, minimally target SWAP_CLUSTER_MAX pages to keep reclaim
> +			 * moving forwards.
> +			 */
> +			scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);

Idk, how much sense does it have to make it larger than SWAP_CLUSTER_MAX,
given that it will become 0 on default (and almost any other) priority.


> +		} else {
> +			scan = lruvec_size;
> +		}
> +
> +		scan >>= sc->priority;
>  
> -		size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> -		scan = size >> sc->priority;
>  		/*
>  		 * If the cgroup's already been deleted, make sure to
>  		 * scrape out the remaining cache.
>  		 */
>  		if (!scan && !mem_cgroup_online(memcg))
> -			scan = min(size, SWAP_CLUSTER_MAX);
> +			scan = min(lruvec_size, SWAP_CLUSTER_MAX);
>  
>  		switch (scan_balance) {
>  		case SCAN_EQUAL:
> @@ -2475,7 +2532,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  		case SCAN_ANON:
>  			/* Scan one type exclusively */
>  			if ((scan_balance == SCAN_FILE) != file) {
> -				size = 0;
> +				lruvec_size = 0;
>  				scan = 0;
>  			}
>  			break;
> @@ -2484,7 +2541,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  			BUG();
>  		}
>  
> -		*lru_pages += size;
> +		*lru_pages += lruvec_size;
>  		nr[lru] = scan;
>  	}
>  }
> @@ -2745,6 +2802,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  				memcg_memory_event(memcg, MEMCG_LOW);
>  				break;
>  			case MEMCG_PROT_NONE:
> +				/*
> +				 * All protection thresholds breached.

Or never set.

> We may
> +				 * still choose to vary the scan pressure
> +				 * applied based on by how much the cgroup in
> +				 * question has exceeded its protection
> +				 * thresholds (see get_scan_count).
> +				 */
>  				break;
>  			}
>  
> -- 
> 2.20.1
>
Chris Down Jan. 28, 2019, 9:42 p.m. UTC | #2
Roman Gushchin writes:
>Hm, it looks a bit suspicious to me.
>
>Let's say memory.low = 3G, memory.min = 1G and memory.current = 2G.
>cgroup_size / protection == 1, so scan doesn't depend on memory.min at all.
>
>So, we need to look directly at memory.emin in memcg_low_reclaim case, and
>ignore memory.(e)low.

Hmm, this isn't really a common situation that I'd thought about, but it seems 
reasonable to make the boundaries when in low reclaim to be between min and 
low, rather than 0 and low. I'll add another patch with that. Thanks

>> +			scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);
>
>Idk, how much sense does it have to make it larger than SWAP_CLUSTER_MAX,
>given that it will become 0 on default (and almost any other) priority.

In my testing, setting the scan target to 0 and thus reducing scope for reclaim 
can result in increasing the scan priority more than is desirable, and since we 
base some vm heuristics based on that, that seemed concerning.

I'd rather start being a bit more cautious, erring on the side of scanning at 
least some pages from this memcg when priority gets elevated.

Thanks for the review!
Roman Gushchin Jan. 28, 2019, 9:52 p.m. UTC | #3
On Mon, Jan 28, 2019 at 04:42:13PM -0500, Chris Down wrote:
> Roman Gushchin writes:
> > Hm, it looks a bit suspicious to me.
> > 
> > Let's say memory.low = 3G, memory.min = 1G and memory.current = 2G.
> > cgroup_size / protection == 1, so scan doesn't depend on memory.min at all.
> > 
> > So, we need to look directly at memory.emin in memcg_low_reclaim case, and
> > ignore memory.(e)low.
> 
> Hmm, this isn't really a common situation that I'd thought about, but it
> seems reasonable to make the boundaries when in low reclaim to be between
> min and low, rather than 0 and low. I'll add another patch with that. Thanks

It's not a stopper, so I'm perfectly fine with a follow-up patch.

> 
> > > +			scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);
> > 
> > Idk, how much sense does it have to make it larger than SWAP_CLUSTER_MAX,
> > given that it will become 0 on default (and almost any other) priority.
> 
> In my testing, setting the scan target to 0 and thus reducing scope for
> reclaim can result in increasing the scan priority more than is desirable,
> and since we base some vm heuristics based on that, that seemed concerning.
> 
> I'd rather start being a bit more cautious, erring on the side of scanning
> at least some pages from this memcg when priority gets elevated.
> 
> Thanks for the review!

For the rest of the patch:
Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!
Andrew Morton July 15, 2019, 10:35 p.m. UTC | #4
On Mon, 28 Jan 2019 21:52:40 +0000 Roman Gushchin <guro@fb.com> wrote:

> > Hmm, this isn't really a common situation that I'd thought about, but it
> > seems reasonable to make the boundaries when in low reclaim to be between
> > min and low, rather than 0 and low. I'll add another patch with that. Thanks
>
> It's not a stopper, so I'm perfectly fine with a follow-up patch.

Did this happen?


I'm still trying to get this five month old patchset unstuck :(.  The
review status is: 

[1/3] mm, memcg: proportional memory.{low,min} reclaim
Acked-by: Johannes
Reviewed-by: Roman

[2/3] mm, memcg: make memory.emin the baseline for utilisation determination
Acked-by: Johannes

[3/3] mm, memcg: make scan aggression always exclude protection
Reviewed-by: Roman


I do have a note here that mhocko intended to take a closer look but I
don't recall whether that happened.

I could

a) say what the hell and merge them or
b) sit on them for another cycle or
c) drop them and ask Chris for a resend so we can start again.
Chris Down July 15, 2019, 10:57 p.m. UTC | #5
Hey Andrew,

Andrew Morton writes:
>On Mon, 28 Jan 2019 21:52:40 +0000 Roman Gushchin <guro@fb.com> wrote:
>
>> > Hmm, this isn't really a common situation that I'd thought about, but it
>> > seems reasonable to make the boundaries when in low reclaim to be between
>> > min and low, rather than 0 and low. I'll add another patch with that. Thanks
>>
>> It's not a stopper, so I'm perfectly fine with a follow-up patch.
>
>Did this happen?

Yes, that's "mm, memcg: make memory.emin the baseline for utilisation 
determination" :-)

>I'm still trying to get this five month old patchset unstuck :(.

Thank you for your help. The patches are stable and proven to do what they're 
intended to do at scale (both shown by the test results, and production use 
inside FB at scale).

>I do have a note here that mhocko intended to take a closer look but I
>don't recall whether that happened.
>
>I could
>
>a) say what the hell and merge them or
>b) sit on them for another cycle or
>c) drop them and ask Chris for a resend so we can start again.

Is there any reason to resend? As far as I know these patches are good to go.  
I'm happy to rebase them, as long as it doesn't extend the time they're being 
sat on. I don't see anything changing before the next release, though, and I 
feel any reviews are clearly not coming at this series with any urgency.

Thanks for the poke on this, I appreciate it.
Johannes Weiner July 16, 2019, 5:24 p.m. UTC | #6
On Mon, Jul 15, 2019 at 03:35:27PM -0700, Andrew Morton wrote:
> On Mon, 28 Jan 2019 21:52:40 +0000 Roman Gushchin <guro@fb.com> wrote:
> 
> > > Hmm, this isn't really a common situation that I'd thought about, but it
> > > seems reasonable to make the boundaries when in low reclaim to be between
> > > min and low, rather than 0 and low. I'll add another patch with that. Thanks
> >
> > It's not a stopper, so I'm perfectly fine with a follow-up patch.
> 
> Did this happen?
> 
> I'm still trying to get this five month old patchset unstuck :(.  The
> review status is: 
> 
> [1/3] mm, memcg: proportional memory.{low,min} reclaim
> Acked-by: Johannes
> Reviewed-by: Roman
> 
> [2/3] mm, memcg: make memory.emin the baseline for utilisation determination
> Acked-by: Johannes
> 
> [3/3] mm, memcg: make scan aggression always exclude protection
> Reviewed-by: Roman

I forgot to send out the actual ack-tag on #, so I just did. I was
involved in the discussions that led to that patch, the code looks
good to me, and it's what we've been using internally for a while
without any hiccups.

> I do have a note here that mhocko intended to take a closer look but I
> don't recall whether that happened.

Michal acked #3 in 20190530065111.GC6703@dhcp22.suse.cz. Afaik not the
others, but #3 also doesn't make a whole lot of sense without #1...

> a) say what the hell and merge them or
> b) sit on them for another cycle or
> c) drop them and ask Chris for a resend so we can start again.

Michal, would you have time to take another look this week? Otherwise,
I think everyone who would review them has done so.

Patch
diff mbox series

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7bf3f129c68b..8ed408166890 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -606,8 +606,8 @@  on an IO device and is an example of this type.
 Protections
 -----------
 
-A cgroup is protected to be allocated upto the configured amount of
-the resource if the usages of all its ancestors are under their
+A cgroup is protected upto the configured amount of the resource
+as long as the usages of all its ancestors are under their
 protected levels.  Protections can be hard guarantees or best effort
 soft boundaries.  Protections can also be over-committed in which case
 only upto the amount available to the parent is protected among
@@ -1020,7 +1020,10 @@  PAGE_SIZE multiple when read back.
 	is within its effective min boundary, the cgroup's memory
 	won't be reclaimed under any conditions. If there is no
 	unprotected reclaimable memory available, OOM killer
-	is invoked.
+	is invoked. Above the effective min boundary (or
+	effective low boundary if it is higher), pages are reclaimed
+	proportionally to the overage, reducing reclaim pressure for
+	smaller overages.
 
        Effective min boundary is limited by memory.min values of
 	all ancestor cgroups. If there is memory.min overcommitment
@@ -1042,7 +1045,10 @@  PAGE_SIZE multiple when read back.
 	Best-effort memory protection.  If the memory usage of a
 	cgroup is within its effective low boundary, the cgroup's
 	memory won't be reclaimed unless memory can be reclaimed
-	from unprotected cgroups.
+	from unprotected cgroups.  Above the effective low boundary (or
+	effective min boundary if it is higher), pages are reclaimed
+	proportionally to the overage, reducing reclaim pressure for
+	smaller overages.
 
 	Effective low boundary is limited by memory.low values of
 	all ancestor cgroups. If there is memory.low overcommitment
@@ -2283,8 +2289,10 @@  system performance due to overreclaim, to the point where the feature
 becomes self-defeating.
 
 The memory.low boundary on the other hand is a top-down allocated
-reserve.  A cgroup enjoys reclaim protection when it's within its low,
-which makes delegation of subtrees possible.
+reserve.  A cgroup enjoys reclaim protection when it's within its
+effective low, which makes delegation of subtrees possible. It also
+enjoys having reclaim pressure proportional to its overage when
+above its effective low.
 
 The original high boundary, the hard limit, is defined as a strict
 limit that can not budge, even if the OOM killer has to be called.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b0eb29ea0d9c..290cfbfd60cd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -333,6 +333,11 @@  static inline bool mem_cgroup_disabled(void)
 	return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg)
+{
+	return max(READ_ONCE(memcg->memory.emin), READ_ONCE(memcg->memory.elow));
+}
+
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
 
@@ -526,6 +531,8 @@  void mem_cgroup_handle_over_high(void);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
 
+unsigned long mem_cgroup_size(struct mem_cgroup *memcg);
+
 void mem_cgroup_print_oom_context(struct mem_cgroup *memcg,
 				struct task_struct *p);
 
@@ -819,6 +826,11 @@  static inline void memcg_memory_event_mm(struct mm_struct *mm,
 {
 }
 
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline enum mem_cgroup_protection mem_cgroup_protected(
 	struct mem_cgroup *root, struct mem_cgroup *memcg)
 {
@@ -971,6 +983,11 @@  static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg)
 	return 0;
 }
 
+static inline unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline void
 mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18f4aefbe0bf..1d2b2aaf124d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1377,6 +1377,11 @@  unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg)
 	return max;
 }
 
+unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
+{
+	return page_counter_read(&memcg->memory);
+}
+
 static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a714c4f800e9..638c3655dc4b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2445,17 +2445,74 @@  static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	*lru_pages = 0;
 	for_each_evictable_lru(lru) {
 		int file = is_file_lru(lru);
-		unsigned long size;
+		unsigned long lruvec_size;
 		unsigned long scan;
+		unsigned long protection;
+
+		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
+		protection = mem_cgroup_protection(memcg);
+
+		if (protection > 0) {
+			/*
+			 * Scale a cgroup's reclaim pressure by proportioning its current
+			 * usage to its memory.low or memory.min setting.
+			 *
+			 * This is important, as otherwise scanning aggression becomes
+			 * extremely binary -- from nothing as we approach the memory
+			 * protection threshold, to totally nominal as we exceed it. This
+			 * results in requiring setting extremely liberal protection
+			 * thresholds. It also means we simply get no protection at all if
+			 * we set it too low, which is not ideal.
+			 */
+			unsigned long cgroup_size = mem_cgroup_size(memcg);
+			unsigned long baseline = 0;
+
+			/*
+			 * During the reclaim first pass, we only consider cgroups in
+			 * excess of their protection setting, but if that doesn't produce
+			 * free pages, we come back for a second pass where we reclaim from
+			 * all groups.
+			 *
+			 * To maintain fairness in both cases, the first pass targets
+			 * groups in proportion to their overage, and the second pass
+			 * targets groups in proportion to their protection utilization.
+			 *
+			 * So on the first pass, a group whose size is 130% of its
+			 * protection will be targeted at 30% of its size. On the second
+			 * pass, a group whose size is at 40% of its protection will be
+			 * targeted at 40% of its size.
+			 */
+			if (!sc->memcg_low_reclaim)
+				baseline = lruvec_size;
+			scan = lruvec_size * cgroup_size / protection - baseline;
+
+			/*
+			 * Don't allow the scan target to exceed the lruvec size, which
+			 * otherwise could happen if we have >200% overage in the normal
+			 * case, or >100% overage when sc->memcg_low_reclaim is set.
+			 *
+			 * This is important because other cgroups without memory.low have
+			 * their scan target initially set to their lruvec size, so
+			 * allowing values >100% of the lruvec size here could result in
+			 * penalising cgroups with memory.low set even *more* than their
+			 * peers in some cases in the case of large overages.
+			 *
+			 * Also, minimally target SWAP_CLUSTER_MAX pages to keep reclaim
+			 * moving forwards.
+			 */
+			scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);
+		} else {
+			scan = lruvec_size;
+		}
+
+		scan >>= sc->priority;
 
-		size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-		scan = size >> sc->priority;
 		/*
 		 * If the cgroup's already been deleted, make sure to
 		 * scrape out the remaining cache.
 		 */
 		if (!scan && !mem_cgroup_online(memcg))
-			scan = min(size, SWAP_CLUSTER_MAX);
+			scan = min(lruvec_size, SWAP_CLUSTER_MAX);
 
 		switch (scan_balance) {
 		case SCAN_EQUAL:
@@ -2475,7 +2532,7 @@  static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 		case SCAN_ANON:
 			/* Scan one type exclusively */
 			if ((scan_balance == SCAN_FILE) != file) {
-				size = 0;
+				lruvec_size = 0;
 				scan = 0;
 			}
 			break;
@@ -2484,7 +2541,7 @@  static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 			BUG();
 		}
 
-		*lru_pages += size;
+		*lru_pages += lruvec_size;
 		nr[lru] = scan;
 	}
 }
@@ -2745,6 +2802,13 @@  static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 				memcg_memory_event(memcg, MEMCG_LOW);
 				break;
 			case MEMCG_PROT_NONE:
+				/*
+				 * All protection thresholds breached. We may
+				 * still choose to vary the scan pressure
+				 * applied based on by how much the cgroup in
+				 * question has exceeded its protection
+				 * thresholds (see get_scan_count).
+				 */
 				break;
 			}