linux-kernel.vger.kernel.org archive mirror
* [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Johannes Weiner @ 2022-05-18 19:09 UTC
  To: Dave Hansen, Huang, Ying, Yang Shi, Andrew Morton
  Cc: linux-mm, cgroups, linux-kernel, kernel-team, Zi Yan,
	Michal Hocko, Shakeel Butt, Roman Gushchin

This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.

Its premise was that cgroup reclaim cares about freeing memory inside
the cgroup, and demotion just moves pages around within the cgroup
limit. Hence, pages from toptier nodes should be reclaimed directly.

However, with NUMA balancing now doing tier promotions, demotion is
part of the page aging process. Global reclaim demotes the coldest
toptier pages to secondary memory, where their life continues and from
which they have a chance to get promoted back. Essentially, tiered
memory systems have an LRU order that spans multiple nodes.

When cgroup reclaim frees pages coming off the toptier directly, there
can be colder pages on lower tier nodes that were demoted by global
reclaim. This is an aging inversion, not unlike cgroups reclaiming
directly from the active lists while there are still inactive pages.

Proactive reclaim is another factor. Its goal is to offload colder
pages from expensive RAM to cheaper storage. When lower tier memory is
available as an intermediate layer, we want offloading to take
advantage of it instead of bypassing straight to storage.

Revert the patch so that cgroups respect the LRU order spanning the
memory hierarchy.

Of note is a specific undercommit scenario, where all cgroup limits in
the system add up to <= available toptier memory. In that case,
shuffling pages out to lower tiers first to reclaim them from there is
inefficient. This is something that could be optimized/short-circuited
later on (although care must be taken not to accidentally recreate the
aging inversion). Let's ensure correctness first.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
---
 mm/vmscan.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6918fff06e1..7a4090712177 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -528,13 +528,8 @@ static bool can_demote(int nid, struct scan_control *sc)
 {
 	if (!numa_demotion_enabled)
 		return false;
-	if (sc) {
-		if (sc->no_demotion)
-			return false;
-		/* It is pointless to do demotion in memcg reclaim */
-		if (cgroup_reclaim(sc))
-			return false;
-	}
+	if (sc && sc->no_demotion)
+		return false;
 	if (next_demotion_node(nid) == NUMA_NO_NODE)
 		return false;
 
-- 
2.36.1
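
For context, here is a minimal sketch of the demotion step that memcg
reclaim regains with this revert. It is loosely modeled on the v5.18
mm/vmscan.c code; can_demote() and next_demotion_node() are the real
helpers touched by the diff above, while migrate_to_demotion_target()
is a hypothetical stand-in for the batched migrate_pages() call done
by demote_page_list():

/*
 * Minimal illustrative sketch, not verbatim kernel code. With the
 * revert, can_demote() no longer bails out for cgroup_reclaim(sc),
 * so memcg reclaim ages pages through lower tiers just like global
 * reclaim does.
 */
static bool try_demote_instead_of_reclaim(struct page *page,
                                          struct pglist_data *pgdat,
                                          struct scan_control *sc)
{
        int target_nid;

        if (!can_demote(pgdat->node_id, sc))
                return false;

        target_nid = next_demotion_node(pgdat->node_id);
        if (target_nid == NUMA_NO_NODE)
                return false;

        /*
         * Move the cold page one tier down. It stays on an LRU there
         * and can be promoted back by NUMA balancing if it turns out
         * to be hot; only pages aging off the lowest tier get swapped
         * or discarded.
         */
        return migrate_to_demotion_target(page, target_nid); /* hypothetical */
}

Only when this path is unavailable does a page proceed to the usual
pageout/swap path, which is the LRU order spanning nodes that the
commit message describes.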



* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Dave Hansen @ 2022-05-18 19:51 UTC
  To: Johannes Weiner, Dave Hansen, Huang, Ying, Yang Shi, Andrew Morton
  Cc: linux-mm, cgroups, linux-kernel, kernel-team, Zi Yan,
	Michal Hocko, Shakeel Butt, Roman Gushchin

On 5/18/22 12:09, Johannes Weiner wrote:
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.

Thanks for the detailed explanation.  Ultimately, this was just intended
as an optimization to make cgroup reclaim more efficient.  But, I agree
that ordering correctness should trump efficiency.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>



* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Yang Shi @ 2022-05-18 20:42 UTC
  To: Johannes Weiner
  Cc: Dave Hansen, Huang, Ying, Yang Shi, Andrew Morton, Linux MM,
	Cgroups, Linux Kernel Mailing List, Kernel Team, Zi Yan,
	Michal Hocko, Shakeel Butt, Roman Gushchin

On Wed, May 18, 2022 at 12:09 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
>
> Its premise was that cgroup reclaim cares about freeing memory inside
> the cgroup, and demotion just moves pages around within the cgroup
> limit. Hence, pages from toptier nodes should be reclaimed directly.

Yes, exactly.

>
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.
>
> When cgroup reclaim frees pages coming off the toptier directly, there
> can be colder pages on lower tier nodes that were demoted by global
> reclaim. This is an aging inversion, not unlike cgroups reclaiming
> directly from the active lists while there are still inactive pages.

Thanks for pointing this out, makes sense to me.

>
> Proactive reclaim is another factor. Its goal is to offload colder
> pages from expensive RAM to cheaper storage. When lower tier memory is
> available as an intermediate layer, we want offloading to take
> advantage of it instead of bypassing straight to storage.
>
> Revert the patch so that cgroups respect the LRU order spanning the
> memory hierarchy.
>
> Of note is a specific undercommit scenario, where all cgroup limits in
> the system add up to <= available toptier memory. In that case,
> shuffling pages out to lower tiers first to reclaim them from there is
> inefficient. This is something that could be optimized/short-circuited
> later on (although care must be taken not to accidentally recreate the
> aging inversion). Let's ensure correctness first.

Some side effects we might keep an eye on with this revert:
  - Limit reclaim may experience longer latency, since it has to do
demotion + reclaim to uncharge enough memory
  - Higher max usage due to the force charge from migration (other
migrations, e.g. NUMA faults, can have a similar effect, but this adds
one more contributing factor); see the sketch below

Hopefully they won't be noticeable, but I tend to agree that keeping
aging correct is more important.
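
A minimal sketch of the force charge mentioned in the second bullet,
loosely modeled on mem_cgroup_migrate() in mm/memcontrol.c around
v5.18 (simplified; memsw accounting, statistics and error handling
omitted):

/*
 * Illustrative sketch, not verbatim kernel code. During migration
 * (including demotion) the destination page is charged with
 * page_counter_charge(), which cannot fail and does not enforce the
 * limit. Until the source page is freed, both copies are charged,
 * so peak usage can briefly overshoot.
 */
static void migrate_charge_sketch(struct page *oldpage, struct page *newpage)
{
        struct mem_cgroup *memcg = page_memcg(oldpage);
        long nr_pages = thp_nr_pages(newpage);

        if (!memcg || mem_cgroup_is_root(memcg))
                return;

        /* Force-charge the new page; no try_charge(), no failure. */
        page_counter_charge(&memcg->memory, nr_pages);

        /* The old page's charge is only dropped once it is freed. */
}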

Reviewed-by: Yang Shi <shy828301@gmail.com>

>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>
> ---
>  mm/vmscan.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c6918fff06e1..7a4090712177 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -528,13 +528,8 @@ static bool can_demote(int nid, struct scan_control *sc)
>  {
>         if (!numa_demotion_enabled)
>                 return false;
> -       if (sc) {
> -               if (sc->no_demotion)
> -                       return false;
> -               /* It is pointless to do demotion in memcg reclaim */
> -               if (cgroup_reclaim(sc))
> -                       return false;
> -       }
> +       if (sc && sc->no_demotion)
> +               return false;
>         if (next_demotion_node(nid) == NUMA_NO_NODE)
>                 return false;
>
> --
> 2.36.1
>
>


* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Roman Gushchin @ 2022-05-18 21:50 UTC
  To: Johannes Weiner
  Cc: Dave Hansen, Huang, Ying, Yang Shi, Andrew Morton, linux-mm,
	cgroups, linux-kernel, kernel-team, Zi Yan, Michal Hocko,
	Shakeel Butt, Roman Gushchin

On Wed, May 18, 2022 at 03:09:11PM -0400, Johannes Weiner wrote:
> This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
> 
> Its premise was that cgroup reclaim cares about freeing memory inside
> the cgroup, and demotion just moves pages around within the cgroup
> limit. Hence, pages from toptier nodes should be reclaimed directly.
> 
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.
> 
> When cgroup reclaim frees pages coming off the toptier directly, there
> can be colder pages on lower tier nodes that were demoted by global
> reclaim. This is an aging inversion, not unlike cgroups reclaiming
> directly from the active lists while there are still inactive pages.
> 
> Proactive reclaim is another factor. Its goal is to offload colder
> pages from expensive RAM to cheaper storage. When lower tier memory is
> available as an intermediate layer, we want offloading to take
> advantage of it instead of bypassing straight to storage.
> 
> Revert the patch so that cgroups respect the LRU order spanning the
> memory hierarchy.
> 
> Of note is a specific undercommit scenario, where all cgroup limits in
> the system add up to <= available toptier memory. In that case,
> shuffling pages out to lower tiers first to reclaim them from there is
> inefficient. This is something that could be optimized/short-circuited
> later on (although care must be taken not to accidentally recreate the
> aging inversion). Let's ensure correctness first.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Shakeel Butt @ 2022-05-19  5:11 UTC
  To: Johannes Weiner
  Cc: Dave Hansen, Huang Ying, Yang Shi, Andrew Morton, linux-mm,
	cgroups, linux-kernel, kernel-team, Zi Yan, Michal Hocko,
	Roman Gushchin

On Wed, May 18, 2022 at 03:09:11PM -0400, Johannes Weiner wrote:
> This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
> 
> Its premise was that cgroup reclaim cares about freeing memory inside
> the cgroup, and demotion just moves pages around within the cgroup
> limit. Hence, pages from toptier nodes should be reclaimed directly.
> 
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.
> 
> When cgroup reclaim frees pages coming off the toptier directly, there
> can be colder pages on lower tier nodes that were demoted by global
> reclaim. This is an aging inversion, not unlike cgroups reclaiming
> directly from the active lists while there are still inactive pages.
> 
> Proactive reclaim is another factor. Its goal is to offload colder
> pages from expensive RAM to cheaper storage. When lower tier memory is
> available as an intermediate layer, we want offloading to take
> advantage of it instead of bypassing straight to storage.
> 
> Revert the patch so that cgroups respect the LRU order spanning the
> memory hierarchy.
> 
> Of note is a specific undercommit scenario, where all cgroup limits in
> the system add up to <= available toptier memory. In that case,
> shuffling pages out to lower tiers first to reclaim them from there is
> inefficient. This is something that could be optimized/short-circuited
> later on (although care must be taken not to accidentally recreate the
> aging inversion). Let's ensure correctness first.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: ying.huang @ 2022-05-19  7:42 UTC
  To: Johannes Weiner, Dave Hansen, Yang Shi, Andrew Morton
  Cc: linux-mm, cgroups, linux-kernel, kernel-team, Zi Yan,
	Michal Hocko, Shakeel Butt, Roman Gushchin, Tim Chen

On Wed, 2022-05-18 at 15:09 -0400, Johannes Weiner wrote:
> This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
> 
> Its premise was that cgroup reclaim cares about freeing memory inside
> the cgroup, and demotion just moves pages around within the cgroup
> limit. Hence, pages from toptier nodes should be reclaimed directly.
> 
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.
> 
> When cgroup reclaim frees pages coming off the toptier directly, there
> can be colder pages on lower tier nodes that were demoted by global
> reclaim. This is an aging inversion, not unlike cgroups reclaiming
> directly from the active lists while there are still inactive pages.
> 
> Proactive reclaim is another factor. Its goal is to offload colder
> pages from expensive RAM to cheaper storage. When lower tier memory is
> available as an intermediate layer, we want offloading to take
> advantage of it instead of bypassing straight to storage.
> 
> Revert the patch so that cgroups respect the LRU order spanning the
> memory hierarchy.
> 
> Of note is a specific undercommit scenario, where all cgroup limits in
> the system add up to <= available toptier memory. In that case,
> shuffling pages out to lower tiers first to reclaim them from there is
> inefficient. This is something that could be optimized/short-circuited
> later on (although care must be taken not to accidentally recreate the
> aging inversion). Let's ensure correctness first.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

This is also required by Tim's DRAM partitioning among cgroups in
tiered systems.

Best Regards,
Huang, Ying

> ---
>  mm/vmscan.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c6918fff06e1..7a4090712177 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -528,13 +528,8 @@ static bool can_demote(int nid, struct scan_control *sc)
>  {
>  	if (!numa_demotion_enabled)
>  		return false;
> -	if (sc) {
> -		if (sc->no_demotion)
> -			return false;
> -		/* It is pointless to do demotion in memcg reclaim */
> -		if (cgroup_reclaim(sc))
> -			return false;
> -	}
> +	if (sc && sc->no_demotion)
> +		return false;
>  	if (next_demotion_node(nid) == NUMA_NO_NODE)
>  		return false;
>  
> 
> 
> 




* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Muchun Song @ 2022-05-19  8:53 UTC
  To: Johannes Weiner
  Cc: Dave Hansen, Huang, Ying, Yang Shi, Andrew Morton, linux-mm,
	cgroups, linux-kernel, kernel-team, Zi Yan, Michal Hocko,
	Shakeel Butt, Roman Gushchin

On Wed, May 18, 2022 at 03:09:11PM -0400, Johannes Weiner wrote:
> This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
> 
> Its premise was that cgroup reclaim cares about freeing memory inside
> the cgroup, and demotion just moves pages around within the cgroup
> limit. Hence, pages from toptier nodes should be reclaimed directly.
> 
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.
> 
> When cgroup reclaim frees pages coming off the toptier directly, there
> can be colder pages on lower tier nodes that were demoted by global
> reclaim. This is an aging inversion, not unlike cgroups reclaiming
> directly from the active lists while there are still inactive pages.
> 
> Proactive reclaim is another factor. Its goal is to offload colder
> pages from expensive RAM to cheaper storage. When lower tier memory is
> available as an intermediate layer, we want offloading to take
> advantage of it instead of bypassing straight to storage.
> 
> Revert the patch so that cgroups respect the LRU order spanning the
> memory hierarchy.
> 
> Of note is a specific undercommit scenario, where all cgroup limits in
> the system add up to <= available toptier memory. In that case,
> shuffling pages out to lower tiers first to reclaim them from there is
> inefficient. This is something that could be optimized/short-circuited
> later on (although care must be taken not to accidentally recreate the
> aging inversion). Let's ensure correctness first.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.


* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Michal Hocko @ 2022-05-19  9:51 UTC
  To: Johannes Weiner
  Cc: Dave Hansen, Huang, Ying, Yang Shi, Andrew Morton, linux-mm,
	cgroups, linux-kernel, kernel-team, Zi Yan, Shakeel Butt,
	Roman Gushchin

On Wed 18-05-22 15:09:11, Johannes Weiner wrote:
> This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
> 
> Its premise was that cgroup reclaim cares about freeing memory inside
> the cgroup, and demotion just moves pages around within the cgroup
> limit. Hence, pages from toptier nodes should be reclaimed directly.
> 
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.
> 
> When cgroup reclaim frees pages coming off the toptier directly, there
> can be colder pages on lower tier nodes that were demoted by global
> reclaim. This is an aging inversion, not unlike cgroups reclaiming
> directly from the active lists while there are still inactive pages.
> 
> Proactive reclaim is another factor. Its goal is to offload colder
> pages from expensive RAM to cheaper storage. When lower tier memory is
> available as an intermediate layer, we want offloading to take
> advantage of it instead of bypassing straight to storage.
> 
> Revert the patch so that cgroups respect the LRU order spanning the
> memory hierarchy.

I do agree with your reasoning.

> Of note is a specific undercommit scenario, where all cgroup limits in
> the system add up to <= available toptier memory. In that case,
> shuffling pages out to lower tiers first to reclaim them from there is
> inefficient. This is something that could be optimized/short-circuited
> later on (although care must be taken not to accidentally recreate the
> aging inversion). Let's ensure correctness first.

My slight concern with demotion is that there is no actual "guarantee"
to reclaim any charges, which try_charge depends on to make forward
progress. I suspect this is a rather unlikely situation, though. The
last tier (without any fallback) should have some memory to reclaim
most of the time. Retries should push some pages out; low effort
allocation requests like GFP_NORETRY might fail, but callers should be
prepared for that.

All that being said, the aging inversion is a much more real problem
than this.
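
For reference, a condensed sketch of the retry logic in question,
loosely following try_charge() in mm/memcontrol.c around v5.18 (the
helper names are real, but the retry details are approximate):

/*
 * Illustrative sketch, not verbatim kernel code. Forward progress
 * depends on reclaim actually uncharging pages; a pass that only
 * demotes moves pages to a lower tier without uncharging, so the
 * retries below are what eventually push them out of the cgroup.
 */
static int try_charge_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask,
                             unsigned int nr_pages)
{
        struct page_counter *counter;
        int nr_retries = MAX_RECLAIM_RETRIES;

        while (!page_counter_try_charge(&memcg->memory, nr_pages, &counter)) {
                unsigned long reclaimed;

                reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                                         gfp_mask, true);

                /* Low effort requests give up early; callers must cope. */
                if (!reclaimed && (gfp_mask & __GFP_NORETRY))
                        return -ENOMEM;
                if (!nr_retries--)
                        return -ENOMEM;
        }
        return 0;
}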

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Roman Gushchin <guro@fb.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c6918fff06e1..7a4090712177 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -528,13 +528,8 @@ static bool can_demote(int nid, struct scan_control *sc)
>  {
>  	if (!numa_demotion_enabled)
>  		return false;
> -	if (sc) {
> -		if (sc->no_demotion)
> -			return false;
> -		/* It is pointless to do demotion in memcg reclaim */
> -		if (cgroup_reclaim(sc))
> -			return false;
> -	}
> +	if (sc && sc->no_demotion)
> +		return false;
>  	if (next_demotion_node(nid) == NUMA_NO_NODE)
>  		return false;
>  
> -- 
> 2.36.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
From: Tim Chen @ 2022-05-23 18:59 UTC
  To: ying.huang, Johannes Weiner, Dave Hansen, Yang Shi, Andrew Morton
  Cc: linux-mm, cgroups, linux-kernel, kernel-team, Zi Yan,
	Michal Hocko, Shakeel Butt, Roman Gushchin

On Thu, 2022-05-19 at 15:42 +0800, ying.huang@intel.com wrote:
> On Wed, 2022-05-18 at 15:09 -0400, Johannes Weiner wrote:
> > This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
> > 
> > Its premise was that cgroup reclaim cares about freeing memory inside
> > the cgroup, and demotion just moves pages around within the cgroup
> > limit. Hence, pages from toptier nodes should be reclaimed directly.
> > 
> > However, with NUMA balancing now doing tier promotions, demotion is
> > part of the page aging process. Global reclaim demotes the coldest
> > toptier pages to secondary memory, where their life continues and from
> > which they have a chance to get promoted back. Essentially, tiered
> > memory systems have an LRU order that spans multiple nodes.
> > 
> > When cgroup reclaim frees pages coming off the toptier directly, there
> > can be colder pages on lower tier nodes that were demoted by global
> > reclaim. This is an aging inversion, not unlike cgroups reclaiming
> > directly from the active lists while there are still inactive pages.
> > 
> > Proactive reclaim is another factor. Its goal is to offload colder
> > pages from expensive RAM to cheaper storage. When lower tier memory is
> > available as an intermediate layer, we want offloading to take
> > advantage of it instead of bypassing straight to storage.
> > 
> > Revert the patch so that cgroups respect the LRU order spanning the
> > memory hierarchy.
> > 
> > Of note is a specific undercommit scenario, where all cgroup limits in
> > the system add up to <= available toptier memory. In that case,
> > shuffling pages out to lower tiers first to reclaim them from there is
> > inefficient. This is something that could be optimized/short-circuited
> > later on (although care must be taken not to accidentally recreate the
> > aging inversion). Let's ensure correctness first.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: "Huang, Ying" <ying.huang@intel.com>
> > Cc: Yang Shi <yang.shi@linux.alibaba.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Cc: Roman Gushchin <guro@fb.com>
> 
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> 
> This is also required by Tim's DRAM partitioning among cgroups in
> tiered systems.

Yes, while testing cgroup demotion, I also had to revert
the commit in question.

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Best Regards,
> Huang, Ying
> 
> > ---
> >  mm/vmscan.c | 9 ++-------
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c6918fff06e1..7a4090712177 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -528,13 +528,8 @@ static bool can_demote(int nid, struct scan_control *sc)
> >  {
> >  	if (!numa_demotion_enabled)
> >  		return false;
> > -	if (sc) {
> > -		if (sc->no_demotion)
> > -			return false;
> > -		/* It is pointless to do demotion in memcg reclaim */
> > -		if (cgroup_reclaim(sc))
> > -			return false;
> > -	}
> > +	if (sc && sc->no_demotion)
> > +		return false;
> >  	if (next_demotion_node(nid) == NUMA_NO_NODE)
> >  		return false;
> >  
> > 
> > 
> > 
> 
> 


