* [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure
@ 2020-09-03 23:00 Roman Gushchin
  2020-09-04  4:10 ` Andrew Morton
  2020-09-09 14:55 ` Johannes Weiner
  0 siblings, 2 replies; 6+ messages in thread
From: Roman Gushchin @ 2020-09-03 23:00 UTC
  To: Andrew Morton, linux-mm
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, kernel-team,
	linux-kernel, Roman Gushchin, Michal Hocko

In the memcg case count_shadow_nodes() sums the number of pages in lru
lists and the amount of slab memory (reclaimable and non-reclaimable)
as a baseline for the allowed number of shadow entries.

It seems to be a good analogy for the !memcg case, where
node_present_pages() is used. However, it's not quite true, as there
are two problems:

1) Due to slab reparenting introduced by commit fb2f2b0adb98 ("mm:
memcg/slab: reparent memcg kmem_caches on cgroup removal") local
per-lruvec slab counters might be inaccurate on non-leaf levels.
It's the only place where local slab counters are used.

2) Shadow nodes are themselves backed by slabs. So there is a circular
dependency: the more shadow entries there are, the less pressure the
kernel applies to reclaim them.

Fortunately, there is a simple way to solve both problems: slab
counters shouldn't be taken into account by count_shadow_nodes().
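
To illustrate the feedback loop in (2), here is a rough userspace sketch.
It assumes the shrinker target is derived as pages >> 3 (the formula
discussed later in this thread) and uses a made-up per-node slab cost;
it is not the kernel's actual code:

#include <stdio.h>

int main(void)
{
	unsigned long lru_pages = 100000;	/* pages on the lru lists */
	unsigned long shadow_nodes = 500000;	/* shadow nodes currently allocated */
	unsigned long node_bytes = 576;		/* made-up per-node slab cost */
	unsigned long page_size = 4096;

	unsigned long slab_pages = shadow_nodes * node_bytes / page_size;

	/* baseline with and without the slab contribution */
	unsigned long max_with_slab = (lru_pages + slab_pages) >> 3;
	unsigned long max_without_slab = lru_pages >> 3;

	/* counting slab raises the allowed number of shadow nodes */
	printf("target with slab counted:    %lu nodes\n", max_with_slab);
	printf("target without slab counted: %lu nodes\n", max_without_slab);
	return 0;
}

Counting slab memory here lets the shadow nodes' own footprint inflate
the baseline that limits them, which is the circular dependency described
above.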

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
---
 mm/workingset.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 92e66113a577..50d53f3699e4 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -495,10 +495,6 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);
-		pages += lruvec_page_state_local(
-			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
-		pages += lruvec_page_state_local(
-			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
 	} else
 #endif
 		pages = node_present_pages(sc->nid);
-- 
2.26.2



* Re: [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure
  2020-09-03 23:00 [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure Roman Gushchin
@ 2020-09-04  4:10 ` Andrew Morton
  2020-09-04  5:02   ` Roman Gushchin
  2020-09-09 14:55 ` Johannes Weiner
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2020-09-04  4:10 UTC
  To: Roman Gushchin
  Cc: linux-mm, Shakeel Butt, Johannes Weiner, Michal Hocko,
	kernel-team, linux-kernel, Michal Hocko

On Thu, 3 Sep 2020 16:00:55 -0700 Roman Gushchin <guro@fb.com> wrote:

> In the memcg case count_shadow_nodes() sums the number of pages in lru
> lists and the amount of slab memory (reclaimable and non-reclaimable)
> as a baseline for the allowed number of shadow entries.
> 
> It seems to be a good analogy for the !memcg case, where
> node_present_pages() is used. However, it's not quite true, as there
> are two problems:
> 
> 1) Due to slab reparenting introduced by commit fb2f2b0adb98 ("mm:
> memcg/slab: reparent memcg kmem_caches on cgroup removal") local
> per-lruvec slab counters might be inaccurate on non-leaf levels.
> It's the only place where local slab counters are used.
> 
> 2) Shadow nodes are themselves backed by slabs. So there is a circular
> dependency: the more shadow entries there are, the less pressure the
> kernel applies to reclaim them.
> 
> Fortunately, there is a simple way to solve both problems: slab
> counters shouldn't be taken into account by count_shadow_nodes().
> 
> ...
>
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -495,10 +495,6 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  			pages += lruvec_page_state_local(lruvec,
>  							 NR_LRU_BASE + i);
> -		pages += lruvec_page_state_local(
> -			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
> -		pages += lruvec_page_state_local(
> -			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
>  	} else
>  #endif
>  		pages = node_present_pages(sc->nid);

Did this have any observable runtime effects?


* Re: [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure
  2020-09-04  4:10 ` Andrew Morton
@ 2020-09-04  5:02   ` Roman Gushchin
  0 siblings, 0 replies; 6+ messages in thread
From: Roman Gushchin @ 2020-09-04  5:02 UTC
  To: Andrew Morton
  Cc: linux-mm, Shakeel Butt, Johannes Weiner, Michal Hocko,
	kernel-team, linux-kernel, Michal Hocko

On Thu, Sep 03, 2020 at 09:10:59PM -0700, Andrew Morton wrote:
> On Thu, 3 Sep 2020 16:00:55 -0700 Roman Gushchin <guro@fb.com> wrote:
> 
> > In the memcg case count_shadow_nodes() sums the number of pages in lru
> > lists and the amount of slab memory (reclaimable and non-reclaimable)
> > as a baseline for the allowed number of shadow entries.
> > 
> > It seems to be a good analogy for the !memcg case, where
> > node_present_pages() is used. However, it's not quite true, as there
> > are two problems:
> > 
> > 1) Due to slab reparenting introduced by commit fb2f2b0adb98 ("mm:
> > memcg/slab: reparent memcg kmem_caches on cgroup removal") local
> > per-lruvec slab counters might be inaccurate on non-leaf levels.
> > It's the only place where local slab counters are used.
> > 
> > 2) Shadow nodes are themselves backed by slabs. So there is a circular
> > dependency: the more shadow entries there are, the less pressure the
> > kernel applies to reclaim them.
> > 
> > Fortunately, there is a simple way to solve both problems: slab
> > counters shouldn't be taken into account by count_shadow_nodes().
> > 
> > ...
> >
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -495,10 +495,6 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >  		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
> >  			pages += lruvec_page_state_local(lruvec,
> >  							 NR_LRU_BASE + i);
> > -		pages += lruvec_page_state_local(
> > -			lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT;
> > -		pages += lruvec_page_state_local(
> > -			lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT;
> >  	} else
> >  #endif
> >  		pages = node_present_pages(sc->nid);
> 
> Did this have any observable runtime effects?

Most likely not.

I may have seen the second effect once, but it was confounded by a bug in the
inode reclaim path in the exact kernel version I used (not an upstream one).

The first problem is purely theoretical; I'm just not comfortable using these
counters, which are known to be inaccurate after reparenting.

That's why I didn't add stable@.

Thanks!


* Re: [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure
  2020-09-03 23:00 [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure Roman Gushchin
  2020-09-04  4:10 ` Andrew Morton
@ 2020-09-09 14:55 ` Johannes Weiner
  2020-09-09 16:55   ` Roman Gushchin
  1 sibling, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2020-09-09 14:55 UTC
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, Shakeel Butt, Michal Hocko,
	kernel-team, linux-kernel, Michal Hocko

On Thu, Sep 03, 2020 at 04:00:55PM -0700, Roman Gushchin wrote:
> In the memcg case count_shadow_nodes() sums the number of pages in lru
> lists and the amount of slab memory (reclaimable and non-reclaimable)
> as a baseline for the allowed number of shadow entries.
> 
> It seems to be a good analogy for the !memcg case, where
> node_present_pages() is used. However, it's not quite true, as there
> are two problems:
> 
> 1) Due to slab reparenting introduced by commit fb2f2b0adb98 ("mm:
> memcg/slab: reparent memcg kmem_caches on cgroup removal") local
> per-lruvec slab counters might be inaccurate on non-leaf levels.
> It's the only place where local slab counters are used.

Hm, that sounds like a bug tbh. We're reparenting the kmem caches and
the individual objects on the list_lru when a cgroup is removed -
shouldn't we also reparent the corresponding memory counters?

> 2) Shadow nodes are themselves backed by slabs. So there is a circular
> dependency: the more shadow entries there are, the less pressure the
> kernel applies to reclaim them.

This effect is negligible in practice.

The permitted shadow nodes are a tiny percentage of memory consumed by
the cgroup. If shadow nodes make up a significant part of the cgroup's
footprint, or are the only thing left, they will be pushed out fast.

The formula is max_nodes = total_pages >> 3, and one page can hold 28
nodes. So if the cgroup holds nothing but 262,144 pages (1G) of shadow
nodes, the shrinker target is 32,768 nodes, which is 32,768 pages
(128M) in the worst packing case and 1,170 pages (4M) at best.
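
A standalone sketch checking that arithmetic (the 4K page size and the
28-nodes-per-page figure are taken from the text above, not from kernel
headers):

#include <stdio.h>

int main(void)
{
	unsigned long total_pages = 262144;		/* 1G of 4K pages */
	unsigned long max_nodes = total_pages >> 3;	/* shrinker target */

	unsigned long worst_pages = max_nodes;		/* one node per page */
	unsigned long best_pages = max_nodes / 28;	/* 28 nodes per page */

	printf("shrinker target: %lu nodes\n", max_nodes);
	printf("worst packing: %lu pages (%lu MiB)\n",
	       worst_pages, worst_pages * 4096 / (1024 * 1024));
	printf("best packing:  %lu pages (~%lu MiB)\n",
	       best_pages, best_pages * 4096 / (1024 * 1024));
	return 0;
}

That prints 32768 nodes, 128 MiB worst case and ~4 MiB best case,
matching the numbers above.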

However, if you don't take slab into account here, it can evict shadow
entries with undue aggression when they are needed the most. If, say,
the inode or dentry caches explode temporarily and displace the page
cache, it would be a big problem to drop the cache's non-resident info
at the same time! This is when it's at its most important.

Let's drop this patch, please.


* Re: [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure
  2020-09-09 14:55 ` Johannes Weiner
@ 2020-09-09 16:55   ` Roman Gushchin
  2020-09-10 17:50     ` Johannes Weiner
  0 siblings, 1 reply; 6+ messages in thread
From: Roman Gushchin @ 2020-09-09 16:55 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, Shakeel Butt, Michal Hocko,
	kernel-team, linux-kernel, Michal Hocko

On Wed, Sep 09, 2020 at 10:55:34AM -0400, Johannes Weiner wrote:
> On Thu, Sep 03, 2020 at 04:00:55PM -0700, Roman Gushchin wrote:
> > In the memcg case count_shadow_nodes() sums the number of pages in lru
> > lists and the amount of slab memory (reclaimable and non-reclaimable)
> > as a baseline for the allowed number of shadow entries.
> > 
> > It seems to be a good analogy for the !memcg case, where
> > node_present_pages() is used. However, it's not quite true, as there
> > are two problems:
> > 
> > 1) Due to slab reparenting introduced by commit fb2f2b0adb98 ("mm:
> > memcg/slab: reparent memcg kmem_caches on cgroup removal") local
> > per-lruvec slab counters might be inaccurate on non-leaf levels.
> > It's the only place where local slab counters are used.
> 
> Hm, that sounds like a bug tbh. We're reparenting the kmem caches and
> the individual objects on the list_lru when a cgroup is removed -
> shouldn't we also reparent the corresponding memory counters?

It's definitely an option too; the question is whether the added code complexity
is really worth it. I'd say no if we were talking only about slab counters,
but when we eventually start reparenting the page cache, we'll need to reparent
other counters as well, so we'll need it anyway. So, ok, let's drop
this patch for now.

> 
> > 2) Shadow nodes are themselves backed by slabs. So there is a circular
> > dependency: the more shadow entries there are, the less pressure the
> > kernel applies to reclaim them.
> 
> This effect is negligible in practice.
> 
> The permitted shadow nodes are a tiny percentage of memory consumed by
> the cgroup. If shadow nodes make up a significant part of the cgroup's
> footprint, or are the only thing left, they will be pushed out fast.
> 
> The formula is max_nodes = total_pages >> 3, and one page can hold 28
> nodes. So if the cgroup holds nothing but 262,144 pages (1G) of shadow
> nodes, the shrinker target is 32,768 nodes, which is 32,768 pages
> (128M) in the worst packing case and 1,170 pages (4M) at best.
> 
> However, if you don't take slab into account here, it can evict shadow
> entries with undue aggression when they are needed the most. If, say,
> the inode or dentry caches explode temporarily and displace the page
> cache, it would be a big problem to drop the cache's non-resident info
> at the same time! This is when it's at its most important.

Just curious, have you seen this in real life?


* Re: [PATCH] mm: workingset: ignore slab memory size when calculating shadows pressure
  2020-09-09 16:55   ` Roman Gushchin
@ 2020-09-10 17:50     ` Johannes Weiner
  0 siblings, 0 replies; 6+ messages in thread
From: Johannes Weiner @ 2020-09-10 17:50 UTC
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, Shakeel Butt, Michal Hocko,
	kernel-team, linux-kernel, Michal Hocko

On Wed, Sep 09, 2020 at 09:55:20AM -0700, Roman Gushchin wrote:
> On Wed, Sep 09, 2020 at 10:55:34AM -0400, Johannes Weiner wrote:
> > On Thu, Sep 03, 2020 at 04:00:55PM -0700, Roman Gushchin wrote:
> > > In the memcg case count_shadow_nodes() sums the number of pages in lru
> > > lists and the amount of slab memory (reclaimable and non-reclaimable)
> > > as a baseline for the allowed number of shadow entries.
> > > 
> > > It seems to be a good analogy for the !memcg case, where
> > > node_present_pages() is used. However, it's not quite true, as there
> > > are two problems:
> > > 
> > > 1) Due to slab reparenting introduced by commit fb2f2b0adb98 ("mm:
> > > memcg/slab: reparent memcg kmem_caches on cgroup removal") local
> > > per-lruvec slab counters might be inaccurate on non-leaf levels.
> > > It's the only place where local slab counters are used.
> > 
> > Hm, that sounds like a bug tbh. We're reparenting the kmem caches and
> > the individual objects on the list_lru when a cgroup is removed -
> > shouldn't we also reparent the corresponding memory counters?
> 
> It's definitely an option too; the question is whether the added code complexity
> is really worth it. I'd say no if we were talking only about slab counters,
> but when we eventually start reparenting the page cache, we'll need to reparent
> other counters as well, so we'll need it anyway. So, ok, let's drop
> this patch for now.
> 
> > 
> > > 2) Shadow nodes are themselves backed by slabs. So there is a circular
> > > dependency: the more shadow entries there are, the less pressure the
> > > kernel applies to reclaim them.
> > 
> > This effect is negligible in practice.
> > 
> > The permitted shadow nodes are a tiny percentage of memory consumed by
> > the cgroup. If shadow nodes make up a significant part of the cgroup's
> > footprint, or are the only thing left, they will be pushed out fast.
> > 
> > The formula is max_nodes = total_pages >> 3, and one page can hold 28
> > nodes. So if the cgroup holds nothing but 262,144 pages (1G) of shadow
> > nodes, the shrinker target is 32,768 nodes, which is 32,768 pages
> > (128M) in the worst packing case and 1,170 pages (4M) at best.
> > 
> > However, if you don't take slab into account here, it can evict shadow
> > entries with undue aggression when they are needed the most. If, say,
> > the inode or dentry caches explode temporarily and displace the page
> > cache, it would be a big problem to drop the cache's non-resident info
> > at the same time! This is when it's at its most important.
> 
> Just curious, have you seen this in real life?

I have seen it with anon pages back when we targeted the page cache
instead of the total memory footprint. Especially in the context of
psi, it missed thrashing situations; see:

commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928
Author: Johannes Weiner <jweiner@fb.com>
Date:   Fri Oct 26 15:05:59 2018 -0700

    mm: workingset: don't drop refault information prematurely

I can't remember if I saw slabs doing the same, but it's equally
plausible.

