linux-mm.kvack.org archive mirror
* [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware
@ 2020-12-14 22:37 Yang Shi
  2020-12-14 22:37 ` [v2 PATCH 1/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
                   ` (8 more replies)
  0 siblings, 9 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel


Changelog
v1 --> v2:
    * Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
    * Folded patch #1 into patch #6 per Roman.
    * Added memory barrier to prevent shrink_slab_memcg from seeing NULL shrinker_maps/
      shrinker_deferred per Kirill.
    * Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
      allocations from expand with shrinker_rwsem per Johannes.

Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
It turned out that the shrinker had accumulated a huge number of nr_deferred
objects.

On our production machine, I saw an absurd number of nr_deferred, as shown in
the tracing result below:

<...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node.  Assuming all of them
are dentries (192 bytes per object), the total size of the deferred objects on
one node is ~480TB, which is definitely ridiculous.
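
As a quick sanity check of that estimate, the arithmetic can be sketched in
user-space C.  This is only an illustration: the helper names are made up for
this mail, and the 192 byte dentry size is the assumption stated above (the
real size depends on the kernel configuration).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed dentry size from the text above; not read from a real kernel. */
#define DENTRY_SIZE_BYTES 192ULL

/* Total bytes implied by a deferred count, assuming every object is a dentry. */
uint64_t deferred_bytes(uint64_t nr_deferred)
{
	return nr_deferred * DENTRY_SIZE_BYTES;
}

/* Same figure in whole decimal terabytes (10^12 bytes), rounded down. */
uint64_t deferred_terabytes(uint64_t nr_deferred)
{
	return deferred_bytes(nr_deferred) / 1000000000000ULL;
}
```

With the rounded 2.5 trillion figure this gives 480TB, and with the exact
count from the trace (2531805877005) it gives 486TB, so the estimate holds
either way.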

I managed to reproduce this problem with a kernel build workload plus a negative
dentry generator.

First, run the kernel build test script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
        cgcreate -g memory:kern_build
        echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

        echo 3 > /proc/sys/vm/drop_caches
        cgexec -g memory:kern_build make clean > /dev/null 2>&1
        cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

        cgdelete -g memory:kern_build
done

Then run the negative dentry generator script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
        while true; do
                FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                cat $FILE 2>/dev/null
        done &
done

Then kswapd will shrink half of the dentry cache in just one loop, as the tracing
result below shows:

	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There was a huge number of deferred objects before the shrinker was called.  The
behavior does match the code, but it might not be desirable from the user's point of view.

An excessive nr_deferred might accumulate for various reasons, for example:
    * GFP_NOFS allocations
    * A significant number of small scans (< scan_batch, 1024 for vfs metadata)

However, the LRUs of slabs are per memcg (for memcg-aware shrinkers) but the deferred
objects are per shrinker, which may have some bad effects:
    * Poor isolation among memcgs. Some memcgs which happen to have frequent limit
      reclaim may get nr_deferred accumulated to a huge number, and then other innocent
      memcgs may take the fall. In our case the main workload was hit.
    * Unbounded deferred objects. There is no cap on deferred objects, so the count can
      grow ridiculously large, as the tracing result showed.
    * Easy to get out of control. Although shrinkers take deferred objects into account,
      the count can go out of control easily. One misconfigured memcg could incur an absurd
      amount of deferred objects over a period of time.
    * Various reclaim problems, e.g. over reclaim, long reclaim latency, etc. There may be
      a hundred GB of slab caches for a vfs metadata heavy workload, and shrinking half of
      them may take minutes. We observed latency spikes due to the prolonged reclaim.

These issues have also been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
This patchset is the outcome of that discussion.

So this patchset makes nr_deferred per-memcg to tackle the problem. It does the following:
    * Adds memcg_shrinker_deferred per memcg per node, just like shrinker_map.
      Instead of a bitmap, it is an atomic_long_t array in which each element represents
      one shrinker, even if the shrinker is not memcg aware; this simplifies the
      implementation. For memcg aware shrinkers, the deferred objects are accumulated to
      their own memcg, and the shrinkers only see nr_deferred from their own memcg. Non
      memcg aware shrinkers still use the global nr_deferred from struct shrinker.
    * Once a memcg is offlined, its nr_deferred is reparented to its parent along
      with the LRUs.
    * The root memcg has a memcg_shrinker_deferred array too, which simplifies the
      handling of reparenting to the root memcg.
    * Caps nr_deferred to 2x the length of the lru. The idea is borrowed from Dave Chinner's
      series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
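
To make the last point concrete, here is a minimal user-space sketch of such a
cap.  It is not the code from this series: cap_nr_deferred() is a hypothetical
helper, and "length of the lru" is assumed to be the freeable count that the
shrinker's count_objects() callback returns.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: clamp a deferred count to twice the number of
 * freeable objects. Both inputs are expected to be non-negative.
 */
int64_t cap_nr_deferred(int64_t nr_deferred, int64_t freeable)
{
	int64_t cap;

	assert(nr_deferred >= 0 && freeable >= 0);

	/* Avoid overflow when doubling a very large freeable count. */
	cap = freeable > INT64_MAX / 2 ? INT64_MAX : 2 * freeable;

	return nr_deferred > cap ? cap : nr_deferred;
}
```

Applied to the kswapd trace above (4994376020 deferred objects against
93689873 cache items), such a cap would bound the deferred count at 187379746.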

The downside is that each memcg has to allocate extra memory to store the nr_deferred
array.  In our production environment, there are typically around 40 shrinkers, so each
memcg needs ~320 bytes.  10K memcgs would need ~3.2MB of memory, which seems fine.
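
The overhead estimate can be reproduced with a small sketch.  The assumptions
are mine: an 8 byte counter (atomic_long_t on a 64-bit kernel), the struct's
rcu_head ignored, and hypothetical helper names.  Since the array is allocated
per node, the node count is a parameter; the figures above correspond to one
node.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: atomic_long_t occupies 8 bytes, as on a 64-bit kernel. */
#define COUNTER_BYTES 8

/* Bytes for one nr_deferred array: one counter per shrinker id. */
size_t deferred_array_bytes(size_t nr_shrinkers)
{
	return nr_shrinkers * COUNTER_BYTES;
}

/* Total bytes across all memcgs and nodes, ignoring the rcu_head header. */
size_t total_deferred_bytes(size_t nr_memcgs, size_t nr_nodes,
			    size_t nr_shrinkers)
{
	return nr_memcgs * nr_nodes * deferred_array_bytes(nr_shrinkers);
}
```

On a multi-node machine the total scales linearly with the number of nodes.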

We have been running the patched kernel on some hosts of our fleet (test and production) for
months, and it works very well.  The monitoring data shows the working set is sustained as
expected.


Yang Shi (9):
      mm: vmscan: use nid from shrink_control for tracepoint
      mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
      mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg
      mm: vmscan: use a new flag to indicate shrinker is registered
      mm: memcontrol: add per memcg shrinker nr_deferred
      mm: vmscan: use per memcg nr_deferred of shrinker
      mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
      mm: memcontrol: reparent nr_deferred when memcg offline
      mm: vmscan: shrink deferred objects proportional to priority

 include/linux/memcontrol.h |   9 +++++
 include/linux/shrinker.h   |  11 ++++--
 mm/internal.h              |   1 +
 mm/memcontrol.c            | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c                | 193 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------
 5 files changed, 285 insertions(+), 85 deletions(-)




* [v2 PATCH 1/9] mm: vmscan: use nid from shrink_control for tracepoint
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-14 22:37 ` [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The tracepoint's nid should show which node the shrink happens on.  The start tracepoint
uses the nid from shrinkctl, but the nid might be set to 0 before the end tracepoint if the
shrinker is not NUMA aware, so the tracing log may show the shrink starting on one
node but ending on another, which is confusing.  A following patch will also stop
using nid directly in do_shrink_slab(), so this patch helps clean up the code as well.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b4e31eac2cf..48c06c48b97e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -537,7 +537,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	else
 		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
 	return freed;
 }
 
-- 
2.26.2




* [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
  2020-12-14 22:37 ` [v2 PATCH 1/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-15  2:09   ` Dave Chinner
  2020-12-15 14:07   ` Johannes Weiner
  2020-12-14 22:37 ` [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg Yang Shi
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
exclusively, the read side can be protected by holding the read lock, so a dedicated
mutex seems superfluous.  This should not exacerbate contention on shrinker_rwsem
since just one read side critical section is added.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/internal.h   |  1 +
 mm/memcontrol.c | 17 +++++++----------
 mm/vmscan.c     |  2 +-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..10c79d199aaa 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -108,6 +108,7 @@ extern unsigned long highest_memmap_pfn;
 /*
  * in mm/vmscan.c:
  */
+extern struct rw_semaphore shrinker_rwsem;
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 29459a6ce1c7..ed942734235f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -394,8 +394,8 @@ DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
+/* It is only can be changed with holding shrinker_rwsem exclusively */
 static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -408,8 +408,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 	struct memcg_shrinker_map *new, *old;
 	int nid;
 
-	lockdep_assert_held(&memcg_shrinker_map_mutex);
-
 	for_each_node(nid) {
 		old = rcu_dereference_protected(
 			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
@@ -458,7 +456,7 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	if (mem_cgroup_is_root(memcg))
 		return 0;
 
-	mutex_lock(&memcg_shrinker_map_mutex);
+	down_read(&shrinker_rwsem);
 	size = memcg_shrinker_map_size;
 	for_each_node(nid) {
 		map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
@@ -469,7 +467,7 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 		}
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
 	}
-	mutex_unlock(&memcg_shrinker_map_mutex);
+	up_read(&shrinker_rwsem);
 
 	return ret;
 }
@@ -484,9 +482,8 @@ int memcg_expand_shrinker_maps(int new_id)
 	if (size <= old_size)
 		return 0;
 
-	mutex_lock(&memcg_shrinker_map_mutex);
 	if (!root_mem_cgroup)
-		goto unlock;
+		goto out;
 
 	for_each_mem_cgroup(memcg) {
 		if (mem_cgroup_is_root(memcg))
@@ -494,13 +491,13 @@ int memcg_expand_shrinker_maps(int new_id)
 		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
-			goto unlock;
+			goto out;
 		}
 	}
-unlock:
+out:
 	if (!ret)
 		memcg_shrinker_map_size = size;
-	mutex_unlock(&memcg_shrinker_map_mutex);
+
 	return ret;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48c06c48b97e..912c044301dd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -184,7 +184,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 }
 
 static LIST_HEAD(shrinker_list);
-static DECLARE_RWSEM(shrinker_rwsem);
+DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_MEMCG
 /*
-- 
2.26.2




* [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
  2020-12-14 22:37 ` [v2 PATCH 1/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
  2020-12-14 22:37 ` [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-15  2:04   ` Dave Chinner
                     ` (2 more replies)
  2020-12-14 22:37 ` [v2 PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

shrink_slab_memcg() races with mem_cgroup_css_online().  Visibility of the CSS_ONLINE flag
in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
memcg->nodeinfo[nid]->shrinker_map != NULL.  This may occur because of processor reordering
on !x86.

This looks like the case below:

           CPU A          CPU B
store shrinker_map      load CSS_ONLINE
store CSS_ONLINE        load shrinker_map

So the memory ordering can be guaranteed by an smp_wmb()/smp_rmb() pair.

The memory barrier pair will also guarantee the ordering between shrinker_deferred and
CSS_ONLINE for the following patches.
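
The pairing can be modelled in user space with C11 fences.  This is only an
analogue under stated assumptions, not kernel code: the fake_* names are
hypothetical, and release/acquire fences are somewhat stronger than
smp_wmb()/smp_rmb(), which only order stores and loads respectively.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: a published pointer plus an online flag. */
struct fake_memcg {
	_Atomic(int *) shrinker_map;
	atomic_bool online;
};

/* Writer side (CPU A): store the map first, then the online flag. */
void fake_css_online(struct fake_memcg *memcg, int *map)
{
	atomic_store_explicit(&memcg->shrinker_map, map, memory_order_relaxed);
	/* Orders the map store before the flag store, like smp_wmb(). */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&memcg->online, true, memory_order_relaxed);
}

/* Reader side (CPU B): returns the map if the memcg is online, else NULL. */
int *fake_shrink_lookup(struct fake_memcg *memcg)
{
	if (!atomic_load_explicit(&memcg->online, memory_order_relaxed))
		return NULL;
	/* Orders the flag load before the map load, like smp_rmb(). */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&memcg->shrinker_map, memory_order_relaxed);
}
```

With both fences in place, a reader that observes the flag as set is
guaranteed to observe the map pointer stored before it.  Dropping either fence
permits the reordering shown in the diagram above.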

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memcontrol.c | 7 +++++++
 mm/vmscan.c     | 8 +++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed942734235f..3d4ddbb84a01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5406,6 +5406,13 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		return -ENOMEM;
 	}
 
+	/*
+	 * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees shirnker_maps
+	 * and shrinker_deferred before CSS_ONLINE. It pairs with the read barrier
+	 * in shrink_slab_memcg().
+	 */
+	smp_wmb();
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 912c044301dd..9b31b9c419ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -552,13 +552,15 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
+	/* Pairs with write barrier in mem_cgroup_css_online() */
+	smp_rmb();
+
 	if (!down_read_trylock(&shrinker_rwsem))
 		return 0;
 
+	/* Once memcg is online it can't be NULL */
 	map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
 					true);
-	if (unlikely(!map))
-		goto unlock;
 
 	for_each_set_bit(i, map->map, shrinker_nr_max) {
 		struct shrink_control sc = {
@@ -612,7 +614,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			break;
 		}
 	}
-unlock:
+
 	up_read(&shrinker_rwsem);
 	return freed;
 }
-- 
2.26.2




* [v2 PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (2 preceding siblings ...)
  2020-12-14 22:37 ` [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-14 22:37 ` [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
This approach is fine with nr_deferred at the shrinker level, but the following
patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
from unregistering correctly.

Introduce a new flag to indicate whether a shrinker is registered.
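
The idea can be sketched with a simplified user-space model.  The fake_* names
and flag values are hypothetical stand-ins and locking is omitted; the real
change is in the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits mirroring the layout introduced by this patch. */
#define FAKE_SHRINKER_REGISTERED	(1u << 0)
#define FAKE_SHRINKER_MEMCG_AWARE	(1u << 2)

struct fake_shrinker {
	unsigned int flags;
	long *nr_deferred;	/* may be NULL for memcg aware shrinkers */
};

/* Mark the shrinker as registered, independent of nr_deferred. */
void fake_register(struct fake_shrinker *s)
{
	s->flags |= FAKE_SHRINKER_REGISTERED;
}

/* Returns true if the shrinker was registered and has now been removed. */
bool fake_unregister(struct fake_shrinker *s)
{
	if (!(s->flags & FAKE_SHRINKER_REGISTERED))
		return false;
	s->flags &= ~FAKE_SHRINKER_REGISTERED;
	return true;
}
```

A memcg aware shrinker with a NULL nr_deferred can still be unregistered
exactly once in this model, which is the property the old NULL check could not
provide.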

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/shrinker.h |  7 ++++---
 mm/vmscan.c              | 13 +++++++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1eac79ce57d4 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -79,13 +79,14 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE	(1 << 0)
-#define SHRINKER_MEMCG_AWARE	(1 << 1)
+#define SHRINKER_REGISTERED	(1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 1)
+#define SHRINKER_MEMCG_AWARE	(1 << 2)
 /*
  * It just makes sense when the shrinker is also MEMCG_AWARE for now,
  * non-MEMCG_AWARE shrinker should not have this flag set.
  */
-#define SHRINKER_NONSLAB	(1 << 2)
+#define SHRINKER_NONSLAB	(1 << 3)
 
 extern int prealloc_shrinker(struct shrinker *shrinker);
 extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b31b9c419ec..16c9d2aeeb26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -378,6 +378,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		idr_replace(&shrinker_idr, shrinker, shrinker->id);
 #endif
+	shrinker->flags |= SHRINKER_REGISTERED;
 	up_write(&shrinker_rwsem);
 }
 
@@ -397,13 +398,17 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
-		return;
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-		unregister_memcg_shrinker(shrinker);
 	down_write(&shrinker_rwsem);
+	if (!(shrinker->flags & SHRINKER_REGISTERED)) {
+		up_write(&shrinker_rwsem);
+		return;
+	}
 	list_del(&shrinker->list);
+	shrinker->flags &= ~SHRINKER_REGISTERED;
 	up_write(&shrinker_rwsem);
+
+	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+		unregister_memcg_shrinker(shrinker);
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
 }
-- 
2.26.2




* [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (3 preceding siblings ...)
  2020-12-14 22:37 ` [v2 PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-15  2:22   ` Dave Chinner
  2020-12-14 22:37 ` [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently the number of deferred objects is tracked per shrinker, but some slabs, for
example the vfs inode/dentry caches, are per memcg.  This results in poor isolation among
memcgs.

The deferred objects are typically generated by GFP_NOFS allocations.  One memcg with
excessive GFP_NOFS allocations may blow up deferred objects, and then other innocent memcgs
may suffer from over shrink, excessive reclaim latency, etc.

For example, two workloads run in memcgA and memcgB respectively, and the workload in B is
vfs heavy.  The workload in A generates excessive deferred objects, and then B's vfs cache
might be hit heavily (dropping half of its caches) by B's limit reclaim or global reclaim.

We observed this in our production environment, which was running a vfs heavy workload,
as shown in the tracing log below:

<...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138
<...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
last shrinker return val 123186855

The ratio of vfs cache to page cache was 10:1 on this machine, and half of the caches were
dropped.  This also resulted in a significant amount of page cache being dropped due to
inode eviction.

Making nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
better isolation.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
is used.  Non memcg aware shrinkers use the shrinker's nr_deferred all the time.
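
The patch below also grows each per-memcg array when a new shrinker id no
longer fits.  A minimal user-space sketch of that expand step, assuming plain
malloc() in place of kvmalloc_node() and an immediate free() in place of the
RCU-deferred free, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: grow an array of counters from old_n to new_n
 * entries. Old values are copied, new slots are zeroed, and the old
 * array is freed. Returns NULL (leaving the old array intact) on failure.
 */
long *expand_deferred(long *old, size_t old_n, size_t new_n)
{
	long *new_arr;

	assert(new_n > 0 && new_n >= old_n);

	new_arr = malloc(new_n * sizeof(*new_arr));
	if (!new_arr)
		return NULL;

	/* Copy all old values, and clear all new ones. */
	if (old_n)
		memcpy(new_arr, old, old_n * sizeof(*new_arr));
	memset(new_arr + old_n, 0, (new_n - old_n) * sizeof(*new_arr));

	free(old);
	return new_arr;
}
```

In the kernel the old array cannot be freed immediately because readers may
still hold a reference, which is why the patch uses call_rcu() instead.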

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |   9 +++
 mm/memcontrol.c            | 110 ++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   4 ++
 3 files changed, 120 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 922a7f600465..1b343b268359 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,13 @@ struct lruvec_stat {
 	long count[NR_VM_NODE_STAT_ITEMS];
 };
 
+
+/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
+struct memcg_shrinker_deferred {
+	struct rcu_head rcu;
+	atomic_long_t nr_deferred[];
+};
+
 /*
  * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
  * which have elements charged to this memcg.
@@ -119,6 +126,7 @@ struct mem_cgroup_per_node {
 	struct mem_cgroup_reclaim_iter	iter;
 
 	struct memcg_shrinker_map __rcu	*shrinker_map;
+	struct memcg_shrinker_deferred __rcu	*shrinker_deferred;
 
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long		usage_in_excess;/* Set to the value by which */
@@ -1489,6 +1497,7 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 }
 
 extern int memcg_expand_shrinker_maps(int new_id);
+extern int memcg_expand_shrinker_deferred(int new_id);
 
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3d4ddbb84a01..321d1818ce3d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -394,14 +394,20 @@ DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
-/* It is only can be changed with holding shrinker_rwsem exclusively */
+/* They are only can be changed with holding shrinker_rwsem exclusively */
 static int memcg_shrinker_map_size;
+static int memcg_shrinker_deferred_size;
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 {
 	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
 }
 
+static void memcg_free_shrinker_deferred_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct memcg_shrinker_deferred, rcu));
+}
+
 static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 					 int size, int old_size)
 {
@@ -430,6 +436,34 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static int memcg_expand_one_shrinker_deferred(struct mem_cgroup *memcg,
+					      int size, int old_size)
+{
+	struct memcg_shrinker_deferred *new, *old;
+	int nid;
+
+	for_each_node(nid) {
+		old = rcu_dereference_protected(
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_deferred, true);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if (!new)
+			return -ENOMEM;
+
+		/* Copy all old values, and clear all new ones */
+		memcpy((void *)new->nr_deferred, (void *)old->nr_deferred, old_size);
+		memset((void *)new->nr_deferred + old_size, 0, size - old_size);
+
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, new);
+		call_rcu(&old->rcu, memcg_free_shrinker_deferred_rcu);
+	}
+
+	return 0;
+}
+
 static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_node *pn;
@@ -448,6 +482,21 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 	}
 }
 
+static void memcg_free_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct memcg_shrinker_deferred *deferred;
+	int nid;
+
+	for_each_node(nid) {
+		pn = mem_cgroup_nodeinfo(memcg, nid);
+		deferred = rcu_dereference_protected(pn->shrinker_deferred, true);
+		if (deferred)
+			kvfree(deferred);
+		rcu_assign_pointer(pn->shrinker_deferred, NULL);
+	}
+}
+
 static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct memcg_shrinker_map *map;
@@ -472,6 +521,27 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static int memcg_alloc_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_deferred *deferred;
+	int nid, size, ret = 0;
+
+	down_read(&shrinker_rwsem);
+	size = memcg_shrinker_deferred_size;
+	for_each_node(nid) {
+		deferred = kvzalloc_node(sizeof(*deferred) + size, GFP_KERNEL, nid);
+		if (!deferred) {
+			memcg_free_shrinker_deferred(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, deferred);
+	}
+	up_read(&shrinker_rwsem);
+
+	return ret;
+}
+
 int memcg_expand_shrinker_maps(int new_id)
 {
 	int size, old_size, ret = 0;
@@ -501,6 +571,33 @@ int memcg_expand_shrinker_maps(int new_id)
 	return ret;
 }
 
+int memcg_expand_shrinker_deferred(int new_id)
+{
+	int size, old_size, ret = 0;
+	struct mem_cgroup *memcg;
+
+	size = (new_id + 1) * sizeof(atomic_long_t);
+	old_size = memcg_shrinker_deferred_size;
+	if (size <= old_size)
+		return 0;
+
+	if (!root_mem_cgroup)
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		ret = memcg_expand_one_shrinker_deferred(memcg, size, old_size);
+		if (ret) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto out;
+		}
+	}
+out:
+	if (!ret)
+		memcg_shrinker_deferred_size = size;
+
+	return ret;
+}
+
 void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
@@ -5397,8 +5494,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	/*
-	 * A memcg must be visible for memcg_expand_shrinker_maps()
-	 * by the time the maps are allocated. So, we allocate maps
+	 * A memcg must be visible for memcg_expand_shrinker_{maps|deferred}()
+	 * by the time the maps are allocated. So, we allocate maps and deferred
 	 * here, when for_each_mem_cgroup() can't skip it.
 	 */
 	if (memcg_alloc_shrinker_maps(memcg)) {
@@ -5406,6 +5503,12 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		return -ENOMEM;
 	}
 
+	if (memcg_alloc_shrinker_deferred(memcg)) {
+		memcg_free_shrinker_maps(memcg);
+		mem_cgroup_id_remove(memcg);
+		return -ENOMEM;
+	}
+
 	/*
 	 * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees shirnker_maps
 	 * and shrinker_deferred before CSS_ONLINE. It pairs with the read barrier
@@ -5473,6 +5576,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
 	memcg_free_shrinker_maps(memcg);
+	memcg_free_shrinker_deferred(memcg);
 	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 16c9d2aeeb26..bf34167dd67e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -219,6 +219,10 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 			goto unlock;
 		}
 
+		if (memcg_expand_shrinker_deferred(id)) {
+			idr_remove(&shrinker_idr, id);
+			goto unlock;
+		}
 		shrinker_nr_max = id + 1;
 	}
 	shrinker->id = id;
-- 
2.26.2




* [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (4 preceding siblings ...)
  2020-12-14 22:37 ` [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-15  2:46   ` Dave Chinner
  2020-12-14 22:37 ` [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Use the per memcg nr_deferred for memcg aware shrinkers.  The shrinker's own nr_deferred
will be used in the following cases:
    1. Non memcg aware shrinkers
    2. !CONFIG_MEMCG
    3. memcg is disabled by boot parameter

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 83 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf34167dd67e..bce8cf44eca2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -203,6 +203,12 @@ DECLARE_RWSEM(shrinker_rwsem);
 static DEFINE_IDR(shrinker_idr);
 static int shrinker_nr_max;
 
+static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
+{
+	return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
+		!mem_cgroup_disabled();
+}
+
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
 	int id, ret = -ENOMEM;
@@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 #endif
 	return false;
 }
+
+static inline long count_nr_deferred(struct shrinker *shrinker,
+				     struct shrink_control *sc)
+{
+	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
+	struct memcg_shrinker_deferred *deferred;
+	struct mem_cgroup *memcg = sc->memcg;
+	int nid = sc->nid;
+	int id = shrinker->id;
+	long nr;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (per_memcg_deferred) {
+		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
+						     true);
+		nr = atomic_long_xchg(&deferred->nr_deferred[id], 0);
+	} else
+		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+
+	return nr;
+}
+
+static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
+				   struct shrink_control *sc)
+{
+	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
+	struct memcg_shrinker_deferred *deferred;
+	struct mem_cgroup *memcg = sc->memcg;
+	int nid = sc->nid;
+	int id = shrinker->id;
+	long new_nr;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (per_memcg_deferred) {
+		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
+						     true);
+		new_nr = atomic_long_add_return(nr, &deferred->nr_deferred[id]);
+	} else
+		new_nr = atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
+
+	return new_nr;
+}
 #else
+static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
+{
+	return false;
+}
+
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
 	return 0;
@@ -290,6 +347,29 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
+
+static inline long count_nr_deferred(struct shrinker *shrinker,
+				     struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+}
+
+static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
+				   struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	return atomic_long_add_return(nr,
+				      &shrinker->nr_deferred[nid]);
+}
 #endif
 
 /*
@@ -429,13 +509,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	long freeable;
 	long nr;
 	long new_nr;
-	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
 	long scanned = 0, next_deferred;
 
-	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-		nid = 0;
 
 	freeable = shrinker->count_objects(shrinker, shrinkctl);
 	if (freeable == 0 || freeable == SHRINK_EMPTY)
@@ -446,7 +523,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * and zero it so that other concurrent shrinker invocations
 	 * don't also do this scanning work.
 	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+	nr = count_nr_deferred(shrinker, shrinkctl);
 
 	total_scan = nr;
 	if (shrinker->seeks) {
@@ -537,14 +614,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		next_deferred = 0;
 	/*
 	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates. If we exhausted the
-	 * scan, there is no need to do an update.
+	 * manner that handles concurrent updates.
 	 */
-	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
-						&shrinker->nr_deferred[nid]);
-	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+	new_nr = set_nr_deferred(next_deferred, shrinker, shrinkctl);
 
 	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
 	return freed;
-- 
2.26.2




* [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (5 preceding siblings ...)
  2020-12-14 22:37 ` [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-15  3:05   ` Dave Chinner
  2020-12-14 22:37 ` [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
  2020-12-14 22:37 ` [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
  8 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Now that nr_deferred is available at the per-memcg level for memcg aware shrinkers, there is
no need to allocate shrinker->nr_deferred for such shrinkers anymore.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bce8cf44eca2..8d5bfd818acd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -420,7 +420,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
  */
 int prealloc_shrinker(struct shrinker *shrinker)
 {
-	unsigned int size = sizeof(*shrinker->nr_deferred);
+	unsigned int size;
+
+	if (is_deferred_memcg_aware(shrinker)) {
+		if (prealloc_memcg_shrinker(shrinker))
+			return -ENOMEM;
+		return 0;
+	}
+
+	size = sizeof(*shrinker->nr_deferred);
 
 	if (shrinker->flags & SHRINKER_NUMA_AWARE)
 		size *= nr_node_ids;
@@ -429,26 +437,18 @@ int prealloc_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		if (prealloc_memcg_shrinker(shrinker))
-			goto free_deferred;
-	}
-
 	return 0;
-
-free_deferred:
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
-	return -ENOMEM;
 }
 
 void free_prealloced_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
+	if (is_deferred_memcg_aware(shrinker)) {
+		unregister_memcg_shrinker(shrinker);
 		return;
+	}
 
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-		unregister_memcg_shrinker(shrinker);
+	if (!shrinker->nr_deferred)
+		return;
 
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
-- 
2.26.2




* [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (6 preceding siblings ...)
  2020-12-14 22:37 ` [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-15  3:07   ` Dave Chinner
  2020-12-14 22:37 ` [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
  8 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Now that the shrinker's nr_deferred is per memcg for memcg aware shrinkers, add it to the
parent's corresponding nr_deferred when the memcg goes offline.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/shrinker.h |  4 ++++
 mm/memcontrol.c          | 24 ++++++++++++++++++++++++
 mm/vmscan.c              |  2 +-
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 1eac79ce57d4..85cfc910dde4 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -78,6 +78,10 @@ struct shrinker {
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
+#ifdef CONFIG_MEMCG
+extern int shrinker_nr_max;
+#endif
+
 /* Flags */
 #define SHRINKER_REGISTERED	(1 << 0)
 #define SHRINKER_NUMA_AWARE	(1 << 1)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 321d1818ce3d..1f191a15bee1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -59,6 +59,7 @@
 #include <linux/tracehook.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/shrinker.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -612,6 +613,28 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 	}
 }
 
+static void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	int i, nid;
+	long nr;
+	struct mem_cgroup *parent;
+	struct memcg_shrinker_deferred *child_deferred, *parent_deferred;
+
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+
+	for_each_node(nid) {
+		child_deferred = memcg->nodeinfo[nid]->shrinker_deferred;
+		parent_deferred = parent->nodeinfo[nid]->shrinker_deferred;
+		for (i = 0; i < shrinker_nr_max; i ++) {
+			nr = atomic_long_read(&child_deferred->nr_deferred[i]);
+			atomic_long_add(nr,
+				&parent_deferred->nr_deferred[i]);
+		}
+	}
+}
+
 /**
  * mem_cgroup_css_from_page - css of the memcg associated with a page
  * @page: page of interest
@@ -5543,6 +5566,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	page_counter_set_low(&memcg->memory, 0);
 
 	memcg_offline_kmem(memcg);
+	memcg_reparent_shrinker_deferred(memcg);
 	wb_memcg_offline(memcg);
 
 	drain_all_stock(memcg);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8d5bfd818acd..693a41e89969 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -201,7 +201,7 @@ DECLARE_RWSEM(shrinker_rwsem);
 #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
 
 static DEFINE_IDR(shrinker_idr);
-static int shrinker_nr_max;
+int shrinker_nr_max;
 
 static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
 {
-- 
2.26.2




* [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority
  2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (7 preceding siblings ...)
  2020-12-14 22:37 ` [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
@ 2020-12-14 22:37 ` Yang Shi
  2020-12-15  3:23   ` Dave Chinner
  8 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-14 22:37 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The number of deferred objects might wind up to an absurd number, which results in
the clamping of slab objects.  This is undesirable for sustaining the working set.

So shrink deferred objects in proportion to priority and cap nr_deferred at twice the
number of cache items.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 40 +++++-----------------------------------
 1 file changed, 5 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 693a41e89969..58f4a383f0df 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -525,7 +525,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 */
 	nr = count_nr_deferred(shrinker, shrinkctl);
 
-	total_scan = nr;
 	if (shrinker->seeks) {
 		delta = freeable >> priority;
 		delta *= 4;
@@ -539,37 +538,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		delta = freeable / 2;
 	}
 
+	total_scan = nr >> priority;
 	total_scan += delta;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = freeable;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
-
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
-
-	/*
-	 * Avoid risking looping forever due to too large nr value:
-	 * never try to free more than twice the estimate number of
-	 * freeable entries.
-	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	total_scan = min(total_scan, (2 * freeable));
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
@@ -608,10 +579,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
-	else
-		next_deferred = 0;
+	next_deferred = max_t(long, (nr - scanned), 0) + total_scan;
+	next_deferred = min(next_deferred, (2 * freeable));
+
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates.
-- 
2.26.2




* Re: [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg
  2020-12-14 22:37 ` [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg Yang Shi
@ 2020-12-15  2:04   ` Dave Chinner
       [not found]   ` <20201215123802.GA379720@cmpxchg.org>
  2020-12-15 17:14   ` Johannes Weiner
  2 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2020-12-15  2:04 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
> The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
> memcg->nodeinfo[nid]->shrinker_maps != NULL.  This may occur because of processor reordering
> on !x86.
> 
> This seems like the below case:
> 
>            CPU A          CPU B
> store shrinker_map      load CSS_ONLINE
> store CSS_ONLINE        load shrinker_map
> 
> So the memory ordering could be guaranteed by smp_wmb()/smp_rmb() pair.
> 
> The memory barriers pair will guarantee the ordering between shrinker_deferred and CSS_ONLINE
> for the following patches as well.

This should not require memory barriers in the shrinker code.

The code that sets and checks the CSS_ONLINE flag should have the
memory barriers to ensure that anything that sees an online CSS will
see it completely set up.

That is, the function online_css() that sets CSS_ONLINE needs
a memory barrier to ensure all previous writes are completed before
the CSS_ONLINE flag is set, and the function mem_cgroup_online()
needs a barrier to pair with that.

This is the same existence issue that the superblock shrinkers have
with the shrinkers being registered before the superblock is fully
set up. The SB_BORN flag on the superblock indicates the superblock
is now fully set up ("online" in CSS speak) and the registered
shrinker can run. Please see the smp_wmb() before we set SB_BORN in
vfs_get_tree(), and the big comment about the smp_rmb() -after- we
check SB_BORN in super_cache_count() to understand the details of
the data dependency between the flag and the structures being set up
that the barriers enforce.

IOWs, these memory barriers belong inside the cgroup code to
guarantee anything that sees an online cgroup will always see the
fully initialised cgroup structures. They do not belong in the
shrinker infrastructure...
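
The pairing described above can be modelled in userspace. This is a
minimal sketch, not kernel code: struct css_model and both helpers are
hypothetical names, and C11 release/acquire ordering is used as an
analogue of (not an exact match for) the smp_wmb()/smp_rmb() pair. The
point it shows is that the ordering lives next to the flag, so any
reader that goes through the "online" check sees completed setup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; not the kernel's structure. */
struct css_model {
	unsigned long *shrinker_map;	/* must be set up before going online */
	atomic_int online;		/* models the CSS_ONLINE flag */
};

/* Writer side: finish all setup, then publish the flag with release order. */
static void css_model_set_online(struct css_model *css, unsigned long *map)
{
	assert(map != NULL);
	css->shrinker_map = map;
	atomic_store_explicit(&css->online, 1, memory_order_release);
}

/* Reader side: only touch the map after observing the flag with acquire. */
static unsigned long *css_model_get_map(struct css_model *css)
{
	if (!atomic_load_explicit(&css->online, memory_order_acquire))
		return NULL;
	return css->shrinker_map;
}
```

With this shape, callers such as a shrinker never need their own
barriers; they only call the reader helper.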

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-14 22:37 ` [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
@ 2020-12-15  2:09   ` Dave Chinner
  2020-12-15 13:53     ` Johannes Weiner
  2020-12-15 14:07   ` Johannes Weiner
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-12-15  2:09 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
> exclusively, the read side can be protected by holding the read lock, so a
> dedicated mutex seems superfluous.

I'm not sure this is a good idea. This couples the shrinker
infrastructure to internal details of how cgroups are initialised
and managed. Sure, certain operations might be done in certain
shrinker lock contexts, but that doesn't mean we should share global
locks across otherwise independent subsystems....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-14 22:37 ` [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
@ 2020-12-15  2:22   ` Dave Chinner
  2020-12-15 14:45     ` Johannes Weiner
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-12-15  2:22 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:18PM -0800, Yang Shi wrote:
> Currently the number of deferred objects are per shrinker, but some slabs, for example,
> vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> 
> The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> may suffer from over shrink, excessive reclaim latency, etc.
> 
> For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> 
> We observed this hit in our production environment which was running vfs heavy workload
> shown as the below tracing log:
> 
> <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> last shrinker return val 123186855
> 
> The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches were dropped.
> This also resulted in a significant amount of page cache being dropped due to inode eviction.
> 
> Making nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> better isolation.
> 
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/memcontrol.h |   9 +++
>  mm/memcontrol.c            | 110 ++++++++++++++++++++++++++++++++++++-
>  mm/vmscan.c                |   4 ++
>  3 files changed, 120 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 922a7f600465..1b343b268359 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -92,6 +92,13 @@ struct lruvec_stat {
>  	long count[NR_VM_NODE_STAT_ITEMS];
>  };
>  
> +
> +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> +struct memcg_shrinker_deferred {
> +	struct rcu_head rcu;
> +	atomic_long_t nr_deferred[];
> +};

So you're effectively copy and pasting the memcg_shrinker_map
infrastructure and doubling the number of allocations/frees required
to set up/tear down a memcg? Why not add it to the struct
memcg_shrinker_map like this:

struct memcg_shrinker_map {
        struct rcu_head	rcu;
	unsigned long	*map;
	atomic_long_t	*nr_deferred;
};

And when you dynamically allocate the structure, set the map and
nr_deferred pointers to the correct offset in the allocated range.

Then this patch is really only changes to the size of the chunk
being allocated, setting up the pointers and copying the relevant
data from the old to new.
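
As an illustration of the single-allocation layout suggested above,
here is a userspace sketch. It is an assumption-laden model, not the
kernel implementation: struct shrinker_info_model and its allocator are
hypothetical names, the rcu_head is left out, and calloc() stands in
for the kernel allocator. One chunk holds the header, the bitmap and
the counters, and the two pointers are set to offsets inside it.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Userspace model of the combined structure (rcu_head omitted). */
struct shrinker_info_model {
	unsigned long *map;		/* one bit per shrinker id */
	atomic_long *nr_deferred;	/* one counter per shrinker id */
};

/* Allocate header + bitmap + counters as a single chunk. */
static struct shrinker_info_model *shrinker_info_model_alloc(size_t nr_ids)
{
	size_t map_words = (nr_ids + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
	size_t map_bytes = map_words * sizeof(unsigned long);
	size_t def_bytes = nr_ids * sizeof(atomic_long);
	struct shrinker_info_model *info;
	size_t i;

	assert(nr_ids > 0);
	info = calloc(1, sizeof(*info) + map_bytes + def_bytes);
	if (!info)
		return NULL;

	/* The bitmap starts right after the header, counters after the bitmap. */
	info->map = (unsigned long *)(info + 1);
	info->nr_deferred = (atomic_long *)((char *)info->map + map_bytes);
	for (i = 0; i < nr_ids; i++)
		atomic_init(&info->nr_deferred[i], 0);
	return info;
}
```

A single free() of the returned pointer releases everything, which is
what halves the allocation and free count compared with two separate
structures.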

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-14 22:37 ` [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
@ 2020-12-15  2:46   ` Dave Chinner
  2020-12-15 22:27     ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-12-15  2:46 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:19PM -0800, Yang Shi wrote:
> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> will be used in the following cases:
>     1. Non memcg aware shrinkers
>     2. !CONFIG_MEMCG
>     3. memcg is disabled by boot parameter
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Lots of lines way over 80 columns.

> ---
>  mm/vmscan.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 83 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bf34167dd67e..bce8cf44eca2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -203,6 +203,12 @@ DECLARE_RWSEM(shrinker_rwsem);
>  static DEFINE_IDR(shrinker_idr);
>  static int shrinker_nr_max;
>  
> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> +{
> +	return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> +		!mem_cgroup_disabled();
> +}

Why do we care whether mem_cgroup_disabled() is true here? The return
of this function is then && sc->memcg, so if memcgs are disabled,
sc->memcg will never be set and this mem_cgroup_disabled() check is
completely redundant, right?

> +
>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
>  	int id, ret = -ENOMEM;
> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  #endif
>  	return false;
>  }
> +
> +static inline long count_nr_deferred(struct shrinker *shrinker,
> +				     struct shrink_control *sc)
> +{
> +	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> +	struct memcg_shrinker_deferred *deferred;
> +	struct mem_cgroup *memcg = sc->memcg;
> +	int nid = sc->nid;
> +	int id = shrinker->id;
> +	long nr;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	if (per_memcg_deferred) {
> +		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> +						     true);
> +		nr = atomic_long_xchg(&deferred->nr_deferred[id], 0);
> +	} else
> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +
> +	return nr;
> +}
> +
> +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> +				   struct shrink_control *sc)
> +{
> +	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> +	struct memcg_shrinker_deferred *deferred;
> +	struct mem_cgroup *memcg = sc->memcg;
> +	int nid = sc->nid;
> +	int id = shrinker->id;

Oh, that's a nasty trap. Nobody knows if you mean "id" or "nid" in
any of the code and a single letter typo results in a bug.

> +	long new_nr;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	if (per_memcg_deferred) {
> +		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> +						     true);
> +		new_nr = atomic_long_add_return(nr, &deferred->nr_deferred[id]);
> +	} else
> +		new_nr = atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> +
> +	return new_nr;
> +}
>  #else
> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> +{
> +	return false;
> +}
> +
>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
>  	return 0;
> @@ -290,6 +347,29 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  {
>  	return true;
>  }
> +
> +static inline long count_nr_deferred(struct shrinker *shrinker,
> +				     struct shrink_control *sc)
> +{
> +	int nid = sc->nid;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +}
> +
> +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> +				   struct shrink_control *sc)
> +{
> +	int nid = sc->nid;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	return atomic_long_add_return(nr,
> +				      &shrinker->nr_deferred[nid]);
> +}
>  #endif

This is pretty ... verbose. It doesn't need to be this complex at
all, and you shouldn't be duplicating code in multiple places. There
is also no need for any of these to be "inline" functions. The
compiler will do that for static functions automatically if it makes
sense.

Ok, so you only do the memcg nr_deferred thing if NUMA_AWARE &&
sc->memcg is true. so....

static long shrink_slab_set_nr_deferred_memcg(...)
{
	int nid = sc->nid;

	deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
					     true);
	return atomic_long_add_return(nr, &deferred->nr_deferred[id]);
}

static long shrink_slab_set_nr_deferred(...)
{
	int nid = sc->nid;

	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
		nid = 0;
	else if (sc->memcg)
		return shrink_slab_set_nr_deferred_memcg(...., nid);

	return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
}

And now there's no duplicated code.
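
For readers following the accounting, the take/give-back pattern that
these helpers wrap can be modelled in userspace as below. This is only
a sketch: the function names are hypothetical, and C11 atomics stand in
for atomic_long_xchg() and atomic_long_add_return().

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Take all deferred work and leave zero behind, so that concurrent
 * callers do not also scan the same deferred amount.
 */
static long nr_deferred_take(atomic_long *counter)
{
	return atomic_exchange(counter, 0);
}

/*
 * Give unused work back. Returns the new total, which is what
 * an add-and-return primitive would report.
 */
static long nr_deferred_give_back(atomic_long *counter, long nr)
{
	assert(nr >= 0);
	return atomic_fetch_add(counter, nr) + nr;
}
```

Whether the counter lives in the shrinker or in a per-memcg array only
changes which address is passed in; the two operations stay the same.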

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2020-12-14 22:37 ` [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
@ 2020-12-15  3:05   ` Dave Chinner
  2020-12-15 23:07     ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-12-15  3:05 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:20PM -0800, Yang Shi wrote:
> Now that nr_deferred is available at the per-memcg level for memcg aware shrinkers, there is
> no need to allocate shrinker->nr_deferred for such shrinkers anymore.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bce8cf44eca2..8d5bfd818acd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -420,7 +420,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
>   */
>  int prealloc_shrinker(struct shrinker *shrinker)
>  {
> -	unsigned int size = sizeof(*shrinker->nr_deferred);
> +	unsigned int size;
> +
> +	if (is_deferred_memcg_aware(shrinker)) {
> +		if (prealloc_memcg_shrinker(shrinker))
> +			return -ENOMEM;
> +		return 0;
> +	}
> +
> +	size = sizeof(*shrinker->nr_deferred);
>  
>  	if (shrinker->flags & SHRINKER_NUMA_AWARE)
>  		size *= nr_node_ids;
> @@ -429,26 +437,18 @@ int prealloc_shrinker(struct shrinker *shrinker)
>  	if (!shrinker->nr_deferred)
>  		return -ENOMEM;
>  
> -	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> -		if (prealloc_memcg_shrinker(shrinker))
> -			goto free_deferred;
> -	}
> -
>  	return 0;
> -
> -free_deferred:
> -	kfree(shrinker->nr_deferred);
> -	shrinker->nr_deferred = NULL;
> -	return -ENOMEM;
>  }

I'm trying to put my finger on it, but this seems wrong to me. If
memcgs are disabled, then prealloc_memcg_shrinker() needs to fail.
The preallocation code should not care about internal memcg details
like this.

	/*
	 * If the shrinker is memcg aware and memcgs are not
	 * enabled, clear the MEMCG flag and fall back to non-memcg
	 * behaviour for the shrinker.
	 */
	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
		error = prealloc_memcg_shrinker(shrinker);
		if (!error)
			return 0;
		if (error != -ENOSYS)
			return error;

		/* memcgs not enabled! */
		shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
	}

	size = sizeof(*shrinker->nr_deferred);
	....
	return 0;
}

This guarantees that only the shrinker instances that have a
correctly set up memcg attached to them will have the
SHRINKER_MEMCG_AWARE flag set. Hence in all the rest of the shrinker
code, we only ever need to check for SHRINKER_MEMCG_AWARE to
determine what we should do....
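
The fallback described above can be exercised in a small userspace
model. Everything here is hypothetical scaffolding (the MODEL_* flag
values, struct shrinker_model, and the memcg_enabled parameter that
replaces the real memcg state); it only sketches the control flow of
clearing the flag when the memcg preallocation reports -ENOSYS.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_NUMA_AWARE	(1u << 0)	/* hypothetical flag values */
#define MODEL_MEMCG_AWARE	(1u << 1)

struct shrinker_model {
	unsigned int flags;
	long *nr_deferred;	/* only allocated for non-memcg shrinkers */
};

/* Stand-in for prealloc_memcg_shrinker(): -ENOSYS when memcgs are off. */
static int model_prealloc_memcg(struct shrinker_model *s, bool memcg_enabled)
{
	(void)s;
	return memcg_enabled ? 0 : -ENOSYS;
}

static int model_prealloc(struct shrinker_model *s, bool memcg_enabled,
			  size_t nr_nodes)
{
	size_t n = 1;
	int error;

	assert(s != NULL);
	if (s->flags & MODEL_MEMCG_AWARE) {
		error = model_prealloc_memcg(s, memcg_enabled);
		if (!error)
			return 0;
		if (error != -ENOSYS)
			return error;
		/* memcgs not enabled: fall back to non-memcg behaviour */
		s->flags &= ~MODEL_MEMCG_AWARE;
	}

	if (s->flags & MODEL_NUMA_AWARE)
		n = nr_nodes;
	s->nr_deferred = calloc(n, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -ENOMEM;
}
```

After model_prealloc() returns 0, the flag alone tells the rest of the
code which counter to use, which is the property argued for above.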

>  void free_prealloced_shrinker(struct shrinker *shrinker)
>  {
> -	if (!shrinker->nr_deferred)
> +	if (is_deferred_memcg_aware(shrinker)) {
> +		unregister_memcg_shrinker(shrinker);
>  		return;
> +	}
>  
> -	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> -		unregister_memcg_shrinker(shrinker);
> +	if (!shrinker->nr_deferred)
> +		return;
>  
>  	kfree(shrinker->nr_deferred);
>  	shrinker->nr_deferred = NULL;

e.g. then this function can simply do:

{
	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		return unregister_memcg_shrinker(shrinker);
	kfree(shrinker->nr_deferred);
	shrinker->nr_deferred = NULL;
}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline
  2020-12-14 22:37 ` [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
@ 2020-12-15  3:07   ` Dave Chinner
  2020-12-15 23:10     ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-12-15  3:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:21PM -0800, Yang Shi wrote:
> Now that the shrinker's nr_deferred is per memcg for memcg aware shrinkers, add it to the
> parent's corresponding nr_deferred when the memcg goes offline.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/shrinker.h |  4 ++++
>  mm/memcontrol.c          | 24 ++++++++++++++++++++++++
>  mm/vmscan.c              |  2 +-
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 1eac79ce57d4..85cfc910dde4 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -78,6 +78,10 @@ struct shrinker {
>  };
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>  
> +#ifdef CONFIG_MEMCG
> +extern int shrinker_nr_max;
> +#endif
> +
>  /* Flags */
>  #define SHRINKER_REGISTERED	(1 << 0)
>  #define SHRINKER_NUMA_AWARE	(1 << 1)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 321d1818ce3d..1f191a15bee1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -59,6 +59,7 @@
>  #include <linux/tracehook.h>
>  #include <linux/psi.h>
>  #include <linux/seq_buf.h>
> +#include <linux/shrinker.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -612,6 +613,28 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>  	}
>  }
>  
> +static void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
> +{
> +	int i, nid;
> +	long nr;
> +	struct mem_cgroup *parent;
> +	struct memcg_shrinker_deferred *child_deferred, *parent_deferred;
> +
> +	parent = parent_mem_cgroup(memcg);
> +	if (!parent)
> +		parent = root_mem_cgroup;
> +
> +	for_each_node(nid) {
> +		child_deferred = memcg->nodeinfo[nid]->shrinker_deferred;
> +		parent_deferred = parent->nodeinfo[nid]->shrinker_deferred;
> +		for (i = 0; i < shrinker_nr_max; i ++) {
> +			nr = atomic_long_read(&child_deferred->nr_deferred[i]);
> +			atomic_long_add(nr,
> +				&parent_deferred->nr_deferred[i]);
> +		}
> +	}
> +}

I would place this function in vmscan.c alongside the
shrink_slab_set_nr_deferred_memcg() function so that all the
accounting is in the one place.
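
For reference, the inner loop of the quoted function amounts to the
following userspace model. It is a sketch only: the function name is
hypothetical, plain arrays replace the per-node structures, and C11
atomics replace atomic_long_read()/atomic_long_add(). Like the quoted
code, it reads the child's count without zeroing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Add each of the child's deferred counts to the parent's matching slot. */
static void reparent_deferred_model(atomic_long *child, atomic_long *parent,
				    size_t nr_ids)
{
	size_t i;

	assert(child != parent);	/* a memcg is never its own parent */
	for (i = 0; i < nr_ids; i++) {
		long nr = atomic_load(&child[i]);

		atomic_fetch_add(&parent[i], nr);
	}
}
```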

> +
>  /**
>   * mem_cgroup_css_from_page - css of the memcg associated with a page
>   * @page: page of interest
> @@ -5543,6 +5566,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	page_counter_set_low(&memcg->memory, 0);
>  
>  	memcg_offline_kmem(memcg);
> +	memcg_reparent_shrinker_deferred(memcg);
>  	wb_memcg_offline(memcg);
>  
>  	drain_all_stock(memcg);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8d5bfd818acd..693a41e89969 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -201,7 +201,7 @@ DECLARE_RWSEM(shrinker_rwsem);
>  #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
>  
>  static DEFINE_IDR(shrinker_idr);
> -static int shrinker_nr_max;
> +int shrinker_nr_max;

Then we don't need to make yet another variable global...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority
  2020-12-14 22:37 ` [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
@ 2020-12-15  3:23   ` Dave Chinner
  2020-12-15 23:59     ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-12-15  3:23 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:22PM -0800, Yang Shi wrote:
> The number of deferred objects might get windup to an absurd number, and it results in
> clamp of slab objects.  It is undesirable for sustaining workingset.
> 
> So shrink deferred objects proportional to priority and cap nr_deferred to twice of
> cache items.

This completely changes the work accrual algorithm without any
explanation of how it works, what the theory behind the algorithm
is, what the work accrual ramp up and damp down curve looks like,
what workloads it is designed to benefit, how it affects page
cache vs slab cache balance and system performance, what OOM stress
testing has been done to ensure pure slab cache pressure workloads
don't easily trigger OOM kills, etc.

You're going to need a lot more supporting evidence that this is a
well thought out algorithm that doesn't obviously introduce
regressions. The current code might fall down in one corner case,
but there are an awful lot of corner cases where it does work.
Please provide some evidence that it not only works in your corner
case, but also doesn't introduce regressions for other slab cache
intensive and mixed cache intensive workloads...

> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 40 +++++-----------------------------------
>  1 file changed, 5 insertions(+), 35 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 693a41e89969..58f4a383f0df 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -525,7 +525,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 */
>  	nr = count_nr_deferred(shrinker, shrinkctl);
>  
> -	total_scan = nr;
>  	if (shrinker->seeks) {
>  		delta = freeable >> priority;
>  		delta *= 4;
> @@ -539,37 +538,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		delta = freeable / 2;
>  	}
>  
> +	total_scan = nr >> priority;

When there is low memory pressure, this will throw away a large
amount of the work that is deferred. If we are not deferring in
amounts larger than ~4000 items, every pass through this code will
zero the deferred work.

Hence when we do get substantial pressure, that deferred work is no
longer being tracked. While it may help your specific corner case,
it's likely to significantly change the reclaim balance of slab
caches, especially under GFP_NOFS intensive workloads where we can
only defer the work to kswapd.

Hence I think this is still a problematic approach as it doesn't
address the reason why deferred counts are increasing out of
control in the first place....
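To make the arithmetic above concrete, here is a minimal userspace
sketch. It assumes reclaim under the lightest pressure runs at
DEF_PRIORITY (12 in mm/vmscan.c). The helper scan_from_deferred() is a
made-up name for illustration, not a kernel function; it only models
the "total_scan = nr >> priority" line from the hunk.

```c
#include <assert.h>

/* Assumed to mirror DEF_PRIORITY in mm/vmscan.c (lightest pressure). */
#define DEF_PRIORITY 12

/*
 * Hypothetical helper, not a kernel function: the share of the deferred
 * count that "total_scan = nr >> priority" carries into this pass.
 */
static long scan_from_deferred(long nr_deferred, int priority)
{
	assert(nr_deferred >= 0);
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return nr_deferred >> priority;
}
```

At priority 12, any deferred count below 4096 shifts down to zero,
which is the ~4000 item threshold mentioned above. Only at priority 0
is the whole count carried over.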

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg
       [not found]   ` <20201215123802.GA379720@cmpxchg.org>
@ 2020-12-15 12:58     ` Kirill Tkhai
  0 siblings, 0 replies; 37+ messages in thread
From: Kirill Tkhai @ 2020-12-15 12:58 UTC (permalink / raw)
  To: Johannes Weiner, Yang Shi
  Cc: guro, ktkhai, shakeelb, david, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

15.12.2020, 15:40, "Johannes Weiner" <hannes@cmpxchg.org>:
> On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
>>  The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
>>  in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
>>  memcg->nodeinfo[nid]->shrinker_maps != NULL. This may occur because of processor reordering
>>  on !x86.
>>
>>  This seems like the below case:
>>
>>             CPU A CPU B
>>  store shrinker_map load CSS_ONLINE
>>  store CSS_ONLINE load shrinker_map
>>
>>  So the memory ordering could be guaranteed by smp_wmb()/smp_rmb() pair.
>>
>>  The memory barriers pair will guarantee the ordering between shrinker_deferred and CSS_ONLINE
>>  for the following patches as well.
>>
>>  Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> As per previous feedback, please move the misplaced shrinker
> allocation callback from .css_online to .css_alloc. This will get you
> the necessary ordering guarantees from the cgroup core code.


Can you read my emails from ktkhai@virtuozzo.com? I've already answered
this question here: https://lkml.org/lkml/2020/12/10/726

If not, check your spam folder and add my address to your allow-list.



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-15  2:09   ` Dave Chinner
@ 2020-12-15 13:53     ` Johannes Weiner
  2020-12-15 21:59       ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Johannes Weiner @ 2020-12-15 13:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Yang Shi, guro, ktkhai, shakeelb, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
> On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
> > exclusively, the read side can be protected by holding read lock, so it sounds
> > superfluous to have a dedicated mutex.
> 
> I'm not sure this is a good idea. This couples the shrinker
> infrastructure to internal details of how cgroups are initialised
> and managed. Sure, certain operations might be done in certain
> shrinker lock contexts, but that doesn't mean we should share global
> locks across otherwise independent subsystems....

They're not independent subsystems. Most of the memory controller is
an extension of core VM operations that is fairly difficult to
understand outside the context of those operations. Then there are a
limited number of entry points from the cgroup interface. We used to
have our own locks for core VM structures (private page lock e.g.) to
coordinate VM and cgroup, and that was mostly unintelligible.

We have since established that those two components coordinate with
native VM locking and lifetime management. If you need to lock the
page, you lock the page - instead of having all VM paths that already
hold the page lock acquire a nested lock to exclude one cgroup path.

In this case, we have auxiliary shrinker data, subject to shrinker
lifetime and exclusion rules. It's much easier to understand that
cgroup creation needs a stable shrinker list (shrinker_rwsem) to
manage this data, than having an aliased lock that is private to the
memcg callbacks and obscures this real interdependency.



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-14 22:37 ` [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
  2020-12-15  2:09   ` Dave Chinner
@ 2020-12-15 14:07   ` Johannes Weiner
  2020-12-15 20:32     ` Yang Shi
  1 sibling, 1 reply; 37+ messages in thread
From: Johannes Weiner @ 2020-12-15 14:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, david, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
> exclusively, the read side can be protected by holding read lock, so it sounds
> superfluous to have a dedicated mutex.  This should not exacerbate the contention
> to shrinker_rwsem since just one read side critical section is added.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks Yang, this is a step in the right direction. It would still be
nice to also drop memcg_shrinker_map_size and (trivially) derive that
value from shrinker_nr_max where necessary. It is duplicate state.



* Re: [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-15  2:22   ` Dave Chinner
@ 2020-12-15 14:45     ` Johannes Weiner
  2020-12-15 21:57       ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: Johannes Weiner @ 2020-12-15 14:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Yang Shi, guro, ktkhai, shakeelb, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Dec 15, 2020 at 01:22:33PM +1100, Dave Chinner wrote:
> On Mon, Dec 14, 2020 at 02:37:18PM -0800, Yang Shi wrote:
> > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> > 
> > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > may suffer from over shrink, excessive reclaim latency, etc.
> > 
> > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> > 
> > We observed this hit in our production environment which was running vfs heavy workload
> > shown as the below tracing log:
> > 
> > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > cache items 246404277 delta 31345 total_scan 123202138
> > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > last shrinker return val 123186855
> > 
> > The vfs cache and page cache ration was 10:1 on this machine, and half of caches were dropped.
> > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> > 
> > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > better isolation.
> > 
> > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> > 
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/memcontrol.h |   9 +++
> >  mm/memcontrol.c            | 110 ++++++++++++++++++++++++++++++++++++-
> >  mm/vmscan.c                |   4 ++
> >  3 files changed, 120 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 922a7f600465..1b343b268359 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -92,6 +92,13 @@ struct lruvec_stat {
> >  	long count[NR_VM_NODE_STAT_ITEMS];
> >  };
> >  
> > +
> > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > +struct memcg_shrinker_deferred {
> > +	struct rcu_head rcu;
> > +	atomic_long_t nr_deferred[];
> > +};
> 
> So you're effectively copy and pasting the memcg_shrinker_map
> infrastructure and doubling the number of allocations/frees required
> to set up/tear down a memcg? Why not add it to the struct
> memcg_shrinker_map like this:
> 
> struct memcg_shrinker_map {
>         struct rcu_head	rcu;
> 	unsigned long	*map;
> 	atomic_long_t	*nr_deferred;
> };
> 
> And when you dynamically allocate the structure, set the map and
> nr_deferred pointers to the correct offset in the allocated range.
> 
> Then this patch is really only changes to the size of the chunk
> being allocated, setting up the pointers and copying the relevant
> data from the old to new.

Fully agreed.
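
For reference, the single-allocation layout suggested above could look
roughly like the userspace sketch below. This is only an illustration
under assumptions: the names combined_info and combined_info_alloc()
are hypothetical, the rcu_head is left out, calloc() stands in for the
kernel allocator, and C11 atomic_long stands in for atomic_long_t.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdlib.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the combined structure (rcu_head omitted). */
struct combined_info {
	unsigned long *map;		/* one bit per shrinker id */
	atomic_long *nr_deferred;	/* one counter per shrinker id */
};

/* Allocate header, counters and bitmap as one chunk and wire up pointers. */
static struct combined_info *combined_info_alloc(size_t nr_shrinkers)
{
	size_t map_longs = (nr_shrinkers + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
	size_t size = sizeof(struct combined_info)
		    + nr_shrinkers * sizeof(atomic_long)
		    + map_longs * sizeof(unsigned long);
	struct combined_info *info = calloc(1, size);
	size_t i;

	if (!info)
		return NULL;

	/* Counters directly follow the header; the bitmap follows them. */
	info->nr_deferred = (atomic_long *)(info + 1);
	info->map = (unsigned long *)(info->nr_deferred + nr_shrinkers);

	for (i = 0; i < nr_shrinkers; i++)
		atomic_init(&info->nr_deferred[i], 0);

	return info;
}
```

With this layout, expanding the map is one allocation and one free per
memcg node instead of two, and copying old data into the new chunk
covers both the bitmap and the counters.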

In the longer-term, it may be nice to further expand this and make
this the generalized intersection between cgroup, node and shrinkers.

There is large overlap with list_lru e.g. - with data of identical
scope and lifetime, but duplicative callbacks and management. If we
folded list_lru_memcg into the above data structure, we could also
generalize and reuse the existing callbacks.



* Re: [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg
  2020-12-14 22:37 ` [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg Yang Shi
  2020-12-15  2:04   ` Dave Chinner
       [not found]   ` <20201215123802.GA379720@cmpxchg.org>
@ 2020-12-15 17:14   ` Johannes Weiner
       [not found]     ` <CAHbLzkrzv48S3ks-x8M=2sHxRS_+c-hLXdt4ScaWD6mC4ZFe8w@mail.gmail.com>
  2 siblings, 1 reply; 37+ messages in thread
From: Johannes Weiner @ 2020-12-15 17:14 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, david, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
> The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
> memcg->nodeinfo[nid]->shrinker_maps != NULL.  This may occur because of processor reordering
> on !x86.
> 
> This seems like the below case:
> 
>            CPU A          CPU B
> store shrinker_map      load CSS_ONLINE
> store CSS_ONLINE        load shrinker_map

But we have a separate check on shrinker_maps, so it doesn't matter
that it isn't guaranteed, no?

The only downside I can see is when CSS_ONLINE isn't visible yet and
we bail even though we'd be ready to shrink. Although it's probably
unlikely that there would be any objects allocated already...

Can somebody remind me why we check mem_cgroup_online() at all?

If shrinker_map is set, we can shrink: .css_alloc is guaranteed to be
complete, and by using RCU for the shrinker_map pointer, the map is
also guaranteed to be initialized. There is nothing else happening
during onlining that you may depend on.

If shrinker_map isn't set, we cannot iterate the bitmap. It does not
really matter whether CSS_ONLINE is reordered and visible already.

Agreed with Dave: if we need that synchronization around onlining, it
needs to happen inside the cgroup core. But I wouldn't add that until
somebody actually required it.
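
The pointer-publication argument above can be sketched in userspace
with C11 atomics. This is an analogy under assumptions, not the kernel
code: the release store plays the role of rcu_assign_pointer(), the
acquire load plays the role of rcu_dereference(), and the names
map_sketch, publish_map() and lookup_map() are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a shrinker map. */
struct map_sketch {
	unsigned long bits;
};

/* Stand-in for the per-node shrinker_map pointer; NULL until published. */
static _Atomic(struct map_sketch *) published_map;

/* Writer: initialise the map first, then publish it with release ordering. */
static void publish_map(struct map_sketch *m, unsigned long bits)
{
	assert(m != NULL);
	m->bits = bits;
	atomic_store_explicit(&published_map, m, memory_order_release);
}

/*
 * Reader: acquire load. A non-NULL result implies the initialisation
 * done before the release store is visible to this thread.
 */
static struct map_sketch *lookup_map(void)
{
	return atomic_load_explicit(&published_map, memory_order_acquire);
}
```

A reader that sees NULL simply skips the bitmap walk, and a reader that
sees a pointer can use the map, regardless of when a separate "online"
flag becomes visible.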



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-15 14:07   ` Johannes Weiner
@ 2020-12-15 20:32     ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-15 20:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Tue, Dec 15, 2020 at 6:10 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
> > exclusively, the read side can be protected by holding read lock, so it sounds
> > superfluous to have a dedicated mutex.  This should not exacerbate the contention
> > to shrinker_rwsem since just one read side critical section is added.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Thanks Yang, this is a step in the right direction. It would still be
> nice to also drop memcg_shrinker_map_size and (trivially) derive that
> value from shrinker_nr_max where necessary. It is duplicate state.

Thanks! I will take a further look at it.



* Re: [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-15 14:45     ` Johannes Weiner
@ 2020-12-15 21:57       ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-15 21:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Roman Gushchin, Kirill Tkhai, Shakeel Butt,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Tue, Dec 15, 2020 at 6:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Dec 15, 2020 at 01:22:33PM +1100, Dave Chinner wrote:
> > On Mon, Dec 14, 2020 at 02:37:18PM -0800, Yang Shi wrote:
> > > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> > >
> > > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > > may suffer from over shrink, excessive reclaim latency, etc.
> > >
> > > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> > >
> > > We observed this hit in our production environment which was running vfs heavy workload
> > > shown as the below tracing log:
> > >
> > > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > cache items 246404277 delta 31345 total_scan 123202138
> > > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > > last shrinker return val 123186855
> > >
> > > The vfs cache and page cache ration was 10:1 on this machine, and half of caches were dropped.
> > > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> > >
> > > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > > better isolation.
> > >
> > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  include/linux/memcontrol.h |   9 +++
> > >  mm/memcontrol.c            | 110 ++++++++++++++++++++++++++++++++++++-
> > >  mm/vmscan.c                |   4 ++
> > >  3 files changed, 120 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 922a7f600465..1b343b268359 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -92,6 +92,13 @@ struct lruvec_stat {
> > >     long count[NR_VM_NODE_STAT_ITEMS];
> > >  };
> > >
> > > +
> > > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > > +struct memcg_shrinker_deferred {
> > > +   struct rcu_head rcu;
> > > +   atomic_long_t nr_deferred[];
> > > +};
> >
> > So you're effectively copy and pasting the memcg_shrinker_map
> > infrastructure and doubling the number of allocations/frees required
> > to set up/tear down a memcg? Why not add it to the struct
> > memcg_shrinker_map like this:
> >
> > struct memcg_shrinker_map {
> >         struct rcu_head       rcu;
> >       unsigned long   *map;
> >       atomic_long_t   *nr_deferred;
> > };
> >
> > And when you dynamically allocate the structure, set the map and
> > nr_deferred pointers to the correct offset in the allocated range.
> >
> > Then this patch is really only changes to the size of the chunk
> > being allocated, setting up the pointers and copying the relevant
> > data from the old to new.
>
> Fully agreed.

Thanks folks. This idea was discussed with Roman in earlier
emails. I agree it would make the code neater. Will do it in v3.

>
> In the longer-term, it may be nice to further expand this and make
> this the generalized intersection between cgroup, node and shrinkers.
>
> There is large overlap with list_lru e.g. - with data of identical
> scope and lifetime, but duplicative callbacks and management. If we
> folded list_lru_memcg into the above data structure, we could also
> generalize and reuse the existing callbacks.

Yes, agree we should look further to combine and deduplicate all the
pieces for the long run.



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-15 13:53     ` Johannes Weiner
@ 2020-12-15 21:59       ` Dave Chinner
  2020-12-16 13:17         ` Kirill Tkhai
                           ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Dave Chinner @ 2020-12-15 21:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yang Shi, guro, ktkhai, shakeelb, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
> On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
> > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > > Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
> > > exclusively, the read side can be protected by holding read lock, so it sounds
> > > superfluous to have a dedicated mutex.
> > 
> > I'm not sure this is a good idea. This couples the shrinker
> > infrastructure to internal details of how cgroups are initialised
> > and managed. Sure, certain operations might be done in certain
> > shrinker lock contexts, but that doesn't mean we should share global
> > locks across otherwise independent subsystems....
> 
> They're not independent subsystems. Most of the memory controller is
> an extension of core VM operations that is fairly difficult to
> understand outside the context of those operations. Then there are a
> limited number of entry points from the cgroup interface. We used to
> have our own locks for core VM structures (private page lock e.g.) to
> coordinate VM and cgroup, and that was mostly unintelligble.

Yes, but OTOH you can CONFIG_MEMCG=n and the shrinker infrastructure
and shrinkers all still function correctly.  Ergo, the shrinker
infrastructure is independent of memcgs. Yes, it may have functions
to iterate and manipulate memcgs, but it is not dependent on memcgs
existing for correct behaviour and functionality.

Yet.

> We have since established that those two components coordinate with
> native VM locking and lifetime management. If you need to lock the
> page, you lock the page - instead of having all VM paths that already
> hold the page lock acquire a nested lock to exclude one cgroup path.
> 
> In this case, we have auxiliary shrinker data, subject to shrinker
> lifetime and exclusion rules. It's much easier to understand that
> cgroup creation needs a stable shrinker list (shrinker_rwsem) to
> manage this data, than having an aliased lock that is private to the
> memcg callbacks and obscures this real interdependency.

Ok, so the way to do this is to move all the stuff that needs to be
done under a "subsystem global" lock to the one file, not turn a
static lock into a globally visible lock and spray it around random
source files. There's already way too many static globals to manage
separate shrinker and memcg state..

I certainly agree that shrinkers and memcg need to be more closely
integrated.  I've only been saying that for ... well, since memcgs
essentially duplicated the top level shrinker path so the shrinker
map could be introduced to avoid calling shrinkers that have no work
to do for memcgs. The shrinker map should be generic functionality
for all shrinker invocations because even a non-memcg machine can
have thousands of registered shrinkers that are mostly idle all the
time.

IOWs, I think the shrinker map management is not really memcg
specific - it's just allocation and assignment of a structure, and
the only memcg bit is the map is being stored in a memcg structure.
Therefore, if we are looking towards tighter integration then we
should actually move the map management to the shrinker code, not
split the shrinker infrastructure management across different files.
There's already a heap of code in vmscan.c under #ifdef
CONFIG_MEMCG, like the prealloc_shrinker() code path:

prealloc_shrinker()				vmscan.c
  if (MEMCG_AWARE)				vmscan.c
    prealloc_memcg_shrinker			vmscan.c
#ifdef CONFIG_MEMCG				vmscan.c
      down_write(shrinker_rwsem)		vmscan.c
      if (id > shrinker_id_max)			vmscan.c
	memcg_expand_shrinker_maps		memcontrol.c
	  for_each_memcg			memcontrol.c
	    reallocate shrinker map		memcontrol.c
	    replace shrinker map		memcontrol.c
	shrinker_id_max = id			vmscan.c
      up_write(shrinker_rwsem)			vmscan.c
#endif

And, really, there's very little code in memcg_expand_shrinker_maps()
here - the only memcg part is the memcg iteration loop, and we
already have them in vmscan.c (e.g. shrink_node_memcgs(),
age_active_anon(), drop_slab_node()) so there's precedent for
moving this memcg iteration for shrinker map management all into
vmscan.c.

Doing so would formalise the shrinker maps as first class shrinker
infrastructure rather than being tacked on to the side of the memcg
infrastructure. At this point it makes total sense to serialise map
manipulations under the shrinker_rwsem.

IOWs, I'm not disagreeing with the direction this patch takes us in,
I'm disagreeing with the implementation as published in the patch
because it doesn't move us closer to a clean, concise single
shrinker infrastructure implementation.

That is, for the medium term, I think we should be getting rid of
the "legacy" non-memcg shrinker path and everything runs under
memcgs.  With this patchset moving all the deferred counts to be
memcg aware, the only reason for keeping the non-memcg path around
goes away.  If sc->memcg is null, then after this patch set we can
simply use the root memcg and just use its per-node accounting
rather than having a separate construct for non-memcg aware per-node
accounting.

Hence if SHRINKER_MEMCG_AWARE is set, it simply means we should run
the shrinker if sc->memcg is set.  There is no difference in setup
of shrinkers, the duplicate non-memcg/memcg paths go away, and a
heap of code drops out of the shrinker infrastructure. It becomes
much simpler overall.

It also means we have a path for further integrating memcg aware
shrinkers into the shrinker infrastructure because we can always
rely on the shrinker infrastructure being memcg aware. And with that
in mind, I think we should probably also be moving the shrinker code
out of vmscan.c into its own file as it's really completely
separate infrastructure from the vast majority of page reclaim
infrastructure in vmscan.c...

That's the view I'm looking at this patchset from. Not just as a
standalone bug fix, but also from the perspective of what the
architectural change implies and the directions for tighter
integration it opens up for us.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-15  2:46   ` Dave Chinner
@ 2020-12-15 22:27     ` Yang Shi
  2020-12-15 23:48       ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-15 22:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Dec 14, 2020 at 6:46 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Dec 14, 2020 at 02:37:19PM -0800, Yang Shi wrote:
> > Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> > will be used in the following cases:
> >     1. Non memcg aware shrinkers
> >     2. !CONFIG_MEMCG
> >     3. memcg is disabled by boot parameter
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> Lots of lines way over 80 columns.

I thought that limit had been raised to 100 columns.

>
> > ---
> >  mm/vmscan.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 83 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bf34167dd67e..bce8cf44eca2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -203,6 +203,12 @@ DECLARE_RWSEM(shrinker_rwsem);
> >  static DEFINE_IDR(shrinker_idr);
> >  static int shrinker_nr_max;
> >
> > +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> > +{
> > +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> > +             !mem_cgroup_disabled();
> > +}
>
> Why do we care if mem_cgroup_disabled() is disabled here? The return
> of this function is then && sc->memcg, so if memcgs are disabled,
> sc->memcg will never be set and this mem_cgroup_disabled() check is
> completely redundant, right?

Yes, correct. I missed this point.

>
> > +
> >  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  {
> >       int id, ret = -ENOMEM;
> > @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  #endif
> >       return false;
> >  }
> > +
> > +static inline long count_nr_deferred(struct shrinker *shrinker,
> > +                                  struct shrink_control *sc)
> > +{
> > +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> > +     struct memcg_shrinker_deferred *deferred;
> > +     struct mem_cgroup *memcg = sc->memcg;
> > +     int nid = sc->nid;
> > +     int id = shrinker->id;
> > +     long nr;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     if (per_memcg_deferred) {
> > +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> > +                                                  true);
> > +             nr = atomic_long_xchg(&deferred->nr_deferred[id], 0);
> > +     } else
> > +             nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +
> > +     return nr;
> > +}
> > +
> > +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> > +                                struct shrink_control *sc)
> > +{
> > +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> > +     struct memcg_shrinker_deferred *deferred;
> > +     struct mem_cgroup *memcg = sc->memcg;
> > +     int nid = sc->nid;
> > +     int id = shrinker->id;
>
> Oh, that's a nasty trap. Nobody knows if you mean "id" or "nid" in
> any of the code and a single letter type results in a bug.

Sure, will come up with more descriptive names. Maybe "nid" and "shrinker_id"?

>
> > +     long new_nr;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     if (per_memcg_deferred) {
> > +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> > +                                                  true);
> > +             new_nr = atomic_long_add_return(nr, &deferred->nr_deferred[id]);
> > +     } else
> > +             new_nr = atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> > +
> > +     return new_nr;
> > +}
> >  #else
> > +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> > +{
> > +     return false;
> > +}
> > +
> >  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  {
> >       return 0;
> > @@ -290,6 +347,29 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  {
> >       return true;
> >  }
> > +
> > +static inline long count_nr_deferred(struct shrinker *shrinker,
> > +                                  struct shrink_control *sc)
> > +{
> > +     int nid = sc->nid;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +}
> > +
> > +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> > +                                struct shrink_control *sc)
> > +{
> > +     int nid = sc->nid;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     return atomic_long_add_return(nr,
> > +                                   &shrinker->nr_deferred[nid]);
> > +}
> >  #endif
>
> This is pretty ... verbose. It doesn't need to be this complex at
> all, and you shouldn't be duplicating code in multiple places. There
> is also no need for any of these to be "inline" functions. The
> compiler will do that for static functions automatically if it makes
> sense.
>
> Ok, so you only do the memcg nr_deferred thing if NUMA_AWARE &&
> sc->memcg is true. so....
>
> static long shrink_slab_set_nr_deferred_memcg(...)
> {
>         int nid = sc->nid;
>
>         deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
>                                              true);
>         return atomic_long_add_return(nr, &deferred->nr_deferred[id]);
> }
>
> static long shrink_slab_set_nr_deferred(...)
> {
>         int nid = sc->nid;
>
>         if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>                 nid = 0;
>         else if (sc->memcg)
>                 return shrink_slab_set_nr_deferred_memcg(...., nid);
>
>         return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> }
>
> And now there's no duplicated code.

Thanks for the suggestion. Will incorporate in v3.
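
The deduplicated shape suggested above can be modelled in userspace as a
rough sketch. Everything prefixed sketch_/SKETCH_ is hypothetical and only
stands in for the kernel structures; C11 atomics replace atomic_long_t, the
array sizes are arbitrary, and RCU is not modelled. It uses the "nid" and
"shrinker_id" names proposed earlier in this reply, and it follows the
review sketch in taking the memcg path only when the shrinker is NUMA aware
and a memcg is set.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flag and sizes. */
#define SKETCH_NUMA_AWARE	(1u << 0)
#define SKETCH_MAX_NODES	2
#define SKETCH_MAX_SHRINKERS	4

/* Per-memcg deferred counts, indexed by node and then by shrinker id. */
struct sketch_memcg {
	atomic_long nr_deferred[SKETCH_MAX_NODES][SKETCH_MAX_SHRINKERS];
};

/* Per-shrinker deferred counts, indexed by node. */
struct sketch_shrinker {
	unsigned int flags;
	int shrinker_id;
	atomic_long nr_deferred[SKETCH_MAX_NODES];
};

struct sketch_control {
	int nid;
	struct sketch_memcg *memcg;	/* NULL for global reclaim */
};

/* memcg path: add to the memcg's counter and return the new value. */
static long sketch_add_deferred_memcg(long nr, const struct sketch_shrinker *s,
				      struct sketch_memcg *memcg, int nid)
{
	return atomic_fetch_add(&memcg->nr_deferred[nid][s->shrinker_id], nr) + nr;
}

/* Single entry point: choose the counter once, so no code is duplicated. */
static long sketch_add_deferred(long nr, struct sketch_shrinker *s,
				const struct sketch_control *sc)
{
	int nid = sc->nid;

	if (!(s->flags & SKETCH_NUMA_AWARE))
		nid = 0;
	else if (sc->memcg)
		return sketch_add_deferred_memcg(nr, s, sc->memcg, nid);

	return atomic_fetch_add(&s->nr_deferred[nid], nr) + nr;
}
```

With a memcg set and SKETCH_NUMA_AWARE, the count lands in the memcg's
array; otherwise it falls back to the shrinker's own per-node counter.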

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



* Re: [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2020-12-15  3:05   ` Dave Chinner
@ 2020-12-15 23:07     ` Yang Shi
  2020-12-18  0:56       ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-15 23:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Dec 14, 2020 at 7:05 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Dec 14, 2020 at 02:37:20PM -0800, Yang Shi wrote:
> > Now nr_deferred is available on per memcg level for memcg aware shrinkers, so don't need
> > allocate shrinker->nr_deferred for such shrinkers anymore.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 28 ++++++++++++++--------------
> >  1 file changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bce8cf44eca2..8d5bfd818acd 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -420,7 +420,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
> >   */
> >  int prealloc_shrinker(struct shrinker *shrinker)
> >  {
> > -     unsigned int size = sizeof(*shrinker->nr_deferred);
> > +     unsigned int size;
> > +
> > +     if (is_deferred_memcg_aware(shrinker)) {
> > +             if (prealloc_memcg_shrinker(shrinker))
> > +                     return -ENOMEM;
> > +             return 0;
> > +     }
> > +
> > +     size = sizeof(*shrinker->nr_deferred);
> >
> >       if (shrinker->flags & SHRINKER_NUMA_AWARE)
> >               size *= nr_node_ids;
> > @@ -429,26 +437,18 @@ int prealloc_shrinker(struct shrinker *shrinker)
> >       if (!shrinker->nr_deferred)
> >               return -ENOMEM;
> >
> > -     if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> > -             if (prealloc_memcg_shrinker(shrinker))
> > -                     goto free_deferred;
> > -     }
> > -
> >       return 0;
> > -
> > -free_deferred:
> > -     kfree(shrinker->nr_deferred);
> > -     shrinker->nr_deferred = NULL;
> > -     return -ENOMEM;
> >  }
>
> I'm trying to put my finger on it, but this seems wrong to me. If
> memcgs are disabled, then prealloc_memcg_shrinker() needs to fail.
> The preallocation code should not care about internal memcg details
> like this.
>
>         /*
>          * If the shrinker is memcg aware and memcgs are not
>          * enabled, clear the MEMCG flag and fall back to non-memcg
>          * behaviour for the shrinker.
>          */
>         if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
>                 error = prealloc_memcg_shrinker(shrinker);
>                 if (!error)
>                         return 0;
>                 if (error != -ENOSYS)
>                         return error;
>
>                 /* memcgs not enabled! */
>                 shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
>         }
>
>         size = sizeof(*shrinker->nr_deferred);
>         ....
>         return 0;
> }
>
> This guarantees that only the shrinker instances that have a
> correctly set up memcg attached to them will have the
> SHRINKER_MEMCG_AWARE flag set. Hence in all the rest of the shrinker
> code, we only ever need to check for SHRINKER_MEMCG_AWARE to
> determine what we should do....

Thanks. I see your point. We could move the memcg specific details
into prealloc_memcg_shrinker().

It seems we have to acquire shrinker_rwsem before we check and modify
the SHRINKER_MEMCG_AWARE bit if we may clear it.
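
A minimal userspace sketch of the fallback flow suggested above, under
stated assumptions: the sketch_/SKETCH_ names are hypothetical,
memcg_enabled is a plain parameter standing in for the real "memcgs are
enabled" check, calloc() stands in for the kernel allocator, and the
shrinker_rwsem locking raised in this reply is not modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_MEMCG_AWARE	(1u << 1)	/* hypothetical flag value */

struct sketch_shrinker {
	unsigned int flags;
	long *nr_deferred;	/* only allocated for non-memcg shrinkers */
};

/* Stand-in for prealloc_memcg_shrinker(): fails with -ENOSYS when
 * memcgs are not enabled, so the caller needs no memcg internals. */
static int sketch_prealloc_memcg(struct sketch_shrinker *s, bool memcg_enabled)
{
	(void)s;
	return memcg_enabled ? 0 : -ENOSYS;
}

static int sketch_prealloc(struct sketch_shrinker *s, bool memcg_enabled,
			   size_t nr_nodes)
{
	if (s->flags & SKETCH_MEMCG_AWARE) {
		int error = sketch_prealloc_memcg(s, memcg_enabled);

		if (!error)
			return 0;
		if (error != -ENOSYS)
			return error;

		/* memcgs not enabled: fall back to non-memcg behaviour */
		s->flags &= ~SKETCH_MEMCG_AWARE;
	}

	s->nr_deferred = calloc(nr_nodes, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -ENOMEM;
}

/* After prealloc, the flag alone says which resources are held. */
static void sketch_free(struct sketch_shrinker *s)
{
	if (s->flags & SKETCH_MEMCG_AWARE)
		return;	/* memcg side would be unregistered here */

	free(s->nr_deferred);
	s->nr_deferred = NULL;
}
```

The point of the design is that once prealloc has run, the rest of the
code only needs to test the one flag.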

>
> >  void free_prealloced_shrinker(struct shrinker *shrinker)
> >  {
> > -     if (!shrinker->nr_deferred)
> > +     if (is_deferred_memcg_aware(shrinker)) {
> > +             unregister_memcg_shrinker(shrinker);
> >               return;
> > +     }
> >
> > -     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > -             unregister_memcg_shrinker(shrinker);
> > +     if (!shrinker->nr_deferred)
> > +             return;
> >
> >       kfree(shrinker->nr_deferred);
> >       shrinker->nr_deferred = NULL;
>
> e.g. then this function can simply do:
>
> {
>         if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>                 return unregister_memcg_shrinker(shrinker);
>         kfree(shrinker->nr_deferred);
>         shrinker->nr_deferred = NULL;
> }
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



* Re: [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline
  2020-12-15  3:07   ` Dave Chinner
@ 2020-12-15 23:10     ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-15 23:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Dec 14, 2020 at 7:08 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Dec 14, 2020 at 02:37:21PM -0800, Yang Shi wrote:
> > Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add to parent's
> > corresponding nr_deferred when memcg offline.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/shrinker.h |  4 ++++
> >  mm/memcontrol.c          | 24 ++++++++++++++++++++++++
> >  mm/vmscan.c              |  2 +-
> >  3 files changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 1eac79ce57d4..85cfc910dde4 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -78,6 +78,10 @@ struct shrinker {
> >  };
> >  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> >
> > +#ifdef CONFIG_MEMCG
> > +extern int shrinker_nr_max;
> > +#endif
> > +
> >  /* Flags */
> >  #define SHRINKER_REGISTERED  (1 << 0)
> >  #define SHRINKER_NUMA_AWARE  (1 << 1)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 321d1818ce3d..1f191a15bee1 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -59,6 +59,7 @@
> >  #include <linux/tracehook.h>
> >  #include <linux/psi.h>
> >  #include <linux/seq_buf.h>
> > +#include <linux/shrinker.h>
> >  #include "internal.h"
> >  #include <net/sock.h>
> >  #include <net/ip.h>
> > @@ -612,6 +613,28 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> >       }
> >  }
> >
> > +static void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
> > +{
> > +     int i, nid;
> > +     long nr;
> > +     struct mem_cgroup *parent;
> > +     struct memcg_shrinker_deferred *child_deferred, *parent_deferred;
> > +
> > +     parent = parent_mem_cgroup(memcg);
> > +     if (!parent)
> > +             parent = root_mem_cgroup;
> > +
> > +     for_each_node(nid) {
> > +             child_deferred = memcg->nodeinfo[nid]->shrinker_deferred;
> > +             parent_deferred = parent->nodeinfo[nid]->shrinker_deferred;
> > +             for (i = 0; i < shrinker_nr_max; i ++) {
> > +                     nr = atomic_long_read(&child_deferred->nr_deferred[i]);
> > +                     atomic_long_add(nr,
> > +                             &parent_deferred->nr_deferred[i]);
> > +             }
> > +     }
> > +}
>
> I would place this function in vmscan.c alongside the
> shrink_slab_set_nr_deferred_memcg() function so that all the
> accounting is in the one place.

Fine with me. Will incorporate in v3.
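
For reference, the reparenting loop in the patch can be modelled in
userspace as below. This is only a sketch: the sketch_/SKETCH_ types and
sizes are hypothetical, C11 atomics stand in for atomic_long_t, and
neither RCU nor any locking is modelled. Like the patch, it reads the
child's count and adds it to the parent without zeroing the child.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NODES		2	/* hypothetical node count */
#define SKETCH_SHRINKERS	3	/* stands in for shrinker_nr_max */

struct sketch_deferred {
	atomic_long nr_deferred[SKETCH_SHRINKERS];
};

struct sketch_memcg {
	struct sketch_memcg *parent;	/* NULL when there is no parent */
	struct sketch_deferred node[SKETCH_NODES];
};

/* Fold every per-node, per-shrinker deferred count of an offlined
 * memcg into its parent, or into root when there is no parent. */
static void sketch_reparent_deferred(struct sketch_memcg *memcg,
				     struct sketch_memcg *root)
{
	struct sketch_memcg *parent = memcg->parent ? memcg->parent : root;
	int nid, i;

	for (nid = 0; nid < SKETCH_NODES; nid++) {
		for (i = 0; i < SKETCH_SHRINKERS; i++) {
			long nr = atomic_load(&memcg->node[nid].nr_deferred[i]);

			atomic_fetch_add(&parent->node[nid].nr_deferred[i], nr);
		}
	}
}
```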

>
> > +
> >  /**
> >   * mem_cgroup_css_from_page - css of the memcg associated with a page
> >   * @page: page of interest
> > @@ -5543,6 +5566,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >       page_counter_set_low(&memcg->memory, 0);
> >
> >       memcg_offline_kmem(memcg);
> > +     memcg_reparent_shrinker_deferred(memcg);
> >       wb_memcg_offline(memcg);
> >
> >       drain_all_stock(memcg);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8d5bfd818acd..693a41e89969 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -201,7 +201,7 @@ DECLARE_RWSEM(shrinker_rwsem);
> >  #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
> >
> >  static DEFINE_IDR(shrinker_idr);
> > -static int shrinker_nr_max;
> > +int shrinker_nr_max;
>
> Then we don't need to make yet another variable global...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



* Re: [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-15 22:27     ` Yang Shi
@ 2020-12-15 23:48       ` Dave Chinner
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2020-12-15 23:48 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Tue, Dec 15, 2020 at 02:27:18PM -0800, Yang Shi wrote:
> On Mon, Dec 14, 2020 at 6:46 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Dec 14, 2020 at 02:37:19PM -0800, Yang Shi wrote:
> > > Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> > > will be used in the following cases:
> > >     1. Non memcg aware shrinkers
> > >     2. !CONFIG_MEMCG
> > >     3. memcg is disabled by boot parameter
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> >
> > Lots of lines way over 80 columns.
> 
> I thought that has been lifted to 100 columns.

Documentation/process/coding-style.rst still says:

"The preferred limit on the length of a single line is 80 columns."

checkpatch might not warn about > 80 columns anymore, but if the
file you are modifying is almost entirely 80 columns in width, then
by default changes to that file should also stay within 80 columns.

I mostly consider using checkpatch to enforce coding styles to be
harmful....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority
  2020-12-15  3:23   ` Dave Chinner
@ 2020-12-15 23:59     ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-15 23:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Dec 14, 2020 at 7:23 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Dec 14, 2020 at 02:37:22PM -0800, Yang Shi wrote:
> > The number of deferred objects might get windup to an absurd number, and it results in
> > clamp of slab objects.  It is undesirable for sustaining workingset.
> >
> > So shrink deferred objects proportional to priority and cap nr_deferred to twice of
> > cache items.
>
> This completely changes the work accrual algorithm without any
> explanation of how it works, what the theory behind the algorithm
> is, what the work accrual ramp up and damp down curve looks like,
> what workloads it is designed to benefit, how it affects page
> cache vs slab cache balance and system performance, what OOM stress
> testing has been done to ensure pure slab cache pressure workloads
> don't easily trigger OOM kills, etc.

Actually this patch does two things:
1. Scale nr_deferred by priority.
2. Cap nr_deferred at twice freeable.

The idea is borrowed from your patch:
https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/,
the difference is that your patch restricts the change to kswapd
only, but mine extends it to direct reclaim and limit reclaim.

>
> You're going to need a lot more supporting evidence that this is a
> well thought out algorithm that doesn't obviously introduce
> regressions. The current code might fall down in one corner case,
> but there are an awful lot of corner cases where it does work.
> Please provide some evidence that it not only works in your corner
> case, but also doesn't introduce regressions for other slab cache
> intensive and mixed cache intensive workloads...

I agree the change may cause some workloads to regress out of the blue. I
tested with kernel build and vfs metadata heavy workloads, and I wish I
could cover more. But I'm not a filesystem developer; do you have any
typical workloads that I could run to see if they regress?

>
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 40 +++++-----------------------------------
> >  1 file changed, 5 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 693a41e89969..58f4a383f0df 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -525,7 +525,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >        */
> >       nr = count_nr_deferred(shrinker, shrinkctl);
> >
> > -     total_scan = nr;
> >       if (shrinker->seeks) {
> >               delta = freeable >> priority;
> >               delta *= 4;
> > @@ -539,37 +538,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >               delta = freeable / 2;
> >       }
> >
> > +     total_scan = nr >> priority;
>
> When there is low memory pressure, this will throw away a large
> amount of the work that is deferred. If we are not deferring in
> amounts larger than ~4000 items, every pass through this code will
> zero the deferred work.
>
> Hence when we do get substantial pressure, that deferred work is no
> longer being tracked. While it may help your specific corner case,
> it's likely to significantly change the reclaim balance of slab
> caches, especially under GFP_NOFS intensive workloads where we can
> only defer the work to kswapd.
>
> Hence I think this is still a problematic approach as it doesn't
> address the reason why deferred counts are increasing out of
> control in the first place....

For our workload the deferred counts are mainly contributed by
multiple memcgs' limit reclaim, per my analysis. So the most crucial
step is to make nr_deferred memcg aware so that the auxiliary memcgs
won't interfere with the main workload.

If the testing would take too long, I'd prefer to drop this patch for now
since it is not that critical to our workload. I really hope to get the
memcg aware nr_deferred part upstream soon.
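
The arithmetic behind the "~4000 items" concern quoted above can be
sketched as follows, assuming reclaim starts at the kernel's DEF_PRIORITY
of 12 (lowest pressure). The helper name is hypothetical and models only
the single line "total_scan = nr >> priority" from the patch, not the
rest of do_shrink_slab().

```c
/* Illustrative model of one line of the patch, nothing more. */
#define SKETCH_DEF_PRIORITY	12	/* mirrors the kernel's DEF_PRIORITY */

/* Work taken from the deferred pool in one pass at the given reclaim
 * priority. A numerically lower priority means higher pressure. */
static long long sketch_scan_from_deferred(long long nr_deferred, int priority)
{
	return nr_deferred >> priority;
}
```

At priority 12 any deferred count below 4096 (1 << 12) shifts down to
zero, which is where the ~4000 figure comes from; only as the priority
number drops under sustained pressure does a small deferred count turn
into scan work.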

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-15 21:59       ` Dave Chinner
@ 2020-12-16 13:17         ` Kirill Tkhai
  2020-12-16 19:12         ` Johannes Weiner
  2020-12-16 19:39         ` Roman Gushchin
  2 siblings, 0 replies; 37+ messages in thread
From: Kirill Tkhai @ 2020-12-16 13:17 UTC (permalink / raw)
  To: Dave Chinner, Johannes Weiner
  Cc: Yang Shi, guro, ktkhai, shakeelb, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

16.12.2020, 00:59, "Dave Chinner" <david@fromorbit.com>:
> On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
>>  On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
>>  > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
>>  > > Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
>>  > > exclusively, the read side can be protected by holding read lock, so it sounds
>>  > > superfluous to have a dedicated mutex.
>>  >
>>  > I'm not sure this is a good idea. This couples the shrinker
>>  > infrastructure to internal details of how cgroups are initialised
>>  > and managed. Sure, certain operations might be done in certain
>>  > shrinker lock contexts, but that doesn't mean we should share global
>>  > locks across otherwise independent subsystems....
>>
>>  They're not independent subsystems. Most of the memory controller is
>>  an extension of core VM operations that is fairly difficult to
>>  understand outside the context of those operations. Then there are a
>>  limited number of entry points from the cgroup interface. We used to
>>  have our own locks for core VM structures (private page lock e.g.) to
>>  coordinate VM and cgroup, and that was mostly unintelligible.
>
> Yes, but OTOH you can CONFIG_MEMCG=n and the shrinker infrastructure
> and shrinkers all still function correctly. Ergo, the shrinker
> infrastructure is independent of memcgs. Yes, it may have functions
> to iterate and manipulate memcgs, but it is not dependent on memcgs
> existing for correct behaviour and functionality.
>
> Yet.
>
>>  We have since established that those two components coordinate with
>>  native VM locking and lifetime management. If you need to lock the
>>  page, you lock the page - instead of having all VM paths that already
>>  hold the page lock acquire a nested lock to exclude one cgroup path.
>>
>>  In this case, we have auxiliary shrinker data, subject to shrinker
>>  lifetime and exclusion rules. It's much easier to understand that
>>  cgroup creation needs a stable shrinker list (shrinker_rwsem) to
>>  manage this data, than having an aliased lock that is private to the
>>  memcg callbacks and obscures this real interdependency.
>
> Ok, so the way to do this is to move all the stuff that needs to be
> done under a "subsystem global" lock to the one file, not turn a
> static lock into a globally visible lock and spray it around random
> source files. There's already way too many static globals to manage
> separate shrinker and memcg state..
>
> I certainly agree that shrinkers and memcg need to be more closely
> integrated. I've only been saying that for ... well, since memcgs
> essentially duplicated the top level shrinker path so the shrinker
> map could be introduced to avoid calling shrinkers that have no work
> to do for memcgs. The shrinker map should be generic functionality
> for all shrinker invocations because even a non-memcg machine can
> have thousands of registered shrinkers that are mostly idle all the
> time.
>
> IOWs, I think the shrinker map management is not really memcg
> specific - it's just allocation and assignment of a structure, and
> the only memcg bit is the map is being stored in a memcg structure.
> Therefore, if we are looking towards tighter integration then we
> should actually move the map management to the shrinker code, not
> split the shrinker infrastructure management across different files.
> There's already a heap of code in vmscan.c under #ifdef
> CONFIG_MEMCG, like the prealloc_shrinker() code path:
>
> prealloc_shrinker()                         vmscan.c
>   if (MEMCG_AWARE)                          vmscan.c
>     prealloc_memcg_shrinker                 vmscan.c
> #ifdef CONFIG_MEMCG                         vmscan.c
>       down_write(shrinker_rwsem)            vmscan.c
>       if (id > shrinker_id_max)             vmscan.c
>         memcg_expand_shrinker_maps          memcontrol.c
>           for_each_memcg                    memcontrol.c
>             reallocate shrinker map         memcontrol.c
>             replace shrinker map            memcontrol.c
>         shrinker_id_max = id                vmscan.c
>       up_write(shrinker_rwsem)              vmscan.c
> #endif
>
> And, really, there's very little code in memcg_expand_shrinker_maps()
> here - the only memcg part is the memcg iteration loop, and we
> already have them in vmscan.c (e.g. shrink_node_memcgs(),
> age_active_anon(), drop_slab_node()) so there's precedent for
> moving this memcg iteration for shrinker map management all into
> vmscan.c.
>
> Doing so would formalise the shrinker maps as first class shrinker
> infrastructure rather than being tacked on to the side of the memcg
> infrastructure. At this point it makes total sense to serialise map
> manipulations under the shrinker_rwsem.
>
> IOWs, I'm not disagreeing with the direction this patch takes us in,
> I'm disagreeing with the implementation as published in the patch
> because it doesn't move us closer to a clean, concise single
> shrinker infrastructure implementation.
>
> That is, for the medium term, I think we should be getting rid of
> the "legacy" non-memcg shrinker path and everything runs under
> memcgs. With this patchset moving all the deferred counts to be
> memcg aware, the only reason for keeping the non-memcg path around
> goes away. If sc->memcg is null, then after this patch set we can
> simply use the root memcg and just use its per-node accounting
> rather than having a separate construct for non-memcg aware per-node
> accounting.

Killing the "sc->memcg == NULL" cases looks like a great idea. This is
equivalent to making "memory_cgrp_subsys.early_init = 1" possible, with all
the requirements that implies, which is a topic for a big separate patchset.

> Hence if SHRINKER_MEMCG_AWARE is set, it simply means we should run
> the shrinker if sc->memcg is set. There is no difference in setup
> of shrinkers, the duplicate non-memcg/memcg paths go away, and a
> heap of code drops out of the shrinker infrastructure. It becomes
> much simpler overall.
>
> It also means we have a path for further integrating memcg aware
> shrinkers into the shrinker infrastructure because we can always
> rely on the shrinker infrastructure being memcg aware. And with that
> in mind, I think we should probably also be moving the shrinker code
> out of vmscan.c into its own file as it's really completely
> separate infrastructure from the vast majority of page reclaim
> infrastructure in vmscan.c...
>
> That's the view I'm looking at this patchset from. Not just as a
> standalone bug fix, but also from the perspective of what the
> architectural change implies and the directions for tighter
> integration it opens up for us.
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
>
> .



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-15 21:59       ` Dave Chinner
  2020-12-16 13:17         ` Kirill Tkhai
@ 2020-12-16 19:12         ` Johannes Weiner
  2020-12-16 21:56           ` Yang Shi
  2020-12-16 19:39         ` Roman Gushchin
  2 siblings, 1 reply; 37+ messages in thread
From: Johannes Weiner @ 2020-12-16 19:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Yang Shi, guro, ktkhai, shakeelb, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Dec 16, 2020 at 08:59:38AM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
> > On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
> > > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > > > Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
> > > > exclusively, the read side can be protected by holding read lock, so it sounds
> > > > superfluous to have a dedicated mutex.
> > > 
> > > I'm not sure this is a good idea. This couples the shrinker
> > > infrastructure to internal details of how cgroups are initialised
> > > and managed. Sure, certain operations might be done in certain
> > > shrinker lock contexts, but that doesn't mean we should share global
> > > locks across otherwise independent subsystems....
> > 
> > They're not independent subsystems. Most of the memory controller is
> > an extension of core VM operations that is fairly difficult to
> > understand outside the context of those operations. Then there are a
> > limited number of entry points from the cgroup interface. We used to
> > have our own locks for core VM structures (private page lock e.g.) to
> > coordinate VM and cgroup, and that was mostly unintelligible.
> 
> Yes, but OTOH you can CONFIG_MEMCG=n and the shrinker infrastructure
> and shrinkers all still function correctly.  Ergo, the shrinker
> infrastructure is independent of memcgs. Yes, it may have functions
> to iterate and manipulate memcgs, but it is not dependent on memcgs
> existing for correct behaviour and functionality.

Okay, but now do it the other way round and explain the memcg bits in
a world where shrinkers don't exist ;-)

Anyway, we seem to be mostly in agreement below.

> > We have since established that those two components coordinate with
> > native VM locking and lifetime management. If you need to lock the
> > page, you lock the page - instead of having all VM paths that already
> > hold the page lock acquire a nested lock to exclude one cgroup path.
> > 
> > In this case, we have auxiliary shrinker data, subject to shrinker
> > lifetime and exclusion rules. It's much easier to understand that
> > cgroup creation needs a stable shrinker list (shrinker_rwsem) to
> > manage this data, than having an aliased lock that is private to the
> > memcg callbacks and obscures this real interdependency.
> 
> Ok, so the way to do this is to move all the stuff that needs to be
> done under a "subsystem global" lock to the one file, not turn a
> static lock into a globally visible lock and spray it around random
> source files.

Sure, that works as well.

> The shrinker map should be generic functionality for all shrinker
> invocations because even a non-memcg machine can have thousands of
> registered shrinkers that are mostly idle all the time.

Agreed.

> IOWs, I think the shrinker map management is not really memcg
> specific - it's just allocation and assignment of a structure, and
> the only memcg bit is the map is being stored in a memcg structure.
> Therefore, if we are looking towards tighter integration then we
> should actually move the map management to the shrinker code, not
> split the shrinker infrastructure management across different files.
> There's already a heap of code in vmscan.c under #ifdef
> CONFIG_MEMCG, like the prealloc_shrinker() code path:
> 
> prealloc_shrinker()				vmscan.c
>   if (MEMCG_AWARE)				vmscan.c
>     prealloc_memcg_shrinker			vmscan.c
> #ifdef CONFIG_MEMCG				vmscan.c
>       down_write(shrinker_rwsem)		vmscan.c
>       if (id > shrinker_id_max)			vmscan.c
> 	memcg_expand_shrinker_maps		memcontrol.c
> 	  for_each_memcg			memcontrol.c
> 	    reallocate shrinker map		memcontrol.c
> 	    replace shrinker map		memcontrol.c
> 	shrinker_id_max = id			vmscan.c
>       up_write(shrinker_rwsem)		vmscan.c
> #endif
> 
> And, really, there's very little code in memcg_expand_shrinker_maps()
> here - the only memcg part is the memcg iteration loop, and we
> already have them in vmscan.c (e.g. shrink_node_memcgs(),
> age_active_anon(), drop_slab_node()) so there's precedent for
> moving this memcg iteration for shrinker map management all into
> vmscan.c.
>
> Doing so would formalise the shrinker maps as first class shrinker
> infrastructure rather than being tacked on to the side of the memcg
> infrastructure. At this point it makes total sense to serialise map
> manipulations under the shrinker_rwsem.

Yes, that's a great idea.

> That is, for the medium term, I think  we should be getting rid of
> the "legacy" non-memcg shrinker path and everything runs under
> memcgs.  With this patchset moving all the deferred counts to be
> memcg aware, the only reason for keeping the non-memcg path around
> goes away.  If sc->memcg is null, then after this patch set we can
> simply use the root memcg and just use its per-node accounting
> rather than having a separate construct for non-memcg aware per-node
> accounting.
> 
> Hence if SHRINKER_MEMCG_AWARE is set, it simply means we should run
> the shrinker if sc->memcg is set.  There is no difference in setup
> of shrinkers, the duplicate non-memcg/memcg paths go away, and a
> heap of code drops out of the shrinker infrastructure. It becomes
> much simpler overall.

Agreed as well.

> It also means we have a path for further integrating memcg aware
> shrinkers into the shrinker infrastructure because we can always
> rely on the shrinker infrastructure being memcg aware. And with that
> in mind, I think we should probably also be moving the shrinker code
> out of vmscan.c into its own file as it's really completely
> separate infrastructure from the vast majority of page reclaim
> infrastructure in vmscan.c...

Right again.

> That's the view I'm looking at this patchset from. Not just as a
> standalone bug fix, but also from the perspective of what the
> architectural change implies and the directions for tighter
> integration it opens up for us.

Makes sense, but I'm not sure it's getting in the way of that: a
generalized first-class map would be managed under the shrinker_rwsem,
so ditching the private lock is good progress. The widened lock scope
(temporarily, and still mm/) is easy to reverse later on.

That said, moving the map handling code from memcontrol.c to vmscan.c
in preparation, and/or even reworking the shrinker around the concept
of a memcg, indeed are great ideas.

I'd support patches doing that.



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-15 21:59       ` Dave Chinner
  2020-12-16 13:17         ` Kirill Tkhai
  2020-12-16 19:12         ` Johannes Weiner
@ 2020-12-16 19:39         ` Roman Gushchin
  2 siblings, 0 replies; 37+ messages in thread
From: Roman Gushchin @ 2020-12-16 19:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Yang Shi, ktkhai, shakeelb, mhocko, akpm,
	linux-mm, linux-fsdevel, linux-kernel

On Wed, Dec 16, 2020 at 08:59:38AM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
> > On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
> > > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > > > Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
> > > > exclusively, the read side can be protected by holding read lock, so it sounds
> > > > superfluous to have a dedicated mutex.
> > > 
> > > I'm not sure this is a good idea. This couples the shrinker
> > > infrastructure to internal details of how cgroups are initialised
> > > and managed. Sure, certain operations might be done in certain
> > > shrinker lock contexts, but that doesn't mean we should share global
> > > locks across otherwise independent subsystems....
> > 
> > They're not independent subsystems. Most of the memory controller is
> > an extension of core VM operations that is fairly difficult to
> > understand outside the context of those operations. Then there are a
> > limited number of entry points from the cgroup interface. We used to
> > have our own locks for core VM structures (private page lock e.g.) to
> > coordinate VM and cgroup, and that was mostly unintelligible.
> 
> Yes, but OTOH you can CONFIG_MEMCG=n and the shrinker infrastructure
> and shrinkers all still function correctly.  Ergo, the shrinker
> infrastructure is independent of memcgs. Yes, it may have functions
> to iterate and manipulate memcgs, but it is not dependent on memcgs
> existing for correct behaviour and functionality.
> 
> Yet.
> 
> > We have since established that those two components coordinate with
> > native VM locking and lifetime management. If you need to lock the
> > page, you lock the page - instead of having all VM paths that already
> > hold the page lock acquire a nested lock to exclude one cgroup path.
> > 
> > In this case, we have auxiliary shrinker data, subject to shrinker
> > lifetime and exclusion rules. It's much easier to understand that
> > cgroup creation needs a stable shrinker list (shrinker_rwsem) to
> > manage this data, than having an aliased lock that is private to the
> > memcg callbacks and obscures this real interdependency.
> 
> Ok, so the way to do this is to move all the stuff that needs to be
> done under a "subsystem global" lock to the one file, not turn a
> static lock into a globally visible lock and spray it around random
> source files. There's already way too many static globals to manage
> separate shrinker and memcg state..
> 
> I certainly agree that shrinkers and memcg need to be more closely
> integrated.  I've only been saying that for ... well, since memcgs
> essentially duplicated the top level shrinker path so the shrinker
> map could be introduced to avoid calling shrinkers that have no work
> to do for memcgs. The shrinker map should be generic functionality
> for all shrinker invocations because even a non-memcg machine can
> have thousands of registered shrinkers that are mostly idle all the
> time.
> 
> IOWs, I think the shrinker map management is not really memcg
> specific - it's just allocation and assignment of a structure, and
> the only memcg bit is the map is being stored in a memcg structure.
> Therefore, if we are looking towards tighter integration then we
> should actually move the map management to the shrinker code, not
> split the shrinker infrastructure management across different files.
> There's already a heap of code in vmscan.c under #ifdef
> CONFIG_MEMCG, like the prealloc_shrinker() code path:
> 
> prealloc_shrinker()				vmscan.c
>   if (MEMCG_AWARE)				vmscan.c
>     prealloc_memcg_shrinker			vmscan.c
> #ifdef CONFIG_MEMCG				vmscan.c
>       down_write(shrinker_rwsem)		vmscan.c
>       if (id > shrinker_id_max)			vmscan.c
> 	memcg_expand_shrinker_maps		memcontrol.c
> 	  for_each_memcg			memcontrol.c
> 	    reallocate shrinker map		memcontrol.c
> 	    replace shrinker map		memcontrol.c
> 	shrinker_id_max = id			vmscan.c
>       up_write(shrinker_rwsem)		vmscan.c
> #endif
> 
> And, really, there's very little code in memcg_expand_shrinker_maps()
> here - the only memcg part is the memcg iteration loop, and we
> already have them in vmscan.c (e.g. shrink_node_memcgs(),
> age_active_anon(), drop_slab_node()) so there's precedent for
> moving this memcg iteration for shrinker map management all into
> vmscan.c.
> 
> Doing so would formalise the shrinker maps as first class shrinker
> infrastructure rather than being tacked on to the side of the memcg
> infrastructure. At this point it makes total sense to serialise map
> manipulations under the shrinker_rwsem.
> 
> IOWs, I'm not disagreeing with the direction this patch takes us in,
> I'm disagreeing with the implementation as published in the patch
> because it doesn't move us closer to a clean, concise single
> shrinker infrastructure implementation.
> 
> That is, for the medium term, I think  we should be getting rid of
> the "legacy" non-memcg shrinker path and everything runs under
> memcgs.  With this patchset moving all the deferred counts to be
> memcg aware, the only reason for keeping the non-memcg path around
> goes away.  If sc->memcg is null, then after this patch set we can
> simply use the root memcg and just use its per-node accounting
> rather than having a separate construct for non-memcg aware per-node
> accounting.
> 
> Hence if SHRINKER_MEMCG_AWARE is set, it simply means we should run
> the shrinker if sc->memcg is set.  There is no difference in setup
> of shrinkers, the duplicate non-memcg/memcg paths go away, and a
> heap of code drops out of the shrinker infrastructure. It becomes
> much simpler overall.
> 
> It also means we have a path for further integrating memcg aware
> shrinkers into the shrinker infrastructure because we can always
> rely on the shrinker infrastructure being memcg aware. And with that
> in mind, I think we should probably also be moving the shrinker code
> out of vmscan.c into its own file as it's really completely
> separate infrastructure from the vast majority of page reclaim
> infrastructure in vmscan.c...
> 
> That's the view I'm looking at this patchset from. Not just as a
> standalone bug fix, but also from the perspective of what the
> architectural change implies and the directions for tighter
> integration it opens up for us.

I like the plan too.

Thanks!



* Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
  2020-12-16 19:12         ` Johannes Weiner
@ 2020-12-16 21:56           ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-16 21:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Roman Gushchin, Kirill Tkhai, Shakeel Butt,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 16, 2020 at 11:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Dec 16, 2020 at 08:59:38AM +1100, Dave Chinner wrote:
> > On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
> > > On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
> > > > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > > > > > Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
> > > > > exclusively, the read side can be protected by holding read lock, so it sounds
> > > > > superfluous to have a dedicated mutex.
> > > >
> > > > I'm not sure this is a good idea. This couples the shrinker
> > > > infrastructure to internal details of how cgroups are initialised
> > > > and managed. Sure, certain operations might be done in certain
> > > > shrinker lock contexts, but that doesn't mean we should share global
> > > > locks across otherwise independent subsystems....
> > >
> > > They're not independent subsystems. Most of the memory controller is
> > > an extension of core VM operations that is fairly difficult to
> > > understand outside the context of those operations. Then there are a
> > > limited number of entry points from the cgroup interface. We used to
> > > have our own locks for core VM structures (private page lock e.g.) to
> > > > coordinate VM and cgroup, and that was mostly unintelligible.
> >
> > Yes, but OTOH you can CONFIG_MEMCG=n and the shrinker infrastructure
> > > and shrinkers all still function correctly.  Ergo, the shrinker
> > infrastructure is independent of memcgs. Yes, it may have functions
> > to iterate and manipulate memcgs, but it is not dependent on memcgs
> > existing for correct behaviour and functionality.
>
> Okay, but now do it the other way round and explain the memcg bits in
> a world where shrinkers don't exist ;-)
>
> Anyway, we seem to be mostly in agreement below.
>
> > > We have since established that those two components coordinate with
> > > native VM locking and lifetime management. If you need to lock the
> > > page, you lock the page - instead of having all VM paths that already
> > > hold the page lock acquire a nested lock to exclude one cgroup path.
> > >
> > > In this case, we have auxiliary shrinker data, subject to shrinker
> > > lifetime and exclusion rules. It's much easier to understand that
> > > cgroup creation needs a stable shrinker list (shrinker_rwsem) to
> > > manage this data, than having an aliased lock that is private to the
> > > memcg callbacks and obscures this real interdependency.
> >
> > Ok, so the way to do this is to move all the stuff that needs to be
> > done under a "subsystem global" lock to the one file, not turn a
> > static lock into a globally visible lock and spray it around random
> > source files.
>
> Sure, that works as well.
>
> > The shrinker map should be generic functionality for all shrinker
> > invocations because even a non-memcg machine can have thousands of
> > registered shrinkers that are mostly idle all the time.
>
> Agreed.
>
> > IOWs, I think the shrinker map management is not really memcg
> > specific - it's just allocation and assignment of a structure, and
> > the only memcg bit is the map is being stored in a memcg structure.
> > Therefore, if we are looking towards tighter integration then we
> > > should actually move the map management to the shrinker code, not
> > split the shrinker infrastructure management across different files.
> > There's already a heap of code in vmscan.c under #ifdef
> > CONFIG_MEMCG, like the prealloc_shrinker() code path:
> >
> > prealloc_shrinker()                           vmscan.c
> >   if (MEMCG_AWARE)                            vmscan.c
> >     prealloc_memcg_shrinker                   vmscan.c
> > #ifdef CONFIG_MEMCG                           vmscan.c
> >       down_write(shrinker_rwsem)              vmscan.c
> > >       if (id > shrinker_id_max)               vmscan.c
> > >         memcg_expand_shrinker_maps            memcontrol.c
> > >           for_each_memcg                      memcontrol.c
> > >             reallocate shrinker map           memcontrol.c
> > >             replace shrinker map              memcontrol.c
> > >         shrinker_id_max = id                  vmscan.c
> > >       up_write(shrinker_rwsem)                vmscan.c
> > #endif
> >
> > And, really, there's very little code in memcg_expand_shrinker_maps()
> > here - the only memcg part is the memcg iteration loop, and we
> > already have them in vmscan.c (e.g. shrink_node_memcgs(),
> > > age_active_anon(), drop_slab_node()) so there's precedent for
> > moving this memcg iteration for shrinker map management all into
> > vmscan.c.
> >
> > Doing so would formalise the shrinker maps as first class shrinker
> > infrastructure rather than being tacked on to the side of the memcg
> > infrastructure. At this point it makes total sense to serialise map
> > manipulations under the shrinker_rwsem.
>
> Yes, that's a great idea.
>
> > That is, for the medium term, I think  we should be getting rid of
> > the "legacy" non-memcg shrinker path and everything runs under
> > memcgs.  With this patchset moving all the deferred counts to be
> > memcg aware, the only reason for keeping the non-memcg path around
> > goes away.  If sc->memcg is null, then after this patch set we can
> > > simply use the root memcg and just use its per-node accounting
> > rather than having a separate construct for non-memcg aware per-node
> > accounting.
> >
> > Hence if SHRINKER_MEMCG_AWARE is set, it simply means we should run
> > the shrinker if sc->memcg is set.  There is no difference in setup
> > of shrinkers, the duplicate non-memcg/memcg paths go away, and a
> > heap of code drops out of the shrinker infrastructure. It becomes
> > much simpler overall.
>
> Agreed as well.
>
> > It also means we have a path for further integrating memcg aware
> > shrinkers into the shrinker infrastructure because we can always
> > rely on the shrinker infrastructure being memcg aware. And with that
> > in mind, I think we should probably also be moving the shrinker code
> > > out of vmscan.c into its own file as it's really completely
> > separate infrastructure from the vast majority of page reclaim
> > infrastructure in vmscan.c...
>
> Right again.
>
> > That's the view I'm looking at this patchset from. Not just as a
> > standalone bug fix, but also from the perspective of what the
> > architectural change implies and the directions for tighter
> > integration it opens up for us.
>
> Makes sense, but I'm not sure it's getting in the way of that: a
> generalized first-class map would be managed under the shrinker_rwsem,
> so ditching the private lock is good progress. The widened lock scope
> (temporarily, and still mm/) is easy to reverse later on.
>
> That said, moving the map handling code from memcontrol.c to vmscan.c
> in preparation, and/or even reworking the shrinker around the concept
> of a memcg, indeed are great ideas.
>
> I'd support patches doing that.

Thanks a lot for all the great ideas and suggestions! Per the
discussion, I will consolidate all shrinker map related code into
vmscan.c in v3, since the change seems manageable and won't bloat the
patch set.

I will look into further cleanup/refactoring for the mid/long term once
this patch set is done.



* Re: [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2020-12-15 23:07     ` Yang Shi
@ 2020-12-18  0:56       ` Yang Shi
  2020-12-18  1:09         ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2020-12-18  0:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Tue, Dec 15, 2020 at 3:07 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Dec 14, 2020 at 7:05 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Dec 14, 2020 at 02:37:20PM -0800, Yang Shi wrote:
> > > Now nr_deferred is available on per memcg level for memcg aware shrinkers, so don't need
> > > allocate shrinker->nr_deferred for such shrinkers anymore.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  mm/vmscan.c | 28 ++++++++++++++--------------
> > >  1 file changed, 14 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index bce8cf44eca2..8d5bfd818acd 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -420,7 +420,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
> > >   */
> > >  int prealloc_shrinker(struct shrinker *shrinker)
> > >  {
> > > -     unsigned int size = sizeof(*shrinker->nr_deferred);
> > > +     unsigned int size;
> > > +
> > > +     if (is_deferred_memcg_aware(shrinker)) {
> > > +             if (prealloc_memcg_shrinker(shrinker))
> > > +                     return -ENOMEM;
> > > +             return 0;
> > > +     }
> > > +
> > > +     size = sizeof(*shrinker->nr_deferred);
> > >
> > >       if (shrinker->flags & SHRINKER_NUMA_AWARE)
> > >               size *= nr_node_ids;
> > > @@ -429,26 +437,18 @@ int prealloc_shrinker(struct shrinker *shrinker)
> > >       if (!shrinker->nr_deferred)
> > >               return -ENOMEM;
> > >
> > > -     if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> > > -             if (prealloc_memcg_shrinker(shrinker))
> > > -                     goto free_deferred;
> > > -     }
> > > -
> > >       return 0;
> > > -
> > > -free_deferred:
> > > -     kfree(shrinker->nr_deferred);
> > > -     shrinker->nr_deferred = NULL;
> > > -     return -ENOMEM;
> > >  }
> >
> > I'm trying to put my finger on it, but this seems wrong to me. If
> > memcgs are disabled, then prealloc_memcg_shrinker() needs to fail.
> > The preallocation code should not care about internal memcg details
> > like this.
> >
> >         /*
> >          * If the shrinker is memcg aware and memcgs are not
> >          * enabled, clear the MEMCG flag and fall back to non-memcg
> >          * behaviour for the shrinker.
> >          */
> >         if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> >                 error = prealloc_memcg_shrinker(shrinker);
> >                 if (!error)
> >                         return 0;
> >                 if (error != -ENOSYS)
> >                         return error;
> >
> >                 /* memcgs not enabled! */
> >                 shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
> >         }
> >
> >         size = sizeof(*shrinker->nr_deferred);
> >         ....
> >         return 0;
> > }
> >
> > This guarantees that only the shrinker instances that have a
> > correctly set up memcg attached to them will have the
> > SHRINKER_MEMCG_AWARE flag set. Hence in all the rest of the shrinker
> > code, we only ever need to check for SHRINKER_MEMCG_AWARE to
> > determine what we should do....
>
> Thanks. I see your point. We could move the memcg specific details
> into prealloc_memcg_shrinker().
>
> It seems we have to acquire shrinker_rwsem before we check and modify
> SHRINKER_MEMCG_AWARE bit if we may clear it.

Hi Dave,

Is it possible for shrinker registration to race with shrinker
unregistration? From a quick visual code inspection it seems impossible
to me, but I'm not a VFS expert so I'm not quite sure.

If it is impossible, the implementation would be quite simple. Otherwise
we need to move the shrinker_rwsem acquire/release into
prealloc_shrinker/free_prealloced_shrinker/unregister_shrinker to
protect the SHRINKER_MEMCG_AWARE update.
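
The fallback Dave proposes in the quoted text above can be modelled in
userspace as follows. This is a sketch under stated assumptions, not the
kernel code: the model_* names, MODEL_MEMCG_AWARE and the memcg_enabled
toggle are all hypothetical, and the -ENOSYS return from the memcg
preallocation helper follows Dave's suggestion rather than any existing
kernel behaviour. No locking is shown, on the assumption that register
and unregister never run concurrently for the same shrinker.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_MEMCG_AWARE 0x1u  /* stands in for SHRINKER_MEMCG_AWARE */

struct model_shrinker {
	unsigned int flags;
	long *nr_deferred;      /* only allocated on the non-memcg path */
	int id;                 /* only meaningful on the memcg path */
};

static bool memcg_enabled = true;   /* toggled by the caller for the demo */
static int next_id;

/* Hypothetical helper: fails with -ENOSYS when memcgs are unavailable. */
static int model_prealloc_memcg(struct model_shrinker *s)
{
	if (!memcg_enabled)
		return -ENOSYS;
	s->id = next_id++;
	return 0;
}

int model_prealloc(struct model_shrinker *s)
{
	s->nr_deferred = NULL;
	s->id = -1;

	if (s->flags & MODEL_MEMCG_AWARE) {
		int error = model_prealloc_memcg(s);

		if (!error)
			return 0;
		if (error != -ENOSYS)
			return error;

		/* memcgs not enabled: fall back to non-memcg behaviour */
		s->flags &= ~MODEL_MEMCG_AWARE;
	}

	s->nr_deferred = calloc(1, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -ENOMEM;
}

void model_free_prealloced(struct model_shrinker *s)
{
	if (s->flags & MODEL_MEMCG_AWARE) {
		s->id = -1;     /* stands in for unregister_memcg_shrinker() */
		return;
	}
	free(s->nr_deferred);
	s->nr_deferred = NULL;
}
```

After model_prealloc() returns, the flag alone tells the rest of the
code which path the shrinker is on, so no separate
is_deferred_memcg_aware() style check is needed in the free path.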

>
> >
> > >  void free_prealloced_shrinker(struct shrinker *shrinker)
> > >  {
> > > -     if (!shrinker->nr_deferred)
> > > +     if (is_deferred_memcg_aware(shrinker)) {
> > > +             unregister_memcg_shrinker(shrinker);
> > >               return;
> > > +     }
> > >
> > > -     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > > -             unregister_memcg_shrinker(shrinker);
> > > +     if (!shrinker->nr_deferred)
> > > +             return;
> > >
> > >       kfree(shrinker->nr_deferred);
> > >       shrinker->nr_deferred = NULL;
> >
> > e.g. then this function can simply do:
> >
> > {
> >         if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> >                 return unregister_memcg_shrinker(shrinker);
> >         kfree(shrinker->nr_deferred);
> >         shrinker->nr_deferred = NULL;
> > }
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com



* Re: [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2020-12-18  0:56       ` Yang Shi
@ 2020-12-18  1:09         ` Dave Chinner
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2020-12-18  1:09 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 17, 2020 at 04:56:48PM -0800, Yang Shi wrote:
> On Tue, Dec 15, 2020 at 3:07 PM Yang Shi <shy828301@gmail.com> wrote:
> > > This guarantees that only the shrinker instances that have a
> > > correctly set up memcg attached to them will have the
> > > SHRINKER_MEMCG_AWARE flag set. Hence in all the rest of the shrinker
> > > code, we only ever need to check for SHRINKER_MEMCG_AWARE to
> > > determine what we should do....
> >
> > Thanks. I see your point. We could move the memcg specific details
> > into prealloc_memcg_shrinker().
> >
> > It seems we have to acquire shrinker_rwsem before we check and modify
> > SHRINKER_MEMCG_AWARE bit if we may clear it.
> 
> Hi Dave,
> 
> Is it possible that shrinker register races with shrinker unregister?
> It seems impossible to me by a quick visual code inspection. But I'm
> not a VFS expert so I'm not quite sure.

Uh, if you have a shrinker racing to register and unregister, you've
got a major bug in your object initialisation/teardown code. i.e.
calling register/unregister at the same time for the same shrinker
is a bug, pure and simple.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg
       [not found]     ` <CAHbLzkrzv48S3ks-x8M=2sHxRS_+c-hLXdt4ScaWD6mC4ZFe8w@mail.gmail.com>
@ 2020-12-28 20:03       ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2020-12-28 20:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

Johannes's point makes sense to me. If shrinker_maps is not initialized
yet, the memcg is too young to have accumulated many reclaimable slab
objects, so it sounds fine to just skip it.

And once shrinker_maps and shrinker_deferred are consolidated into one
struct, we can just check the pointer to that struct. So this patch no
longer seems necessary and will be dropped in v3.
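
The pointer check described above can be sketched in userspace like
this. It is only an illustration under stated assumptions: struct
model_info is a hypothetical consolidated struct holding both the map
and the deferred counts, and C11 release/acquire atomics stand in for
the RCU pointer publication that the kernel uses for shrinker_map.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NR_IDS 8

/* Hypothetical consolidated per-memcg shrinker state. */
struct model_info {
	unsigned char map[NR_IDS];      /* which shrinkers have work */
	long deferred[NR_IDS];          /* per-shrinker deferred counts */
};

struct model_group {                    /* stands in for a memcg */
	_Atomic(struct model_info *) info;  /* NULL until published */
};

/* Onlining side: fully initialise the struct, then publish the pointer. */
int model_publish(struct model_group *g)
{
	struct model_info *info = calloc(1, sizeof(*info));

	if (!info)
		return -1;
	/* Release: the zeroed contents are visible before the pointer is. */
	atomic_store_explicit(&g->info, info, memory_order_release);
	return 0;
}

/*
 * Reclaim side: returns how many shrinkers would be called. A NULL
 * pointer means the group is too young to have anything to shrink, so
 * we just walk away; no ordering against an "online" flag is needed.
 */
int model_shrink(struct model_group *g)
{
	struct model_info *info =
		atomic_load_explicit(&g->info, memory_order_acquire);
	int i, called = 0;

	if (!info)
		return 0;

	for (i = 0; i < NR_IDS; i++)
		if (info->map[i])
			called++;
	return called;
}
```

The only state the reader depends on is the pointer itself, so a reader
that sees it as non-NULL is guaranteed to see an initialised struct.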

On Tue, Dec 15, 2020 at 12:31 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Dec 15, 2020 at 9:16 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
> > > The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> > > in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
> > > memcg->nodeinfo[nid]->shrinker_maps != NULL.  This may occur because of processor reordering
> > > on !x86.
> > >
> > > This seems like the below case:
> > >
> > >            CPU A          CPU B
> > > store shrinker_map      load CSS_ONLINE
> > > store CSS_ONLINE        load shrinker_map
> >
> > But we have a separate check on shrinker_maps, so it doesn't matter
> > that it isn't guaranteed, no?
>
> IIUC, yes. Checking shrinker_maps is the alternative way to detect the
> reordering to prevent from seeing NULL shrinker_maps per Kirill.
>
> We could check shrinker_deferred too, then just walk away if it is NULL.
>
> >
> > The only downside I can see is when CSS_ONLINE isn't visible yet and
> > we bail even though we'd be ready to shrink. Although it's probably
> > unlikely that there would be any objects allocated already...
>
> Yes, it seems so.
>
> >
> > Can somebody remind me why we check mem_cgroup_online() at all?
>
> IIUC it should be mainly used to skip offlined memcgs since there is
> nothing on offlined memcgs' LRU because all objects have been
> reparented. But shrinker_map won't be freed until .css_free is called.
> So the shrinkers might be called in vain.
>
> >
> > If shrinker_map is set, we can shrink: .css_alloc is guaranteed to be
> > complete, and by using RCU for the shrinker_map pointer, the map is
> > also guaranteed to be initialized. There is nothing else happening
> > during onlining that you may depend on.
> >
> > If shrinker_map isn't set, we cannot iterate the bitmap. It does not
> > really matter whether CSS_ONLINE is reordered and visible already.
>
> As I mentioned above it should be used to skip offlined memcgs, but it
> also opens the race condition due to memory reordering. As Kirill
> explained in the earlier email, we could either check the pointer or
> use memory barriers.
>
> If the memory barriers seem like overkill, I could definitely switch
> back to NULL pointer check approach.
>
> >
> > Agreed with Dave: if we need that synchronization around onlining, it
> > needs to happen inside the cgroup core. But I wouldn't add that until
> > somebody actually required it.



Thread overview: 37+ messages
2020-12-14 22:37 [RFC v2 PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
2020-12-14 22:37 ` [v2 PATCH 1/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
2020-12-14 22:37 ` [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
2020-12-15  2:09   ` Dave Chinner
2020-12-15 13:53     ` Johannes Weiner
2020-12-15 21:59       ` Dave Chinner
2020-12-16 13:17         ` Kirill Tkhai
2020-12-16 19:12         ` Johannes Weiner
2020-12-16 21:56           ` Yang Shi
2020-12-16 19:39         ` Roman Gushchin
2020-12-15 14:07   ` Johannes Weiner
2020-12-15 20:32     ` Yang Shi
2020-12-14 22:37 ` [v2 PATCH 3/9] mm: vmscan: guarantee shrinker_slab_memcg() sees valid shrinker_maps for online memcg Yang Shi
2020-12-15  2:04   ` Dave Chinner
     [not found]   ` <20201215123802.GA379720@cmpxchg.org>
2020-12-15 12:58     ` Kirill Tkhai
2020-12-15 17:14   ` Johannes Weiner
     [not found]     ` <CAHbLzkrzv48S3ks-x8M=2sHxRS_+c-hLXdt4ScaWD6mC4ZFe8w@mail.gmail.com>
2020-12-28 20:03       ` Yang Shi
2020-12-14 22:37 ` [v2 PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
2020-12-14 22:37 ` [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
2020-12-15  2:22   ` Dave Chinner
2020-12-15 14:45     ` Johannes Weiner
2020-12-15 21:57       ` Yang Shi
2020-12-14 22:37 ` [v2 PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
2020-12-15  2:46   ` Dave Chinner
2020-12-15 22:27     ` Yang Shi
2020-12-15 23:48       ` Dave Chinner
2020-12-14 22:37 ` [v2 PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
2020-12-15  3:05   ` Dave Chinner
2020-12-15 23:07     ` Yang Shi
2020-12-18  0:56       ` Yang Shi
2020-12-18  1:09         ` Dave Chinner
2020-12-14 22:37 ` [v2 PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
2020-12-15  3:07   ` Dave Chinner
2020-12-15 23:10     ` Yang Shi
2020-12-14 22:37 ` [v2 PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
2020-12-15  3:23   ` Dave Chinner
2020-12-15 23:59     ` Yang Shi
