linux-kernel.vger.kernel.org archive mirror
* [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware
@ 2021-01-05 22:58 Yang Shi
  2021-01-05 22:58 ` [v3 PATCH 01/11] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
                   ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel


Changelog
v2 --> v3:
    * Moved shrinker_maps related code to vmscan.c per Dave.
    * Removed memcg_shrinker_map_size. The map size is now calculated from
      shrinker_nr_max per Johannes.
    * Consolidated shrinker_deferred with shrinker_maps into one struct per Dave.
    * Simplified the nr_deferred related code.
    * Dropped the memory barrier from v2.
    * Moved nr_deferred reparent code to vmscan.c per Dave.
    * Added test coverage information in patch #11. Dave is concerned about a
      potential regression. I didn't notice any regression in my tests, but suggestions
      for more test coverage are definitely welcome. Having the patch in the -mm tree and
      then the linux-next tree may also help spot regressions, so I keep it in this version.
    * The code cleanup and consolidation grew the series to 11 patches.
    * Rebased onto 5.11-rc2.
v1 --> v2:
    * Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
    * Folded patch #1 into patch #6 per Roman.
    * Added memory barrier to prevent shrink_slab_memcg from seeing NULL shrinker_maps/
      shrinker_deferred per Kirill.
    * Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
      allocations against expansion with shrinker_rwsem per Johannes.

Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
It turned out that the shrinker had accumulated a huge number of nr_deferred
objects.

On our production machine, I saw an absurd nr_deferred value, as shown in the
tracing result below:

<...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node. Assuming all of them
are dentries (192 bytes per object), the total size of the deferred objects on
one node is ~480TB, which is definitely ridiculous.

I managed to reproduce this problem with a kernel build workload plus a negative dentry
generator.

First, run the kernel build test script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
        cgcreate -g memory:kern_build
        echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

        echo 3 > /proc/sys/vm/drop_caches
        cgexec -g memory:kern_build make clean > /dev/null 2>&1
        cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

        cgdelete -g memory:kern_build
done

Then run the negative dentry generator script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
        while true; do
                FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                cat $FILE 2>/dev/null
        done &
done

Then kswapd will shrink half of the dentry cache in just one loop, as the tracing result
below shows:

	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There was a huge number of deferred objects before the shrinker was called. The behavior
does match the code, but it might not be desirable from the user's standpoint.

The excessive nr_deferred might accumulate for various reasons, for example:
    * GFP_NOFS allocations
    * Many scans that each cover only a small amount (< scan_batch, 1024 for vfs metadata)

However, the LRUs of slabs are per memcg (for memcg-aware shrinkers) while the deferred
objects are per shrinker. This may have some bad effects:
    * Poor isolation among memcgs. Some memcgs which happen to have frequent limit
      reclaim may accumulate a huge nr_deferred, then other innocent
      memcgs may take the fall. In our case the main workload was hit.
    * Unbounded deferred objects. There is no cap on deferred objects, so the count can
      grow ridiculously large, as the tracing result showed.
    * Easy to get out of control. Although shrinkers take deferred objects into account,
      the count can go out of control easily. One misconfigured memcg could incur an
      absurd amount of deferred objects in a period of time.
    * Various reclaim problems, e.g. over-reclaim, long reclaim latency, etc. There may be
      hundreds of GB of slab caches for a vfs metadata heavy workload, and shrinking half
      of them may take minutes. We observed latency spikes due to the prolonged reclaim.

These issues have also been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
This patchset is the outcome of that discussion.

So this patchset makes nr_deferred per-memcg to tackle the problem. It does the following:
    * Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map
      does. It is an atomic_long_t array instead of a bitmap. Each element represents one
      shrinker even if the shrinker is not memcg aware, which simplifies the implementation.
      For memcg aware shrinkers, the deferred objects are accumulated in their own
      memcg, and the shrinkers only see nr_deferred from their own memcg. Non memcg aware
      shrinkers still use the global nr_deferred from struct shrinker.
    * Once a memcg is offlined, its nr_deferred is reparented to its parent along
      with the LRUs.
    * The root memcg has a memcg_shrinker_deferred array too. This simplifies the handling
      of reparenting to the root memcg.
    * Cap nr_deferred at 2x the length of the lru. The idea is borrowed from Dave Chinner's
      series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)

The downside is that each memcg has to allocate extra memory to store the nr_deferred array.
In our production environment there are typically around 40 shrinkers, so each memcg
needs ~320 bytes. 10K memcgs would need ~3.2MB of memory. It seems fine.

We have been running the patched kernel on some hosts of our fleet (test and production) for
months, and it works very well. The monitoring data shows the working set is sustained as
expected.

Yang Shi (11):
      mm: vmscan: use nid from shrink_control for tracepoint
      mm: vmscan: consolidate shrinker_maps handling code
      mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
      mm: vmscan: remove memcg_shrinker_map_size
      mm: vmscan: use a new flag to indicate shrinker is registered
      mm: memcontrol: rename shrinker_map to shrinker_info
      mm: vmscan: add per memcg shrinker nr_deferred
      mm: vmscan: use per memcg nr_deferred of shrinker
      mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
      mm: memcontrol: reparent nr_deferred when memcg offline
      mm: vmscan: shrink deferred objects proportional to priority

 include/linux/memcontrol.h |  16 +++--
 include/linux/shrinker.h   |   7 +-
 mm/memcontrol.c            | 131 ++--------------------------------
 mm/vmscan.c                | 351 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------
 4 files changed, 292 insertions(+), 213 deletions(-)



* [v3 PATCH 01/11] mm: vmscan: use nid from shrink_control for tracepoint
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-05 22:58 ` [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code Yang Shi
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The tracepoint's nid should show which node the shrink happens on. The start tracepoint
uses the nid from shrinkctl, but the local nid might be set to 0 before the end tracepoint
if the shrinker is not NUMA aware, so the tracing log may show the shrink starting on one
node but ending on another, which is confusing.  The following patch will also stop
using nid directly in do_shrink_slab(), so this patch helps clean up the code as well.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 257cba79a96d..cb24ef952efc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -535,7 +535,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	else
 		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
 	return freed;
 }
 
-- 
2.26.2



* [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
  2021-01-05 22:58 ` [v3 PATCH 01/11] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-07  0:13   ` Roman Gushchin
  2021-01-05 22:58 ` [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The shrinker map management is not really memcg specific. It is just allocation
and assignment of a structure, and the only memcg bit is that the map is stored
in a memcg structure.  So move the shrinker_maps handling code into vmscan.c for
tighter integration with the shrinker code.  There is no functional change.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |   4 +-
 mm/memcontrol.c            | 124 ------------------------------------
 mm/vmscan.c                | 126 +++++++++++++++++++++++++++++++++++++
 3 files changed, 128 insertions(+), 126 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d827bd7f3bfe..d128d2842f22 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1581,8 +1581,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 	return false;
 }
 
-extern int memcg_expand_shrinker_maps(int new_id);
-
+extern int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg);
+extern void memcg_free_shrinker_maps(struct mem_cgroup *memcg);
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
 #else
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 605f671203ef..817dde366258 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -397,130 +397,6 @@ DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
-static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
-
-static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
-{
-	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
-}
-
-static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
-					 int size, int old_size)
-{
-	struct memcg_shrinker_map *new, *old;
-	int nid;
-
-	lockdep_assert_held(&memcg_shrinker_map_mutex);
-
-	for_each_node(nid) {
-		old = rcu_dereference_protected(
-			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
-		/* Not yet online memcg */
-		if (!old)
-			return 0;
-
-		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
-		if (!new)
-			return -ENOMEM;
-
-		/* Set all old bits, clear all new bits */
-		memset(new->map, (int)0xff, old_size);
-		memset((void *)new->map + old_size, 0, size - old_size);
-
-		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
-		call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
-	}
-
-	return 0;
-}
-
-static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup_per_node *pn;
-	struct memcg_shrinker_map *map;
-	int nid;
-
-	if (mem_cgroup_is_root(memcg))
-		return;
-
-	for_each_node(nid) {
-		pn = mem_cgroup_nodeinfo(memcg, nid);
-		map = rcu_dereference_protected(pn->shrinker_map, true);
-		if (map)
-			kvfree(map);
-		rcu_assign_pointer(pn->shrinker_map, NULL);
-	}
-}
-
-static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
-{
-	struct memcg_shrinker_map *map;
-	int nid, size, ret = 0;
-
-	if (mem_cgroup_is_root(memcg))
-		return 0;
-
-	mutex_lock(&memcg_shrinker_map_mutex);
-	size = memcg_shrinker_map_size;
-	for_each_node(nid) {
-		map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
-		if (!map) {
-			memcg_free_shrinker_maps(memcg);
-			ret = -ENOMEM;
-			break;
-		}
-		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
-	}
-	mutex_unlock(&memcg_shrinker_map_mutex);
-
-	return ret;
-}
-
-int memcg_expand_shrinker_maps(int new_id)
-{
-	int size, old_size, ret = 0;
-	struct mem_cgroup *memcg;
-
-	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
-	old_size = memcg_shrinker_map_size;
-	if (size <= old_size)
-		return 0;
-
-	mutex_lock(&memcg_shrinker_map_mutex);
-	if (!root_mem_cgroup)
-		goto unlock;
-
-	for_each_mem_cgroup(memcg) {
-		if (mem_cgroup_is_root(memcg))
-			continue;
-		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
-		if (ret) {
-			mem_cgroup_iter_break(NULL, memcg);
-			goto unlock;
-		}
-	}
-unlock:
-	if (!ret)
-		memcg_shrinker_map_size = size;
-	mutex_unlock(&memcg_shrinker_map_mutex);
-	return ret;
-}
-
-void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
-{
-	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
-		struct memcg_shrinker_map *map;
-
-		rcu_read_lock();
-		map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
-		/* Pairs with smp mb in shrink_slab() */
-		smp_mb__before_atomic();
-		set_bit(shrinker_id, map->map);
-		rcu_read_unlock();
-	}
-}
-
 /**
  * mem_cgroup_css_from_page - css of the memcg associated with a page
  * @page: page of interest
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb24ef952efc..9db7b4d6d0ae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -185,6 +185,132 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_MEMCG
+
+static int memcg_shrinker_map_size;
+static DEFINE_MUTEX(memcg_shrinker_map_mutex);
+
+static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
+}
+
+static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
+					 int size, int old_size)
+{
+	struct memcg_shrinker_map *new, *old;
+	int nid;
+
+	lockdep_assert_held(&memcg_shrinker_map_mutex);
+
+	for_each_node(nid) {
+		old = rcu_dereference_protected(
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		/* Set all old bits, clear all new bits */
+		memset(new->map, (int)0xff, old_size);
+		memset((void *)new->map + old_size, 0, size - old_size);
+
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+		call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
+	}
+
+	return 0;
+}
+
+void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct memcg_shrinker_map *map;
+	int nid;
+
+	if (mem_cgroup_is_root(memcg))
+		return;
+
+	for_each_node(nid) {
+		pn = mem_cgroup_nodeinfo(memcg, nid);
+		map = rcu_dereference_protected(pn->shrinker_map, true);
+		if (map)
+			kvfree(map);
+		rcu_assign_pointer(pn->shrinker_map, NULL);
+	}
+}
+
+int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_map *map;
+	int nid, size, ret = 0;
+
+	if (mem_cgroup_is_root(memcg))
+		return 0;
+
+	mutex_lock(&memcg_shrinker_map_mutex);
+	size = memcg_shrinker_map_size;
+	for_each_node(nid) {
+		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
+		if (!map) {
+			memcg_free_shrinker_maps(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
+	}
+	mutex_unlock(&memcg_shrinker_map_mutex);
+
+	return ret;
+}
+
+static int memcg_expand_shrinker_maps(int new_id)
+{
+	int size, old_size, ret = 0;
+	struct mem_cgroup *memcg;
+
+	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
+	old_size = memcg_shrinker_map_size;
+	if (size <= old_size)
+		return 0;
+
+	mutex_lock(&memcg_shrinker_map_mutex);
+	if (!root_mem_cgroup)
+		goto unlock;
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		if (mem_cgroup_is_root(memcg))
+			continue;
+		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
+		if (ret) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto unlock;
+		}
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
+unlock:
+	if (!ret)
+		memcg_shrinker_map_size = size;
+	mutex_unlock(&memcg_shrinker_map_mutex);
+	return ret;
+}
+
+void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
+{
+	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
+		struct memcg_shrinker_map *map;
+
+		rcu_read_lock();
+		map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
+		/* Pairs with smp mb in shrink_slab() */
+		smp_mb__before_atomic();
+		set_bit(shrinker_id, map->map);
+		rcu_read_unlock();
+	}
+}
+
 /*
  * We allow subsystems to populate their shrinker-related
  * LRU lists before register_shrinker_prepared() is called
-- 
2.26.2



* [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
  2021-01-05 22:58 ` [v3 PATCH 01/11] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
  2021-01-05 22:58 ` [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-06  9:54   ` Kirill Tkhai
  2021-01-05 22:58 ` [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size Yang Shi
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
exclusively, the read side can be protected by holding the read lock, so a dedicated
mutex seems superfluous.  This should not exacerbate contention on shrinker_rwsem
since just one read side critical section is added.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9db7b4d6d0ae..ddb9f972f856 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #ifdef CONFIG_MEMCG
 
 static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 	struct memcg_shrinker_map *new, *old;
 	int nid;
 
-	lockdep_assert_held(&memcg_shrinker_map_mutex);
-
 	for_each_node(nid) {
 		old = rcu_dereference_protected(
 			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
@@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	if (mem_cgroup_is_root(memcg))
 		return 0;
 
-	mutex_lock(&memcg_shrinker_map_mutex);
+	down_read(&shrinker_rwsem);
 	size = memcg_shrinker_map_size;
 	for_each_node(nid) {
 		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
@@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 		}
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
 	}
-	mutex_unlock(&memcg_shrinker_map_mutex);
+	up_read(&shrinker_rwsem);
 
 	return ret;
 }
@@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
 	if (size <= old_size)
 		return 0;
 
-	mutex_lock(&memcg_shrinker_map_mutex);
 	if (!root_mem_cgroup)
-		goto unlock;
+		goto out;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
@@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
 		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
-			goto unlock;
+			goto out;
 		}
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-unlock:
+out:
 	if (!ret)
 		memcg_shrinker_map_size = size;
-	mutex_unlock(&memcg_shrinker_map_mutex);
+
 	return ret;
 }
 
-- 
2.26.2



* [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (2 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-06 10:15   ` Kirill Tkhai
  2021-01-05 22:58 ` [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but the map size
can actually be calculated from shrinker_nr_max, so it seems unnecessary to keep both.
Remove memcg_shrinker_map_size since shrinker_nr_max is also used when iterating the
bit map.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddb9f972f856..8da765a85569 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -185,8 +185,7 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
 #ifdef CONFIG_MEMCG
-
-static int memcg_shrinker_map_size;
+static int shrinker_nr_max;
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -248,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 		return 0;
 
 	down_read(&shrinker_rwsem);
-	size = memcg_shrinker_map_size;
+	size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
 	for_each_node(nid) {
 		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
 		if (!map) {
@@ -269,7 +268,7 @@ static int memcg_expand_shrinker_maps(int new_id)
 	struct mem_cgroup *memcg;
 
 	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
-	old_size = memcg_shrinker_map_size;
+	old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
 	if (size <= old_size)
 		return 0;
 
@@ -286,10 +285,8 @@ static int memcg_expand_shrinker_maps(int new_id)
 			goto out;
 		}
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-out:
-	if (!ret)
-		memcg_shrinker_map_size = size;
 
+out:
 	return ret;
 }
 
@@ -321,7 +318,6 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
 
 static DEFINE_IDR(shrinker_idr);
-static int shrinker_nr_max;
 
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
-- 
2.26.2



* [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (3 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-06 10:21   ` Kirill Tkhai
  2021-01-05 22:58 ` [v3 PATCH 06/11] mm: memcontrol: rename shrinker_map to shrinker_info Yang Shi
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
This approach is fine with nr_deferred at the shrinker level, but the following
patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
from unregistering correctly.  Use a new flag, SHRINKER_REGISTERED, to indicate
that the shrinker is registered instead.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/shrinker.h |  7 ++++---
 mm/vmscan.c              | 13 +++++++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1eac79ce57d4 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -79,13 +79,14 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE	(1 << 0)
-#define SHRINKER_MEMCG_AWARE	(1 << 1)
+#define SHRINKER_REGISTERED	(1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 1)
+#define SHRINKER_MEMCG_AWARE	(1 << 2)
 /*
  * It just makes sense when the shrinker is also MEMCG_AWARE for now,
  * non-MEMCG_AWARE shrinker should not have this flag set.
  */
-#define SHRINKER_NONSLAB	(1 << 2)
+#define SHRINKER_NONSLAB	(1 << 3)
 
 extern int prealloc_shrinker(struct shrinker *shrinker);
 extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8da765a85569..9761c7c27412 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -494,6 +494,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		idr_replace(&shrinker_idr, shrinker, shrinker->id);
 #endif
+	shrinker->flags |= SHRINKER_REGISTERED;
 	up_write(&shrinker_rwsem);
 }
 
@@ -513,13 +514,17 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
-		return;
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-		unregister_memcg_shrinker(shrinker);
 	down_write(&shrinker_rwsem);
+	if (!(shrinker->flags & SHRINKER_REGISTERED)) {
+		up_write(&shrinker_rwsem);
+		return;
+	}
 	list_del(&shrinker->list);
+	shrinker->flags &= ~SHRINKER_REGISTERED;
 	up_write(&shrinker_rwsem);
+
+	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+		unregister_memcg_shrinker(shrinker);
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
 }
-- 
2.26.2



* [v3 PATCH 06/11] mm: memcontrol: rename shrinker_map to shrinker_info
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (4 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-06 11:38   ` Kirill Tkhai
  2021-01-05 22:58 ` [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred Yang Shi
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The following patch is going to add nr_deferred into shrinker_map, so shrinker_map
will no longer contain only the map.  Rename it to a more general name.  This should
also make the patch adding nr_deferred cleaner and more readable, and make review
easier.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |  8 ++---
 mm/memcontrol.c            |  6 ++--
 mm/vmscan.c                | 66 +++++++++++++++++++-------------------
 3 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d128d2842f22..e05bbe8277cc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -96,7 +96,7 @@ struct lruvec_stat {
  * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
  * which have elements charged to this memcg.
  */
-struct memcg_shrinker_map {
+struct memcg_shrinker_info {
 	struct rcu_head rcu;
 	unsigned long map[];
 };
@@ -118,7 +118,7 @@ struct mem_cgroup_per_node {
 
 	struct mem_cgroup_reclaim_iter	iter;
 
-	struct memcg_shrinker_map __rcu	*shrinker_map;
+	struct memcg_shrinker_info __rcu	*shrinker_info;
 
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long		usage_in_excess;/* Set to the value by which */
@@ -1581,8 +1581,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 	return false;
 }
 
-extern int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg);
-extern void memcg_free_shrinker_maps(struct mem_cgroup *memcg);
+extern int memcg_alloc_shrinker_info(struct mem_cgroup *memcg);
+extern void memcg_free_shrinker_info(struct mem_cgroup *memcg);
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
 #else
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 817dde366258..126f1fd550c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5248,11 +5248,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	/*
-	 * A memcg must be visible for memcg_expand_shrinker_maps()
+	 * A memcg must be visible for memcg_expand_shrinker_info()
 	 * by the time the maps are allocated. So, we allocate maps
 	 * here, when for_each_mem_cgroup() can't skip it.
 	 */
-	if (memcg_alloc_shrinker_maps(memcg)) {
+	if (memcg_alloc_shrinker_info(memcg)) {
 		mem_cgroup_id_remove(memcg);
 		return -ENOMEM;
 	}
@@ -5316,7 +5316,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	vmpressure_cleanup(&memcg->vmpressure);
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
-	memcg_free_shrinker_maps(memcg);
+	memcg_free_shrinker_info(memcg);
 	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9761c7c27412..0033659abf9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -187,20 +187,20 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
 
-static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
+static void memcg_free_shrinker_info_rcu(struct rcu_head *head)
 {
-	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
+	kvfree(container_of(head, struct memcg_shrinker_info, rcu));
 }
 
-static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
-					 int size, int old_size)
+static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
+					  int size, int old_size)
 {
-	struct memcg_shrinker_map *new, *old;
+	struct memcg_shrinker_info *new, *old;
 	int nid;
 
 	for_each_node(nid) {
 		old = rcu_dereference_protected(
-			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
 		/* Not yet online memcg */
 		if (!old)
 			return 0;
@@ -213,17 +213,17 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 		memset(new->map, (int)0xff, old_size);
 		memset((void *)new->map + old_size, 0, size - old_size);
 
-		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
-		call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
+		call_rcu(&old->rcu, memcg_free_shrinker_info_rcu);
 	}
 
 	return 0;
 }
 
-void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
+void memcg_free_shrinker_info(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_node *pn;
-	struct memcg_shrinker_map *map;
+	struct memcg_shrinker_info *info;
 	int nid;
 
 	if (mem_cgroup_is_root(memcg))
@@ -231,16 +231,16 @@ void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 
 	for_each_node(nid) {
 		pn = mem_cgroup_nodeinfo(memcg, nid);
-		map = rcu_dereference_protected(pn->shrinker_map, true);
-		if (map)
-			kvfree(map);
-		rcu_assign_pointer(pn->shrinker_map, NULL);
+		info = rcu_dereference_protected(pn->shrinker_info, true);
+		if (info)
+			kvfree(info);
+		rcu_assign_pointer(pn->shrinker_info, NULL);
 	}
 }
 
-int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
 {
-	struct memcg_shrinker_map *map;
+	struct memcg_shrinker_info *info;
 	int nid, size, ret = 0;
 
 	if (mem_cgroup_is_root(memcg))
@@ -249,20 +249,20 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	down_read(&shrinker_rwsem);
 	size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
 	for_each_node(nid) {
-		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
-		if (!map) {
-			memcg_free_shrinker_maps(memcg);
+		info = kvzalloc(sizeof(*info) + size, GFP_KERNEL);
+		if (!info) {
+			memcg_free_shrinker_info(memcg);
 			ret = -ENOMEM;
 			break;
 		}
-		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
 	}
 	up_read(&shrinker_rwsem);
 
 	return ret;
 }
 
-static int memcg_expand_shrinker_maps(int new_id)
+static int memcg_expand_shrinker_info(int new_id)
 {
 	int size, old_size, ret = 0;
 	struct mem_cgroup *memcg;
@@ -279,7 +279,7 @@ static int memcg_expand_shrinker_maps(int new_id)
 	do {
 		if (mem_cgroup_is_root(memcg))
 			continue;
-		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
+		ret = memcg_expand_one_shrinker_info(memcg, size, old_size);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
 			goto out;
@@ -293,13 +293,13 @@ static int memcg_expand_shrinker_maps(int new_id)
 void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
-		struct memcg_shrinker_map *map;
+		struct memcg_shrinker_info *info;
 
 		rcu_read_lock();
-		map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
+		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
 		/* Pairs with smp mb in shrink_slab() */
 		smp_mb__before_atomic();
-		set_bit(shrinker_id, map->map);
+		set_bit(shrinker_id, info->map);
 		rcu_read_unlock();
 	}
 }
@@ -330,7 +330,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 		goto unlock;
 
 	if (id >= shrinker_nr_max) {
-		if (memcg_expand_shrinker_maps(id)) {
+		if (memcg_expand_shrinker_info(id)) {
 			idr_remove(&shrinker_idr, id);
 			goto unlock;
 		}
@@ -666,7 +666,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			struct mem_cgroup *memcg, int priority)
 {
-	struct memcg_shrinker_map *map;
+	struct memcg_shrinker_info *info;
 	unsigned long ret, freed = 0;
 	int i;
 
@@ -676,12 +676,12 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (!down_read_trylock(&shrinker_rwsem))
 		return 0;
 
-	map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
-					true);
-	if (unlikely(!map))
+	info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+					 true);
+	if (unlikely(!info))
 		goto unlock;
 
-	for_each_set_bit(i, map->map, shrinker_nr_max) {
+	for_each_set_bit(i, info->map, shrinker_nr_max) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -692,7 +692,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		shrinker = idr_find(&shrinker_idr, i);
 		if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
 			if (!shrinker)
-				clear_bit(i, map->map);
+				clear_bit(i, info->map);
 			continue;
 		}
 
@@ -703,7 +703,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 
 		ret = do_shrink_slab(&sc, shrinker, priority);
 		if (ret == SHRINK_EMPTY) {
-			clear_bit(i, map->map);
+			clear_bit(i, info->map);
 			/*
 			 * After the shrinker reported that it had no objects to
 			 * free, but before we cleared the corresponding bit in
-- 
2.26.2



* [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (5 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 06/11] mm: memcontrol: rename shrinker_map to shrinker_info Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-06 11:06   ` Kirill Tkhai
  2021-01-05 22:58 ` [v3 PATCH 08/11] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently the number of deferred objects is tracked per shrinker, but some slabs, for example
the vfs inode/dentry caches, are per memcg.  This results in poor isolation among memcgs.

The deferred objects are typically generated by GFP_NOFS allocations.  One memcg with
excessive GFP_NOFS allocations may blow up the deferred objects, and other innocent memcgs
may then suffer from over-shrinking, excessive reclaim latency, etc.

For example, two workloads run in memcgA and memcgB respectively, and the workload in B is
vfs heavy.  If the workload in A generates excessive deferred objects, B's vfs cache might
be hit heavily (half of the caches dropped) by B's limit reclaim or by global reclaim.

We observed this in our production environment, which was running a vfs heavy workload,
as shown in the tracing log below:

<...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138
<...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
last shrinker return val 123186855

The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches were dropped.
This also resulted in a significant amount of page cache being dropped due to inode eviction.
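
The total_scan value in the trace can be reproduced from the clamping that do_shrink_slab()
applied before this series (the lines removed in patch 11).  Below is a minimal userspace
sketch, not kernel code; old_total_scan() is a hypothetical name used only here, and the
inputs are the nr, delta and cache items values from the log above:

```c
#include <assert.h>

/*
 * Userspace model of the pre-series clamping in do_shrink_slab().
 * old_total_scan() is a hypothetical helper for illustration only.
 */
static long long old_total_scan(long long nr, long long delta,
				long long freeable)
{
	long long total_scan = nr + delta;

	assert(freeable >= 0);

	/* The kernel logs an error and falls back to freeable here. */
	if (total_scan < 0)
		total_scan = freeable;

	/* Small delta: never scan more than half of the cache. */
	if (delta < freeable / 4 && total_scan > freeable / 2)
		total_scan = freeable / 2;

	/* Never scan more than twice the freeable estimate. */
	if (total_scan > freeable * 2)
		total_scan = freeable * 2;

	return total_scan;
}
```

With nr = 3641681686040, delta = 31345 and freeable = 246404277 this returns 123202138,
i.e. half of the cache items, which matches the total_scan reported by
mm_shrink_slab_start.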

Making nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
better isolation.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
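
As a rough illustration of how the single per-node allocation is sized in this patch
(header, then the bitmap, then one counter per shrinker id), here is a userspace sketch.
map_bytes() and deferred_bytes() are hypothetical helpers, and long stands in for
atomic_long_t:

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Bytes for a bitmap with one bit per shrinker id, rounded up to longs. */
static size_t map_bytes(size_t nr_max)
{
	return DIV_ROUND_UP(nr_max, BITS_PER_LONG) * sizeof(unsigned long);
}

/* Bytes for one deferred counter per shrinker id. */
static size_t deferred_bytes(size_t nr_max)
{
	return nr_max * sizeof(long);
}

/* Offset of nr_deferred[] from the end of the header: right after the map. */
static size_t deferred_offset(size_t nr_max)
{
	return map_bytes(nr_max);
}
```

The bitmap grows by one long per BITS_PER_LONG ids while nr_deferred grows by one long
per id, so the counters dominate the allocation once there are more than a few shrinkers.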

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |  7 +++---
 mm/vmscan.c                | 49 +++++++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e05bbe8277cc..5599082df623 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -93,12 +93,13 @@ struct lruvec_stat {
 };
 
 /*
- * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
- * which have elements charged to this memcg.
+ * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
+ * shrinkers, which have elements charged to this memcg.
  */
 struct memcg_shrinker_info {
 	struct rcu_head rcu;
-	unsigned long map[];
+	unsigned long *map;
+	atomic_long_t *nr_deferred;
 };
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0033659abf9e..72259253e414 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -193,10 +193,12 @@ static void memcg_free_shrinker_info_rcu(struct rcu_head *head)
 }
 
 static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
-					  int size, int old_size)
+					  int m_size, int d_size,
+					  int old_m_size, int old_d_size)
 {
 	struct memcg_shrinker_info *new, *old;
 	int nid;
+	int size = m_size + d_size;
 
 	for_each_node(nid) {
 		old = rcu_dereference_protected(
@@ -209,9 +211,18 @@ static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
 		if (!new)
 			return -ENOMEM;
 
-		/* Set all old bits, clear all new bits */
-		memset(new->map, (int)0xff, old_size);
-		memset((void *)new->map + old_size, 0, size - old_size);
+		new->map = (unsigned long *)((unsigned long)new + sizeof(*new));
+		new->nr_deferred = (atomic_long_t *)((unsigned long)new +
+					sizeof(*new) + m_size);
+
+		/* map: set all old bits, clear all new bits */
+		memset(new->map, (int)0xff, old_m_size);
+		memset((void *)new->map + old_m_size, 0, m_size - old_m_size);
+		/* nr_deferred: copy old values, clear all new values */
+		memcpy((void *)new->nr_deferred, (void *)old->nr_deferred,
+		       old_d_size);
+		memset((void *)new->nr_deferred + old_d_size, 0,
+		       d_size - old_d_size);
 
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
 		call_rcu(&old->rcu, memcg_free_shrinker_info_rcu);
@@ -226,9 +237,6 @@ void memcg_free_shrinker_info(struct mem_cgroup *memcg)
 	struct memcg_shrinker_info *info;
 	int nid;
 
-	if (mem_cgroup_is_root(memcg))
-		return;
-
 	for_each_node(nid) {
 		pn = mem_cgroup_nodeinfo(memcg, nid);
 		info = rcu_dereference_protected(pn->shrinker_info, true);
@@ -242,12 +250,13 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
 {
 	struct memcg_shrinker_info *info;
 	int nid, size, ret = 0;
-
-	if (mem_cgroup_is_root(memcg))
-		return 0;
+	int m_size, d_size = 0;
 
 	down_read(&shrinker_rwsem);
-	size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
+	m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
+	d_size = shrinker_nr_max * sizeof(atomic_long_t);
+	size = m_size + d_size;
+
 	for_each_node(nid) {
 		info = kvzalloc(sizeof(*info) + size, GFP_KERNEL);
 		if (!info) {
@@ -255,6 +264,9 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
 			ret = -ENOMEM;
 			break;
 		}
+		info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
+		info->nr_deferred = (atomic_long_t *)((unsigned long)info +
+					sizeof(*info) + m_size);
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
 	}
 	up_read(&shrinker_rwsem);
@@ -265,10 +277,16 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
 static int memcg_expand_shrinker_info(int new_id)
 {
 	int size, old_size, ret = 0;
+	int m_size, d_size = 0;
+	int old_m_size, old_d_size = 0;
 	struct mem_cgroup *memcg;
 
-	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
-	old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
+	m_size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
+	d_size = (new_id + 1) * sizeof(atomic_long_t);
+	size = m_size + d_size;
+	old_m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
+	old_d_size = shrinker_nr_max * sizeof(atomic_long_t);
+	old_size = old_m_size + old_d_size;
 	if (size <= old_size)
 		return 0;
 
@@ -277,9 +295,8 @@ static int memcg_expand_shrinker_info(int new_id)
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		if (mem_cgroup_is_root(memcg))
-			continue;
-		ret = memcg_expand_one_shrinker_info(memcg, size, old_size);
+		ret = memcg_expand_one_shrinker_info(memcg, m_size, d_size,
+						     old_m_size, old_d_size);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
 			goto out;
-- 
2.26.2



* [v3 PATCH 08/11] mm: vmscan: use per memcg nr_deferred of shrinker
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (6 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-07  0:17   ` Roman Gushchin
  2021-01-05 22:58 ` [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
will be used in the following cases:
    1. Non memcg aware shrinkers
    2. !CONFIG_MEMCG
    3. memcg is disabled by boot parameter
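
The selection rule can be modelled in userspace as below.  This is only a sketch:
pick_counter() and pick_nid() are hypothetical names, and it assumes that no memcg is
attached to the shrink_control in cases 2 and 3:

```c
#include <stdbool.h>

enum counter_kind {
	COUNTER_SHRINKER,	/* shrinker->nr_deferred[nid] */
	COUNTER_MEMCG,		/* per memcg nr_deferred[shrinker->id] */
};

/* The per memcg counter is used only with a memcg and a memcg aware shrinker. */
static enum counter_kind pick_counter(bool have_memcg, bool memcg_aware)
{
	if (have_memcg && memcg_aware)
		return COUNTER_MEMCG;
	return COUNTER_SHRINKER;
}

/* Shrinkers that are not NUMA aware always account on node 0. */
static int pick_nid(bool numa_aware, int nid)
{
	return numa_aware ? nid : 0;
}
```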

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 81 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 69 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 72259253e414..f20ed8e928c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -372,6 +372,27 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
 	up_write(&shrinker_rwsem);
 }
 
+static long count_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+				    struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_info *info;
+
+	info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+					 true);
+	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+}
+
+static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+				  struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_info *info;
+
+	info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+					 true);
+
+	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
 	return sc->target_mem_cgroup;
@@ -410,6 +431,18 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
 {
 }
 
+static long count_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+				    struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+				  struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
 	return false;
@@ -421,6 +454,39 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 }
 #endif
 
+static long count_nr_deferred(struct shrinker *shrinker,
+			      struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (sc->memcg &&
+	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
+		return count_nr_deferred_memcg(nid, shrinker,
+					       sc->memcg);
+
+	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+}
+
+
+static long set_nr_deferred(long nr, struct shrinker *shrinker,
+			    struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (sc->memcg &&
+	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
+		return set_nr_deferred_memcg(nr, nid, shrinker,
+					     sc->memcg);
+
+	return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
+}
+
 /*
  * This misses isolated pages which are not accounted for to save counters.
  * As the data only determines if reclaim or compaction continues, it is
@@ -558,14 +624,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	long freeable;
 	long nr;
 	long new_nr;
-	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
 	long scanned = 0, next_deferred;
 
-	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-		nid = 0;
-
 	freeable = shrinker->count_objects(shrinker, shrinkctl);
 	if (freeable == 0 || freeable == SHRINK_EMPTY)
 		return freeable;
@@ -575,7 +637,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * and zero it so that other concurrent shrinker invocations
 	 * don't also do this scanning work.
 	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+	nr = count_nr_deferred(shrinker, shrinkctl);
 
 	total_scan = nr;
 	if (shrinker->seeks) {
@@ -666,14 +728,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		next_deferred = 0;
 	/*
 	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates. If we exhausted the
-	 * scan, there is no need to do an update.
+	 * manner that handles concurrent updates.
 	 */
-	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
-						&shrinker->nr_deferred[nid]);
-	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+	new_nr = set_nr_deferred(next_deferred, shrinker, shrinkctl);
 
 	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
 	return freed;
-- 
2.26.2



* [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (7 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 08/11] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-06 11:15   ` Kirill Tkhai
  2021-01-05 22:58 ` [v3 PATCH 10/11] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
  2021-01-05 22:58 ` [v3 PATCH 11/11] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Now nr_deferred is available at the per memcg level for memcg aware shrinkers, so there is no
need to allocate shrinker->nr_deferred for such shrinkers anymore.

prealloc_memcg_shrinker() returns -ENOSYS if !CONFIG_MEMCG or if memcg is disabled on the
kernel command line, in which case the shrinker's SHRINKER_MEMCG_AWARE flag is cleared.
This makes the implementation of this patch simpler.
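
The resulting control flow of prealloc_shrinker() can be sketched in userspace as follows.
model_prealloc(), struct model_shrinker and the flag value are hypothetical and stand in
for the kernel definitions; memcg_err is what prealloc_memcg_shrinker() would have
returned:

```c
#include <errno.h>
#include <stdbool.h>

#define SHRINKER_MEMCG_AWARE	0x1u	/* illustrative value only */

struct model_shrinker {
	unsigned int flags;
	bool needs_own_nr_deferred;	/* would shrinker->nr_deferred be allocated? */
};

static int model_prealloc(struct model_shrinker *s, int memcg_err)
{
	s->needs_own_nr_deferred = false;

	if (s->flags & SHRINKER_MEMCG_AWARE) {
		/* memcg side set up: the per memcg counters are used. */
		if (!memcg_err)
			return 0;
		/* Real failure, e.g. -ENOMEM: propagate it. */
		if (memcg_err != -ENOSYS)
			return memcg_err;
		/* memcg unavailable: fall back to a plain shrinker. */
		s->flags &= ~SHRINKER_MEMCG_AWARE;
	}

	s->needs_own_nr_deferred = true;
	return 0;
}
```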

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f20ed8e928c2..d9795fb0f1c5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -340,6 +340,9 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
 	int id, ret = -ENOMEM;
 
+	if (mem_cgroup_disabled())
+		return -ENOSYS;
+
 	down_write(&shrinker_rwsem);
 	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
@@ -424,7 +427,7 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 #else
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
-	return 0;
+	return -ENOSYS;
 }
 
 static void unregister_memcg_shrinker(struct shrinker *shrinker)
@@ -535,8 +538,20 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
  */
 int prealloc_shrinker(struct shrinker *shrinker)
 {
-	unsigned int size = sizeof(*shrinker->nr_deferred);
+	unsigned int size;
+	int err;
+
+	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+		err = prealloc_memcg_shrinker(shrinker);
+		if (!err)
+			return 0;
+		if (err != -ENOSYS)
+			return err;
+
+		shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
+	}
 
+	size = sizeof(*shrinker->nr_deferred);
 	if (shrinker->flags & SHRINKER_NUMA_AWARE)
 		size *= nr_node_ids;
 
@@ -544,26 +559,14 @@ int prealloc_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		if (prealloc_memcg_shrinker(shrinker))
-			goto free_deferred;
-	}
 
 	return 0;
-
-free_deferred:
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
-	return -ENOMEM;
 }
 
 void free_prealloced_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
-		return;
-
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-		unregister_memcg_shrinker(shrinker);
+		return unregister_memcg_shrinker(shrinker);
 
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
-- 
2.26.2



* [v3 PATCH 10/11] mm: memcontrol: reparent nr_deferred when memcg offline
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (8 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  2021-01-06 11:34   ` Kirill Tkhai
  2021-01-05 22:58 ` [v3 PATCH 11/11] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
  10 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Now that shrinker's nr_deferred is per memcg for memcg aware shrinkers, add it to the parent's
corresponding nr_deferred when the memcg goes offline.
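
Ignoring locking and the per node loop, the reparenting amounts to an element-wise add of
the child's counters into the parent's.  A userspace sketch with plain longs in place of
atomic_long_t (reparent_deferred() is a hypothetical name):

```c
#include <stddef.h>

/* Add each of the child's deferred counts to the parent's slot for the same id. */
static void reparent_deferred(long *parent, const long *child, size_t nr_max)
{
	for (size_t i = 0; i < nr_max; i++)
		parent[i] += child[i];
}
```

As in the patch, the child's counters are only read and are not reset.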

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            |  1 +
 mm/vmscan.c                | 29 +++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5599082df623..d1e52e916cc2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1586,6 +1586,7 @@ extern int memcg_alloc_shrinker_info(struct mem_cgroup *memcg);
 extern void memcg_free_shrinker_info(struct mem_cgroup *memcg);
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
+extern void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg);
 #else
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 126f1fd550c8..19e555675582 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5284,6 +5284,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	page_counter_set_low(&memcg->memory, 0);
 
 	memcg_offline_kmem(memcg);
+	memcg_reparent_shrinker_deferred(memcg);
 	wb_memcg_offline(memcg);
 
 	drain_all_stock(memcg);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d9795fb0f1c5..71056057d26d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -396,6 +396,35 @@ static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
 	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
 }
 
+void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	int i, nid;
+	long nr;
+	struct mem_cgroup *parent;
+	struct memcg_shrinker_info *child_info, *parent_info;
+
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+
+	/* Prevent from concurrent shrinker_info expand */
+	down_read(&shrinker_rwsem);
+	for_each_node(nid) {
+		child_info = rcu_dereference_protected(
+					memcg->nodeinfo[nid]->shrinker_info,
+					true);
+		parent_info = rcu_dereference_protected(
+					parent->nodeinfo[nid]->shrinker_info,
+					true);
+		for (i = 0; i < shrinker_nr_max; i++) {
+			nr = atomic_long_read(&child_info->nr_deferred[i]);
+			atomic_long_add(nr,
+					&parent_info->nr_deferred[i]);
+		}
+	}
+	up_read(&shrinker_rwsem);
+}
+
 static bool cgroup_reclaim(struct scan_control *sc)
 {
 	return sc->target_mem_cgroup;
-- 
2.26.2



* [v3 PATCH 11/11] mm: vmscan: shrink deferred objects proportional to priority
  2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (9 preceding siblings ...)
  2021-01-05 22:58 ` [v3 PATCH 10/11] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
@ 2021-01-05 22:58 ` Yang Shi
  10 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-05 22:58 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The number of deferred objects might wind up to an absurd number, and this results in
clamping of slab objects.  This is undesirable for sustaining the working set.

So shrink deferred objects in proportion to priority and cap nr_deferred at twice the number
of cache items.
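
The new total_scan calculation can be modelled in userspace as below.  This is a sketch
only: new_total_scan() is a hypothetical name, and delta is taken as an input rather than
derived from shrinker->seeks:

```c
#include <assert.h>
#include <limits.h>

/* Scale the deferred count by priority, add delta, cap at 2 * freeable. */
static long long new_total_scan(long long nr, long long delta,
				long long freeable, int priority)
{
	long long total_scan;

	assert(nr >= 0);
	assert(priority >= 0 &&
	       priority < (int)(sizeof(long long) * CHAR_BIT));

	total_scan = (nr >> priority) + delta;
	if (total_scan > 2 * freeable)
		total_scan = 2 * freeable;

	return total_scan;
}
```

At a low-pressure priority such as 12 only 1/4096 of the deferred count is added to the
scan target, while at priority 0 the whole deferred count is considered, subject to the
2 * freeable cap.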

The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/

Tested with a kernel build and a vfs metadata heavy workload; no regression has been spotted
so far.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 40 +++++-----------------------------------
 1 file changed, 5 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 71056057d26d..6832f1d24d2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -671,7 +671,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 */
 	nr = count_nr_deferred(shrinker, shrinkctl);
 
-	total_scan = nr;
 	if (shrinker->seeks) {
 		delta = freeable >> priority;
 		delta *= 4;
@@ -685,37 +684,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		delta = freeable / 2;
 	}
 
+	total_scan = nr >> priority;
 	total_scan += delta;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = freeable;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
-
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
-
-	/*
-	 * Avoid risking looping forever due to too large nr value:
-	 * never try to free more than twice the estimate number of
-	 * freeable entries.
-	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	total_scan = min(total_scan, (2 * freeable));
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
@@ -754,10 +725,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
-	else
-		next_deferred = 0;
+	next_deferred = max_t(long, (nr - scanned), 0) + total_scan;
+	next_deferred = min(next_deferred, (2 * freeable));
+
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates.
-- 
2.26.2



* Re: [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-05 22:58 ` [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
@ 2021-01-06  9:54   ` Kirill Tkhai
  2021-01-11 17:08     ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-06  9:54 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 06.01.2021 01:58, Yang Shi wrote:
> Since memcg_shrinker_map_size just can be changed under holding shrinker_rwsem
> exclusively, the read side can be protected by holding read lock, so it sounds
> superfluous to have a dedicated mutex.  This should not exacerbate the contention
> to shrinker_rwsem since just one read side critical section is added.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9db7b4d6d0ae..ddb9f972f856 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  #ifdef CONFIG_MEMCG
>  
>  static int memcg_shrinker_map_size;
> -static DEFINE_MUTEX(memcg_shrinker_map_mutex);
>  
>  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
>  {
> @@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
>  	struct memcg_shrinker_map *new, *old;
>  	int nid;
>  
> -	lockdep_assert_held(&memcg_shrinker_map_mutex);
> -
>  	for_each_node(nid) {
>  		old = rcu_dereference_protected(
>  			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> @@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>  	if (mem_cgroup_is_root(memcg))
>  		return 0;
>  
> -	mutex_lock(&memcg_shrinker_map_mutex);
> +	down_read(&shrinker_rwsem);
>  	size = memcg_shrinker_map_size;
>  	for_each_node(nid) {
>  		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> @@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>  		}
>  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);

Here we do a STORE operation, and since we want the assignment to be visible
to shrink_slab_memcg() under down_read(), we have to use down_write()
in memcg_alloc_shrinker_maps().

>  	}
> -	mutex_unlock(&memcg_shrinker_map_mutex);
> +	up_read(&shrinker_rwsem);
>  
>  	return ret;
>  }
> @@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
>  	if (size <= old_size)
>  		return 0;
>  
> -	mutex_lock(&memcg_shrinker_map_mutex);
>  	if (!root_mem_cgroup)
> -		goto unlock;
> +		goto out;
>  
>  	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  	do {
> @@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
>  		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
>  		if (ret) {
>  			mem_cgroup_iter_break(NULL, memcg);
> -			goto unlock;
> +			goto out;
>  		}
>  	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> -unlock:
> +out:
>  	if (!ret)
>  		memcg_shrinker_map_size = size;
> -	mutex_unlock(&memcg_shrinker_map_mutex);
> +
>  	return ret;
>  }
>  
> 



* Re: [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size
  2021-01-05 22:58 ` [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size Yang Shi
@ 2021-01-06 10:15   ` Kirill Tkhai
  2021-01-11 17:44     ` Yang Shi
  2021-01-13 23:48     ` Yang Shi
  0 siblings, 2 replies; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-06 10:15 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 06.01.2021 01:58, Yang Shi wrote:
> Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but actually the
> map size can be calculated via shrinker_nr_max, so it seems unnecessary to keep both.
> Remove memcg_shrinker_map_size since shrinker_nr_max is also used by iterating the
> bit map.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddb9f972f856..8da765a85569 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -185,8 +185,7 @@ static LIST_HEAD(shrinker_list);
>  static DECLARE_RWSEM(shrinker_rwsem);
>  
>  #ifdef CONFIG_MEMCG
> -
> -static int memcg_shrinker_map_size;
> +static int shrinker_nr_max;
>  
>  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
>  {
> @@ -248,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>  		return 0;
>  
>  	down_read(&shrinker_rwsem);
> -	size = memcg_shrinker_map_size;
> +	size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
>  	for_each_node(nid) {
>  		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
>  		if (!map) {
> @@ -269,7 +268,7 @@ static int memcg_expand_shrinker_maps(int new_id)
>  	struct mem_cgroup *memcg;
>  
>  	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> -	old_size = memcg_shrinker_map_size;
> +	old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
>  	if (size <= old_size)
>  		return 0;

This bunch of DIV_ROUND_UP() calls looks too complex. Now that all the shrinker map allocation
logic lives in a single file, can't we simplify it? I mean something like the below,
to be merged into your patch:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b951c289ef3a..27b6371a1656 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -247,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 		return 0;
 
 	down_read(&shrinker_rwsem);
-	size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
+	size = shrinker_nr_max / BITS_PER_BYTE;
 	for_each_node(nid) {
 		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
 		if (!map) {
@@ -264,13 +264,11 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 
 static int memcg_expand_shrinker_maps(int new_id)
 {
-	int size, old_size, ret = 0;
+	int size, old_size, new_nr_max, ret = 0;
 	struct mem_cgroup *memcg;
 
 	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
-	old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
-	if (size <= old_size)
-		return 0;
+	new_nr_max = size * BITS_PER_BYTE;
 
 	if (!root_mem_cgroup)
 		goto out;
@@ -287,6 +285,9 @@ static int memcg_expand_shrinker_maps(int new_id)
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 
 out:
+	if (ret == 0)
+		shrinker_nr_max = new_nr_max;
+
 	return ret;
 }
 
@@ -334,8 +335,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 			idr_remove(&shrinker_idr, id);
 			goto unlock;
 		}
-
-		shrinker_nr_max = id + 1;
 	}
 	shrinker->id = id;
 	ret = 0;
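
The simplification above relies on shrinker_nr_max always being kept as a multiple of
BITS_PER_LONG, so that a plain division by BITS_PER_BYTE yields the same byte count as the
DIV_ROUND_UP() form. A minimal userspace sketch of that arithmetic follows; the macros are
local stand-ins for the kernel ones, and the helper names (map_size_rounded, nr_max_for_id,
map_size_simple) are made up for illustration and do not exist in the kernel:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Local stand-ins for the kernel macros; these are not the kernel headers. */
#define BITS_PER_BYTE		CHAR_BIT
#define BITS_PER_LONG		(sizeof(unsigned long) * BITS_PER_BYTE)
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Bytes needed for a bitmap holding ids 0..nr-1, in whole longs (hypothetical). */
static size_t map_size_rounded(size_t nr)
{
	return DIV_ROUND_UP(nr, BITS_PER_LONG) * sizeof(unsigned long);
}

/* New nr_max after registering new_id, rounded up to whole longs (hypothetical). */
static size_t nr_max_for_id(size_t new_id)
{
	return map_size_rounded(new_id + 1) * BITS_PER_BYTE;
}

/* Once nr_max is rounded, the plain division gives the same size (hypothetical). */
static size_t map_size_simple(size_t nr_max)
{
	assert(nr_max % BITS_PER_LONG == 0);
	return nr_max / BITS_PER_BYTE;
}
```

For any id, map_size_simple(nr_max_for_id(id)) equals map_size_rounded(id + 1), which is
why the second DIV_ROUND_UP() can go away once shrinker_nr_max itself is stored rounded.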


* Re: [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered
  2021-01-05 22:58 ` [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
@ 2021-01-06 10:21   ` Kirill Tkhai
  2021-01-11 18:17     ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-06 10:21 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 06.01.2021 01:58, Yang Shi wrote:
> Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> This approach is fine with nr_deferred at the shrinker level, but the following
> patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> from unregistering correctly.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/shrinker.h |  7 ++++---
>  mm/vmscan.c              | 13 +++++++++----
>  2 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 0f80123650e2..1eac79ce57d4 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -79,13 +79,14 @@ struct shrinker {
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>  
>  /* Flags */
> -#define SHRINKER_NUMA_AWARE	(1 << 0)
> -#define SHRINKER_MEMCG_AWARE	(1 << 1)
> +#define SHRINKER_REGISTERED	(1 << 0)
> +#define SHRINKER_NUMA_AWARE	(1 << 1)
> +#define SHRINKER_MEMCG_AWARE	(1 << 2)
>  /*
>   * It just makes sense when the shrinker is also MEMCG_AWARE for now,
>   * non-MEMCG_AWARE shrinker should not have this flag set.
>   */
> -#define SHRINKER_NONSLAB	(1 << 2)
> +#define SHRINKER_NONSLAB	(1 << 3)
>  
>  extern int prealloc_shrinker(struct shrinker *shrinker);
>  extern void register_shrinker_prepared(struct shrinker *shrinker);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8da765a85569..9761c7c27412 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -494,6 +494,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		idr_replace(&shrinker_idr, shrinker, shrinker->id);
>  #endif
> +	shrinker->flags |= SHRINKER_REGISTERED;

If we introduce this new flag, we should kill the old flag SHRINKER_REGISTERING,
which is not needed anymore (we should use the new flag instead).

>  	up_write(&shrinker_rwsem);
>  }
>  
> @@ -513,13 +514,17 @@ EXPORT_SYMBOL(register_shrinker);
>   */
>  void unregister_shrinker(struct shrinker *shrinker)
>  {
> -	if (!shrinker->nr_deferred)
> -		return;
> -	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> -		unregister_memcg_shrinker(shrinker);
>  	down_write(&shrinker_rwsem);

I do not think there are any users whose registration may race with unregistration.
So, I think we should check SHRINKER_REGISTERED unlocked, similar to how we used to check
shrinker->nr_deferred unlocked.

> +	if (!(shrinker->flags & SHRINKER_REGISTERED)) {
> +		up_write(&shrinker_rwsem);
> +		return;
> +	}
>  	list_del(&shrinker->list);
> +	shrinker->flags &= ~SHRINKER_REGISTERED;
>  	up_write(&shrinker_rwsem);
> +
> +	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> +		unregister_memcg_shrinker(shrinker);
>  	kfree(shrinker->nr_deferred);
>  	shrinker->nr_deferred = NULL;
>  }
> 
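
The unlocked check discussed above can be sketched in userspace as follows. This is only
a toy model under stated assumptions: struct toy_shrinker, TOY_REGISTERED and both
functions are hypothetical names, the list is reduced to an integer, and the places where
the kernel would take shrinker_rwsem are marked in comments only.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_REGISTERED	(1u << 0)	/* plays the role of SHRINKER_REGISTERED */

struct toy_shrinker {
	unsigned int flags;
	int on_list;			/* stands in for membership in shrinker_list */
};

static void toy_register(struct toy_shrinker *s)
{
	/* Double registration is treated as a caller bug in this toy. */
	assert(!(s->flags & TOY_REGISTERED));
	/* kernel: down_write(&shrinker_rwsem) would be taken here */
	s->on_list = 1;
	s->flags |= TOY_REGISTERED;
	/* kernel: up_write(&shrinker_rwsem) */
}

/* Returns true only if the shrinker was registered and has now been removed. */
static bool toy_unregister(struct toy_shrinker *s)
{
	/* Unlocked early exit: nothing to undo if registration never happened. */
	if (!(s->flags & TOY_REGISTERED))
		return false;
	/* kernel: down_write(&shrinker_rwsem) would be taken here */
	s->on_list = 0;
	s->flags &= ~TOY_REGISTERED;
	/* kernel: up_write(&shrinker_rwsem) */
	return true;
}
```

The early return keeps unregistration of a never-registered shrinker free of any lock
traffic, which is the behaviour the old nr_deferred check provided.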



* Re: [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred
  2021-01-05 22:58 ` [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred Yang Shi
@ 2021-01-06 11:06   ` Kirill Tkhai
  2021-01-11 18:24     ` Yang Shi
  2021-01-13 23:30     ` Yang Shi
  0 siblings, 2 replies; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-06 11:06 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 06.01.2021 01:58, Yang Shi wrote:
> Currently the number of deferred objects are per shrinker, but some slabs, for example,
> vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> 
> The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> may suffer from over shrink, excessive reclaim latency, etc.
> 
> For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> 
> We observed this hit in our production environment which was running vfs heavy workload
> shown as the below tracing log:
> 
> <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> last shrinker return val 123186855
> 
> The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches were dropped.
> This also resulted in a significant amount of page cache being dropped due to inode eviction.
> 
> Making nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> better isolation.
> 
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/memcontrol.h |  7 +++---
>  mm/vmscan.c                | 49 +++++++++++++++++++++++++-------------
>  2 files changed, 37 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e05bbe8277cc..5599082df623 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -93,12 +93,13 @@ struct lruvec_stat {
>  };
>  
>  /*
> - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> - * which have elements charged to this memcg.
> + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> + * shrinkers, which have elements charged to this memcg.
>   */
>  struct memcg_shrinker_info {
>  	struct rcu_head rcu;
> -	unsigned long map[];
> +	unsigned long *map;
> +	atomic_long_t *nr_deferred;
>  };
>  
>  /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0033659abf9e..72259253e414 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -193,10 +193,12 @@ static void memcg_free_shrinker_info_rcu(struct rcu_head *head)
>  }
>  
>  static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
> -					  int size, int old_size)
> +					  int m_size, int d_size,
> +					  int old_m_size, int old_d_size)
>  {
>  	struct memcg_shrinker_info *new, *old;
>  	int nid;
> +	int size = m_size + d_size;
>  
>  	for_each_node(nid) {
>  		old = rcu_dereference_protected(
> @@ -209,9 +211,18 @@ static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
>  		if (!new)
>  			return -ENOMEM;
>  
> -		/* Set all old bits, clear all new bits */
> -		memset(new->map, (int)0xff, old_size);
> -		memset((void *)new->map + old_size, 0, size - old_size);
> +		new->map = (unsigned long *)((unsigned long)new + sizeof(*new));
> +		new->nr_deferred = (atomic_long_t *)((unsigned long)new +
> +					sizeof(*new) + m_size);

Can't we write this more compactly?

		new->map = (unsigned long *)(new + 1);
		new->nr_deferred = (atomic_long_t *)((void *)new->map + m_size);

> +
> +		/* map: set all old bits, clear all new bits */
> +		memset(new->map, (int)0xff, old_m_size);
> +		memset((void *)new->map + old_m_size, 0, m_size - old_m_size);
> +		/* nr_deferred: copy old values, clear all new values */
> +		memcpy((void *)new->nr_deferred, (void *)old->nr_deferred,
> +		       old_d_size);

Why not
	 	memcpy(new->nr_deferred, old->nr_deferred, old_d_size);
?
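
For reference, the layout being set up here is a header followed by the bitmap and then
the nr_deferred array, all in one allocation, so nr_deferred has to start m_size bytes
past map. A userspace sketch of that layout, assuming C11 atomics in place of
atomic_long_t and calloc() in place of kvzalloc(); struct toy_info and toy_info_alloc()
are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy counterpart of struct memcg_shrinker_info, without the rcu_head. */
struct toy_info {
	unsigned long *map;
	atomic_long *nr_deferred;
};

/* One allocation: [ struct toy_info | map (m_size bytes) | nr_deferred (d_size bytes) ] */
static struct toy_info *toy_info_alloc(size_t m_size, size_t d_size)
{
	struct toy_info *info;
	size_t i;

	/* Keep nr_deferred aligned: the bitmap must be a whole number of longs. */
	assert(m_size % sizeof(unsigned long) == 0);

	info = calloc(1, sizeof(*info) + m_size + d_size);
	if (!info)
		return NULL;

	/* The bitmap starts right after the header ... */
	info->map = (unsigned long *)(info + 1);
	/* ... and the counters start m_size bytes after the bitmap. */
	info->nr_deferred = (atomic_long *)((char *)info->map + m_size);

	for (i = 0; i < d_size / sizeof(atomic_long); i++)
		atomic_init(&info->nr_deferred[i], 0);

	return info;
}
```

Because both arrays live in the same block, a single free() (kvfree() in the kernel)
releases everything, and no casts are needed at the memcpy()/memset() call sites.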

> +		memset((void *)new->nr_deferred + old_d_size, 0,
> +		       d_size - old_d_size);
>  
>  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
>  		call_rcu(&old->rcu, memcg_free_shrinker_info_rcu);
> @@ -226,9 +237,6 @@ void memcg_free_shrinker_info(struct mem_cgroup *memcg)
>  	struct memcg_shrinker_info *info;
>  	int nid;
>  
> -	if (mem_cgroup_is_root(memcg))
> -		return;
> -
>  	for_each_node(nid) {
>  		pn = mem_cgroup_nodeinfo(memcg, nid);
>  		info = rcu_dereference_protected(pn->shrinker_info, true);
> @@ -242,12 +250,13 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
>  {
>  	struct memcg_shrinker_info *info;
>  	int nid, size, ret = 0;
> -
> -	if (mem_cgroup_is_root(memcg))
> -		return 0;
> +	int m_size, d_size = 0;
>  
>  	down_read(&shrinker_rwsem);
> -	size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> +	m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> +	d_size = shrinker_nr_max * sizeof(atomic_long_t);
> +	size = m_size + d_size;
> +
>  	for_each_node(nid) {
>  		info = kvzalloc(sizeof(*info) + size, GFP_KERNEL);
>  		if (!info) {
> @@ -255,6 +264,9 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
>  			ret = -ENOMEM;
>  			break;
>  		}
> +		info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
> +		info->nr_deferred = (atomic_long_t *)((unsigned long)info +
> +					sizeof(*info) + m_size);
>  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>  	}
>  	up_read(&shrinker_rwsem);
> @@ -265,10 +277,16 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
>  static int memcg_expand_shrinker_info(int new_id)
>  {
>  	int size, old_size, ret = 0;
> +	int m_size, d_size = 0;
> +	int old_m_size, old_d_size = 0;
>  	struct mem_cgroup *memcg;
>  
> -	size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> -	old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> +	m_size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> +	d_size = (new_id + 1) * sizeof(atomic_long_t);
> +	size = m_size + d_size;
> +	old_m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> +	old_d_size = shrinker_nr_max * sizeof(atomic_long_t);
> +	old_size = old_m_size + old_d_size;
>  	if (size <= old_size)
>  		return 0;

This replication of patch [4/11] looks awkward. Please try to apply
the same changes to nr_deferred that I requested for shrinker_map in [4/11].

>  
> @@ -277,9 +295,8 @@ static int memcg_expand_shrinker_info(int new_id)
>  
>  	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  	do {
> -		if (mem_cgroup_is_root(memcg))
> -			continue;
> -		ret = memcg_expand_one_shrinker_info(memcg, size, old_size);
> +		ret = memcg_expand_one_shrinker_info(memcg, m_size, d_size,
> +						     old_m_size, old_d_size);
>  		if (ret) {
>  			mem_cgroup_iter_break(NULL, memcg);
>  			goto out;
> 
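
The numbers in the trace quoted in the commit message can be reproduced with a small
worked example. It assumes the clamping that mainline do_shrink_slab() applied around
this time (cap at freeable / 2 when delta is small, never more than freeable * 2); the
production kernel in the trace may differ, and toy_total_scan() is a hypothetical helper,
not a kernel function:

```c
#include <assert.h>

/*
 * How many objects one shrinker invocation will try to scan, given the
 * accumulated deferred count, the delta for this pass and the cache size.
 */
static long long toy_total_scan(long long nr_deferred, long long delta,
				long long freeable)
{
	long long total_scan;

	assert(delta >= 0 && freeable >= 0);

	total_scan = nr_deferred + delta;

	/* Small delta relative to the cache: cap the scan at half of the cache. */
	if (delta < freeable / 4 && total_scan > freeable / 2)
		total_scan = freeable / 2;

	/* Never try to scan more than twice the number of freeable objects. */
	if (total_scan > freeable * 2)
		total_scan = freeable * 2;

	return total_scan;
}
```

With nr_deferred = 3641681686040, delta = 31345 and 246404277 cache items, the result is
246404277 / 2 = 123202138, matching the total_scan in the mm_shrink_slab_start line. With
no deferred work the same call would scan only delta = 31345 objects, which shows how a
huge inherited nr_deferred turns a light pass into dropping half of the cache.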



* Re: [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2021-01-05 22:58 ` [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
@ 2021-01-06 11:15   ` Kirill Tkhai
  2021-01-11 18:40     ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-06 11:15 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 06.01.2021 01:58, Yang Shi wrote:
> Now nr_deferred is available on per memcg level for memcg aware shrinkers, so don't need
> allocate shrinker->nr_deferred for such shrinkers anymore.
> 
> The prealloc_memcg_shrinker() would return -ENOSYS if !CONFIG_MEMCG or memcg is disabled
> by kernel command line, then shrinker's SHRINKER_MEMCG_AWARE flag would be cleared.
> This makes the implementation of this patch simpler.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 33 ++++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f20ed8e928c2..d9795fb0f1c5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -340,6 +340,9 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
>  	int id, ret = -ENOMEM;
>  
> +	if (mem_cgroup_disabled())
> +		return -ENOSYS;
> +
>  	down_write(&shrinker_rwsem);
>  	/* This may call shrinker, so it must use down_read_trylock() */
>  	id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> @@ -424,7 +427,7 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  #else
>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
> -	return 0;
> +	return -ENOSYS;
>  }
>  
>  static void unregister_memcg_shrinker(struct shrinker *shrinker)
> @@ -535,8 +538,20 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
>   */
>  int prealloc_shrinker(struct shrinker *shrinker)
>  {
> -	unsigned int size = sizeof(*shrinker->nr_deferred);
> +	unsigned int size;
> +	int err;
> +
> +	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> +		err = prealloc_memcg_shrinker(shrinker);
> +		if (!err)
> +			return 0;
> +		if (err != -ENOSYS)
> +			return err;
> +
> +		shrinker->flags &= ~SHRINKER_MEMCG_AWARE;

This looks very confusing.

If you want to disable the preallocation branch for the !MEMCG case,
you should first consider something like the below:

#ifdef CONFIG_MEMCG
#define SHRINKER_MEMCG_AWARE    (1 << 2)
#else
#define SHRINKER_MEMCG_AWARE    0
#endif

> +	}
>  
> +	size = sizeof(*shrinker->nr_deferred);
>  	if (shrinker->flags & SHRINKER_NUMA_AWARE)
>  		size *= nr_node_ids;
>  
> @@ -544,26 +559,14 @@ int prealloc_shrinker(struct shrinker *shrinker)
>  	if (!shrinker->nr_deferred)
>  		return -ENOMEM;
>  
> -	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> -		if (prealloc_memcg_shrinker(shrinker))
> -			goto free_deferred;
> -	}
>  
>  	return 0;
> -
> -free_deferred:
> -	kfree(shrinker->nr_deferred);
> -	shrinker->nr_deferred = NULL;
> -	return -ENOMEM;
>  }
>  
>  void free_prealloced_shrinker(struct shrinker *shrinker)
>  {
> -	if (!shrinker->nr_deferred)
> -		return;
> -
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> -		unregister_memcg_shrinker(shrinker);
> +		return unregister_memcg_shrinker(shrinker);
>  
>  	kfree(shrinker->nr_deferred);
>  	shrinker->nr_deferred = NULL;
> 
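
The effect of defining the flag as 0 when the feature is compiled out can be seen in a
small userspace sketch. TOY_CONFIG_MEMCG, the TOY_* flags and
toy_needs_own_nr_deferred() are hypothetical stand-ins, not kernel symbols:

```c
#include <stdbool.h>

/* Build with -DTOY_CONFIG_MEMCG to model a kernel with memcg support. */
#ifdef TOY_CONFIG_MEMCG
#define TOY_MEMCG_AWARE	(1u << 2)
#else
#define TOY_MEMCG_AWARE	0u
#endif
#define TOY_NUMA_AWARE	(1u << 1)

/* Does this shrinker need its own nr_deferred array? */
static bool toy_needs_own_nr_deferred(unsigned int flags)
{
	/*
	 * When TOY_MEMCG_AWARE is 0 this condition is a compile-time
	 * constant false, so the branch disappears without any error-code
	 * juggling or clearing of the flag at run time.
	 */
	if (flags & TOY_MEMCG_AWARE)
		return false;

	return true;
}
```

Callers can keep setting the flag unconditionally; without memcg support it simply has
no bits, so every test against it falls through to the non-memcg path.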



* Re: [v3 PATCH 10/11] mm: memcontrol: reparent nr_deferred when memcg offline
  2021-01-05 22:58 ` [v3 PATCH 10/11] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
@ 2021-01-06 11:34   ` Kirill Tkhai
  2021-01-11 18:43     ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-06 11:34 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 06.01.2021 01:58, Yang Shi wrote:
> Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add to parent's
> corresponding nr_deferred when memcg offline.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/memcontrol.h |  1 +
>  mm/memcontrol.c            |  1 +
>  mm/vmscan.c                | 29 +++++++++++++++++++++++++++++
>  3 files changed, 31 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5599082df623..d1e52e916cc2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1586,6 +1586,7 @@ extern int memcg_alloc_shrinker_info(struct mem_cgroup *memcg);
>  extern void memcg_free_shrinker_info(struct mem_cgroup *memcg);
>  extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
>  				   int nid, int shrinker_id);
> +extern void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg);
>  #else
>  #define mem_cgroup_sockets_enabled 0
>  static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 126f1fd550c8..19e555675582 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5284,6 +5284,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	page_counter_set_low(&memcg->memory, 0);
>  
>  	memcg_offline_kmem(memcg);
> +	memcg_reparent_shrinker_deferred(memcg);
>  	wb_memcg_offline(memcg);
>  
>  	drain_all_stock(memcg);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d9795fb0f1c5..71056057d26d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -396,6 +396,35 @@ static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
>  	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
>  }
>  
> +void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
> +{
> +	int i, nid;
> +	long nr;
> +	struct mem_cgroup *parent;
> +	struct memcg_shrinker_info *child_info, *parent_info;
> +
> +	parent = parent_mem_cgroup(memcg);
> +	if (!parent)
> +		parent = root_mem_cgroup;
> +
> +	/* Prevent from concurrent shrinker_info expand */
> +	down_read(&shrinker_rwsem);
> +	for_each_node(nid) {
> +		child_info = rcu_dereference_protected(
> +					memcg->nodeinfo[nid]->shrinker_info,
> +					true);
> +		parent_info = rcu_dereference_protected(
> +					parent->nodeinfo[nid]->shrinker_info,
> +					true);

A simple assignment should not take this much space; we have to do something about that.

The number of these

	rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info, true)

has become too big, and we can't allow each of them to take 3 lines.

We should introduce a short helper to dereference this, so we can give
our attention to the really difficult logic instead of wasting it on parsing this.

		child_info = memcg_shrinker_info(memcg, nid);
or
		child_info = memcg_shrinker_info_protected(memcg, nid);

Both of them fit on a single line.

struct memcg_shrinker_info *memcg_shrinker_info_protected(
					struct mem_cgroup *memcg, int nid)
{
	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
					 lockdep_is_held(&shrinker_rwsem));
}


> +		for (i = 0; i < shrinker_nr_max; i++) {
> +			nr = atomic_long_read(&child_info->nr_deferred[i]);
> +			atomic_long_add(nr,
> +					&parent_info->nr_deferred[i]);

Why is there a line break here? If you join the two lines, the result will be even shorter than the previous line.

> +		}
> +	}
> +	up_read(&shrinker_rwsem);
> +}
> +
>  static bool cgroup_reclaim(struct scan_control *sc)
>  {
>  	return sc->target_mem_cgroup;
> 
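
The reparenting loop in this patch boils down to adding each of the child's per-shrinker
counters to the parent's counter with the same index. A userspace sketch using C11
atomics in place of atomic_long_t; toy_reparent_deferred() is a hypothetical name, and
the locking that keeps the arrays from being resized is only noted in a comment:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Fold the child's deferred counts into the parent's, slot by slot.
 * kernel: runs under down_read(&shrinker_rwsem) so that neither array
 * can be expanded while the loop is running.
 */
static void toy_reparent_deferred(atomic_long *child, atomic_long *parent,
				  size_t nr_max)
{
	size_t i;

	assert(child && parent);

	for (i = 0; i < nr_max; i++) {
		long nr = atomic_load(&child[i]);

		atomic_fetch_add(&parent[i], nr);
	}
}
```

The child's counters are left as they are, as in the patch; the child is going offline,
so they are not consulted again.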



* Re: [v3 PATCH 06/11] mm: memcontrol: rename shrinker_map to shrinker_info
  2021-01-05 22:58 ` [v3 PATCH 06/11] mm: memcontrol: rename shrinker_map to shrinker_info Yang Shi
@ 2021-01-06 11:38   ` Kirill Tkhai
  2021-01-11 18:19     ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-06 11:38 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 06.01.2021 01:58, Yang Shi wrote:
> The following patch is going to add nr_deferred into shrinker_map, the change will
> make shrinker_map not only include map anymore, so rename it to a more general
> name.  And this should make the patch adding nr_deferred cleaner and readable and make
> review easier.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/memcontrol.h |  8 ++---
>  mm/memcontrol.c            |  6 ++--
>  mm/vmscan.c                | 66 +++++++++++++++++++-------------------
>  3 files changed, 40 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d128d2842f22..e05bbe8277cc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -96,7 +96,7 @@ struct lruvec_stat {
>   * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
>   * which have elements charged to this memcg.
>   */
> -struct memcg_shrinker_map {
> +struct memcg_shrinker_info {

Having reviewed your next patch, which actively uses the new fields in this structure,
I strongly insist on renaming it to "struct shrinker_info" instead.

Otherwise, the lines of function declarations become too long.

>  	struct rcu_head rcu;
>  	unsigned long map[];
>  };
> @@ -118,7 +118,7 @@ struct mem_cgroup_per_node {
>  
>  	struct mem_cgroup_reclaim_iter	iter;
>  
> -	struct memcg_shrinker_map __rcu	*shrinker_map;
> +	struct memcg_shrinker_info __rcu	*shrinker_info;
>  
>  	struct rb_node		tree_node;	/* RB tree node */
>  	unsigned long		usage_in_excess;/* Set to the value by which */
> @@ -1581,8 +1581,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
>  	return false;
>  }
>  
> -extern int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg);
> -extern void memcg_free_shrinker_maps(struct mem_cgroup *memcg);
> +extern int memcg_alloc_shrinker_info(struct mem_cgroup *memcg);
> +extern void memcg_free_shrinker_info(struct mem_cgroup *memcg);
>  extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
>  				   int nid, int shrinker_id);
>  #else
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 817dde366258..126f1fd550c8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5248,11 +5248,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  
>  	/*
> -	 * A memcg must be visible for memcg_expand_shrinker_maps()
> +	 * A memcg must be visible for memcg_expand_shrinker_info()
>  	 * by the time the maps are allocated. So, we allocate maps
>  	 * here, when for_each_mem_cgroup() can't skip it.
>  	 */
> -	if (memcg_alloc_shrinker_maps(memcg)) {
> +	if (memcg_alloc_shrinker_info(memcg)) {
>  		mem_cgroup_id_remove(memcg);
>  		return -ENOMEM;
>  	}
> @@ -5316,7 +5316,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	vmpressure_cleanup(&memcg->vmpressure);
>  	cancel_work_sync(&memcg->high_work);
>  	mem_cgroup_remove_from_trees(memcg);
> -	memcg_free_shrinker_maps(memcg);
> +	memcg_free_shrinker_info(memcg);
>  	memcg_free_kmem(memcg);
>  	mem_cgroup_free(memcg);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9761c7c27412..0033659abf9e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -187,20 +187,20 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
>  
> -static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> +static void memcg_free_shrinker_info_rcu(struct rcu_head *head)
>  {
> -	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> +	kvfree(container_of(head, struct memcg_shrinker_info, rcu));
>  }
>  
> -static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> -					 int size, int old_size)
> +static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
> +					  int size, int old_size)
>  {
> -	struct memcg_shrinker_map *new, *old;
> +	struct memcg_shrinker_info *new, *old;
>  	int nid;
>  
>  	for_each_node(nid) {
>  		old = rcu_dereference_protected(
> -			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> +			mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
>  		/* Not yet online memcg */
>  		if (!old)
>  			return 0;
> @@ -213,17 +213,17 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
>  		memset(new->map, (int)0xff, old_size);
>  		memset((void *)new->map + old_size, 0, size - old_size);
>  
> -		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> -		call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
> +		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> +		call_rcu(&old->rcu, memcg_free_shrinker_info_rcu);
>  	}
>  
>  	return 0;
>  }
>  
> -void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
> +void memcg_free_shrinker_info(struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup_per_node *pn;
> -	struct memcg_shrinker_map *map;
> +	struct memcg_shrinker_info *info;
>  	int nid;
>  
>  	if (mem_cgroup_is_root(memcg))
> @@ -231,16 +231,16 @@ void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
>  
>  	for_each_node(nid) {
>  		pn = mem_cgroup_nodeinfo(memcg, nid);
> -		map = rcu_dereference_protected(pn->shrinker_map, true);
> -		if (map)
> -			kvfree(map);
> -		rcu_assign_pointer(pn->shrinker_map, NULL);
> +		info = rcu_dereference_protected(pn->shrinker_info, true);
> +		if (info)
> +			kvfree(info);
> +		rcu_assign_pointer(pn->shrinker_info, NULL);
>  	}
>  }
>  
> -int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> +int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
>  {
> -	struct memcg_shrinker_map *map;
> +	struct memcg_shrinker_info *info;
>  	int nid, size, ret = 0;
>  
>  	if (mem_cgroup_is_root(memcg))
> @@ -249,20 +249,20 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>  	down_read(&shrinker_rwsem);
>  	size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
>  	for_each_node(nid) {
> -		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> -		if (!map) {
> -			memcg_free_shrinker_maps(memcg);
> +		info = kvzalloc(sizeof(*info) + size, GFP_KERNEL);
> +		if (!info) {
> +			memcg_free_shrinker_info(memcg);
>  			ret = -ENOMEM;
>  			break;
>  		}
> -		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
> +		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>  	}
>  	up_read(&shrinker_rwsem);
>  
>  	return ret;
>  }
>  
> -static int memcg_expand_shrinker_maps(int new_id)
> +static int memcg_expand_shrinker_info(int new_id)
>  {
>  	int size, old_size, ret = 0;
>  	struct mem_cgroup *memcg;
> @@ -279,7 +279,7 @@ static int memcg_expand_shrinker_maps(int new_id)
>  	do {
>  		if (mem_cgroup_is_root(memcg))
>  			continue;
> -		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
> +		ret = memcg_expand_one_shrinker_info(memcg, size, old_size);
>  		if (ret) {
>  			mem_cgroup_iter_break(NULL, memcg);
>  			goto out;
> @@ -293,13 +293,13 @@ static int memcg_expand_shrinker_maps(int new_id)
>  void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>  {
>  	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
> -		struct memcg_shrinker_map *map;
> +		struct memcg_shrinker_info *info;
>  
>  		rcu_read_lock();
> -		map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
> +		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
>  		/* Pairs with smp mb in shrink_slab() */
>  		smp_mb__before_atomic();
> -		set_bit(shrinker_id, map->map);
> +		set_bit(shrinker_id, info->map);
>  		rcu_read_unlock();
>  	}
>  }
> @@ -330,7 +330,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  		goto unlock;
>  
>  	if (id >= shrinker_nr_max) {
> -		if (memcg_expand_shrinker_maps(id)) {
> +		if (memcg_expand_shrinker_info(id)) {
>  			idr_remove(&shrinker_idr, id);
>  			goto unlock;
>  		}
> @@ -666,7 +666,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  			struct mem_cgroup *memcg, int priority)
>  {
> -	struct memcg_shrinker_map *map;
> +	struct memcg_shrinker_info *info;
>  	unsigned long ret, freed = 0;
>  	int i;
>  
> @@ -676,12 +676,12 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (!down_read_trylock(&shrinker_rwsem))
>  		return 0;
>  
> -	map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
> -					true);
> -	if (unlikely(!map))
> +	info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> +					 true);
> +	if (unlikely(!info))
>  		goto unlock;
>  
> -	for_each_set_bit(i, map->map, shrinker_nr_max) {
> +	for_each_set_bit(i, info->map, shrinker_nr_max) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
> @@ -692,7 +692,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  		shrinker = idr_find(&shrinker_idr, i);
>  		if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
>  			if (!shrinker)
> -				clear_bit(i, map->map);
> +				clear_bit(i, info->map);
>  			continue;
>  		}
>  
> @@ -703,7 +703,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY) {
> -			clear_bit(i, map->map);
> +			clear_bit(i, info->map);
>  			/*
>  			 * After the shrinker reported that it had no objects to
>  			 * free, but before we cleared the corresponding bit in
> 



* Re: [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code
  2021-01-05 22:58 ` [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code Yang Shi
@ 2021-01-07  0:13   ` Roman Gushchin
  2021-01-07 17:29     ` Yang Shi
  2021-01-11 19:00     ` Yang Shi
  0 siblings, 2 replies; 43+ messages in thread
From: Roman Gushchin @ 2021-01-07  0:13 UTC (permalink / raw)
  To: Yang Shi
  Cc: ktkhai, shakeelb, david, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Jan 05, 2021 at 02:58:08PM -0800, Yang Shi wrote:
> The shrinker map management is not really memcg specific, it's just allocation

In the current form it doesn't look so, especially because each name
has a memcg_ prefix and each function takes a memcg argument.

It begs for some refactorings (Kirill suggested some) and renamings.

Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v3 PATCH 08/11] mm: vmscan: use per memcg nr_deferred of shrinker
  2021-01-05 22:58 ` [v3 PATCH 08/11] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
@ 2021-01-07  0:17   ` Roman Gushchin
  2021-01-07 17:34     ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2021-01-07  0:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: ktkhai, shakeelb, david, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Tue, Jan 05, 2021 at 02:58:14PM -0800, Yang Shi wrote:
> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> will be used in the following cases:
>     1. Non memcg aware shrinkers
>     2. !CONFIG_MEMCG

It's better to depend on CONFIG_MEMCG_KMEM rather than CONFIG_MEMCG.
Without CONFIG_MEMCG_KMEM the kernel memory accounting is off, so
per-memcg shrinkers do not make any sense. The same applies for many
places in the patchset.

PS I like this version of the patchset much more than the previous one,
so it looks like it's going in the right direction.

Thanks!


>     3. memcg is disabled by boot parameter
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 81 +++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 69 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 72259253e414..f20ed8e928c2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -372,6 +372,27 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>  	up_write(&shrinker_rwsem);
>  }
>  
> +static long count_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> +				    struct mem_cgroup *memcg)
> +{
> +	struct memcg_shrinker_info *info;
> +
> +	info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> +					 true);
> +	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
> +}
> +
> +static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> +				  struct mem_cgroup *memcg)
> +{
> +	struct memcg_shrinker_info *info;
> +
> +	info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> +					 true);
> +
> +	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> +}
> +
>  static bool cgroup_reclaim(struct scan_control *sc)
>  {
>  	return sc->target_mem_cgroup;
> @@ -410,6 +431,18 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>  {
>  }
>  
> +static long count_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> +				    struct mem_cgroup *memcg)
> +{
> +	return 0;
> +}
> +
> +static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> +				  struct mem_cgroup *memcg)
> +{
> +	return 0;
> +}
> +
>  static bool cgroup_reclaim(struct scan_control *sc)
>  {
>  	return false;
> @@ -421,6 +454,39 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  }
>  #endif
>  
> +static long count_nr_deferred(struct shrinker *shrinker,
> +			      struct shrink_control *sc)
> +{
> +	int nid = sc->nid;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	if (sc->memcg &&
> +	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
> +		return count_nr_deferred_memcg(nid, shrinker,
> +					       sc->memcg);
> +
> +	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +}
> +
> +
> +static long set_nr_deferred(long nr, struct shrinker *shrinker,
> +			    struct shrink_control *sc)
> +{
> +	int nid = sc->nid;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	if (sc->memcg &&
> +	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
> +		return set_nr_deferred_memcg(nr, nid, shrinker,
> +					     sc->memcg);
> +
> +	return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> +}
> +
>  /*
>   * This misses isolated pages which are not accounted for to save counters.
>   * As the data only determines if reclaim or compaction continues, it is
> @@ -558,14 +624,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	long freeable;
>  	long nr;
>  	long new_nr;
> -	int nid = shrinkctl->nid;
>  	long batch_size = shrinker->batch ? shrinker->batch
>  					  : SHRINK_BATCH;
>  	long scanned = 0, next_deferred;
>  
> -	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> -		nid = 0;
> -
>  	freeable = shrinker->count_objects(shrinker, shrinkctl);
>  	if (freeable == 0 || freeable == SHRINK_EMPTY)
>  		return freeable;
> @@ -575,7 +637,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * and zero it so that other concurrent shrinker invocations
>  	 * don't also do this scanning work.
>  	 */
> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +	nr = count_nr_deferred(shrinker, shrinkctl);
>  
>  	total_scan = nr;
>  	if (shrinker->seeks) {
> @@ -666,14 +728,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		next_deferred = 0;
>  	/*
>  	 * move the unused scan count back into the shrinker in a
> -	 * manner that handles concurrent updates. If we exhausted the
> -	 * scan, there is no need to do an update.
> +	 * manner that handles concurrent updates.
>  	 */
> -	if (next_deferred > 0)
> -		new_nr = atomic_long_add_return(next_deferred,
> -						&shrinker->nr_deferred[nid]);
> -	else
> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> +	new_nr = set_nr_deferred(next_deferred, shrinker, shrinkctl);
>  
>  	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
>  	return freed;
> -- 
> 2.26.2
> 
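
For readers following the quoted patch, the take-and-return pattern behind
count_nr_deferred() and set_nr_deferred() can be sketched in user space with
C11 atomics. This is only an illustration under stated assumptions:
deferred_slot, take_deferred() and give_back_deferred() are hypothetical names,
and atomic_long stands in for the kernel's atomic_long_t. Adding zero returns
the current total, which appears to be why the patch can drop the separate
atomic_long_read() branch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for one nr_deferred slot; not a kernel type. */
typedef atomic_long deferred_slot;

/* Take all deferred work and zero the slot, like atomic_long_xchg(..., 0). */
static long take_deferred(deferred_slot *slot)
{
	return atomic_exchange(slot, 0L);
}

/* Hand back unused work and report the new total, like atomic_long_add_return(). */
static long give_back_deferred(deferred_slot *slot, long nr)
{
	/* atomic_fetch_add() returns the old value, so add nr to get the new one. */
	return atomic_fetch_add(slot, nr) + nr;
}
```

A caller would take the deferred count before scanning and give back whatever
it did not consume; a concurrent caller that takes in between simply sees zero.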

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code
  2021-01-07  0:13   ` Roman Gushchin
@ 2021-01-07 17:29     ` Yang Shi
  2021-01-11 19:00     ` Yang Shi
  1 sibling, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-07 17:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 4:14 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Jan 05, 2021 at 02:58:08PM -0800, Yang Shi wrote:
> > The shrinker map management is not really memcg specific, it's just allocation
>
> In the current form it doesn't look so, especially because each name
> has a memcg_ prefix and each function takes a memcg argument.

That statement from the commit log might be ambiguous and confusing. "Not
really memcg specific" doesn't mean it has nothing to do with memcg.
It is the intersection between memcg and shrinker, so I don't see why
it can't take a memcg argument. There are plenty of functions in
vmscan.c that take a memcg as an argument.

The direct reason for this consolidation is actually the following
patch which uses shrinker_rwsem to protect shrinker_maps allocation.
With this code consolidation we could keep the use of shrinker_rwsem
in one single file. And it also makes some sense to have shrinker
related code in vmscan.c, just like lruvec.

>
> It begs for some refactorings (Kirill suggested some) and renamings.

I apologize, but I can't remember which specific suggestions from
Kirill you mean. Removing the "memcg_" prefix makes some sense to me;
we don't have a "memcg_" prefix for lruvec either.

>
> Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v3 PATCH 08/11] mm: vmscan: use per memcg nr_deferred of shrinker
  2021-01-07  0:17   ` Roman Gushchin
@ 2021-01-07 17:34     ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-07 17:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 4:17 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Jan 05, 2021 at 02:58:14PM -0800, Yang Shi wrote:
> > Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> > will be used in the following cases:
> >     1. Non memcg aware shrinkers
> >     2. !CONFIG_MEMCG
>
> It's better to depend on CONFIG_MEMCG_KMEM rather than CONFIG_MEMCG.
> Without CONFIG_MEMCG_KMEM the kernel memory accounting is off, so
> per-memcg shrinkers do not make any sense. The same applies for many
> places in the patchset.

That is because kmem is not the only user of shrinkers. Deferred split
THPs are also split by a shrinker, and that shrinker is memcg aware as
well. It is not conventional "kmem".

Actually it was CONFIG_MEMCG_KMEM before; it was changed to
CONFIG_MEMCG by the memcg-aware deferred split THP patches.

>
> PS I like this version of the patchset much more than the previous one,
> so it looks like it's going in the right direction.

Thanks a lot for the help from you folks.

>
> Thanks!
>
>
> >     3. memcg is disabled by boot parameter
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 81 +++++++++++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 69 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 72259253e414..f20ed8e928c2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -372,6 +372,27 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
> >       up_write(&shrinker_rwsem);
> >  }
> >
> > +static long count_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> > +                                 struct mem_cgroup *memcg)
> > +{
> > +     struct memcg_shrinker_info *info;
> > +
> > +     info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> > +                                      true);
> > +     return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
> > +}
> > +
> > +static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> > +                               struct mem_cgroup *memcg)
> > +{
> > +     struct memcg_shrinker_info *info;
> > +
> > +     info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> > +                                      true);
> > +
> > +     return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> > +}
> > +
> >  static bool cgroup_reclaim(struct scan_control *sc)
> >  {
> >       return sc->target_mem_cgroup;
> > @@ -410,6 +431,18 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
> >  {
> >  }
> >
> > +static long count_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> > +                                 struct mem_cgroup *memcg)
> > +{
> > +     return 0;
> > +}
> > +
> > +static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> > +                               struct mem_cgroup *memcg)
> > +{
> > +     return 0;
> > +}
> > +
> >  static bool cgroup_reclaim(struct scan_control *sc)
> >  {
> >       return false;
> > @@ -421,6 +454,39 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  }
> >  #endif
> >
> > +static long count_nr_deferred(struct shrinker *shrinker,
> > +                           struct shrink_control *sc)
> > +{
> > +     int nid = sc->nid;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     if (sc->memcg &&
> > +         (shrinker->flags & SHRINKER_MEMCG_AWARE))
> > +             return count_nr_deferred_memcg(nid, shrinker,
> > +                                            sc->memcg);
> > +
> > +     return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +}
> > +
> > +
> > +static long set_nr_deferred(long nr, struct shrinker *shrinker,
> > +                         struct shrink_control *sc)
> > +{
> > +     int nid = sc->nid;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     if (sc->memcg &&
> > +         (shrinker->flags & SHRINKER_MEMCG_AWARE))
> > +             return set_nr_deferred_memcg(nr, nid, shrinker,
> > +                                          sc->memcg);
> > +
> > +     return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> > +}
> > +
> >  /*
> >   * This misses isolated pages which are not accounted for to save counters.
> >   * As the data only determines if reclaim or compaction continues, it is
> > @@ -558,14 +624,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >       long freeable;
> >       long nr;
> >       long new_nr;
> > -     int nid = shrinkctl->nid;
> >       long batch_size = shrinker->batch ? shrinker->batch
> >                                         : SHRINK_BATCH;
> >       long scanned = 0, next_deferred;
> >
> > -     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > -             nid = 0;
> > -
> >       freeable = shrinker->count_objects(shrinker, shrinkctl);
> >       if (freeable == 0 || freeable == SHRINK_EMPTY)
> >               return freeable;
> > @@ -575,7 +637,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >        * and zero it so that other concurrent shrinker invocations
> >        * don't also do this scanning work.
> >        */
> > -     nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +     nr = count_nr_deferred(shrinker, shrinkctl);
> >
> >       total_scan = nr;
> >       if (shrinker->seeks) {
> > @@ -666,14 +728,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >               next_deferred = 0;
> >       /*
> >        * move the unused scan count back into the shrinker in a
> > -      * manner that handles concurrent updates. If we exhausted the
> > -      * scan, there is no need to do an update.
> > +      * manner that handles concurrent updates.
> >        */
> > -     if (next_deferred > 0)
> > -             new_nr = atomic_long_add_return(next_deferred,
> > -                                             &shrinker->nr_deferred[nid]);
> > -     else
> > -             new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> > +     new_nr = set_nr_deferred(next_deferred, shrinker, shrinkctl);
> >
> >       trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
> >       return freed;
> > --
> > 2.26.2
> >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-06  9:54   ` Kirill Tkhai
@ 2021-01-11 17:08     ` Yang Shi
  2021-01-11 17:33       ` Kirill Tkhai
  0 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-11 17:08 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 1:55 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
> > exclusively, the read side can be protected by holding read lock, so it sounds
> > superfluous to have a dedicated mutex.  This should not exacerbate the contention
> > to shrinker_rwsem since just one read side critical section is added.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 16 ++++++----------
> >  1 file changed, 6 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9db7b4d6d0ae..ddb9f972f856 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  #ifdef CONFIG_MEMCG
> >
> >  static int memcg_shrinker_map_size;
> > -static DEFINE_MUTEX(memcg_shrinker_map_mutex);
> >
> >  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> >  {
> > @@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> >       struct memcg_shrinker_map *new, *old;
> >       int nid;
> >
> > -     lockdep_assert_held(&memcg_shrinker_map_mutex);
> > -
> >       for_each_node(nid) {
> >               old = rcu_dereference_protected(
> >                       mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> > @@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >       if (mem_cgroup_is_root(memcg))
> >               return 0;
> >
> > -     mutex_lock(&memcg_shrinker_map_mutex);
> > +     down_read(&shrinker_rwsem);
> >       size = memcg_shrinker_map_size;
> >       for_each_node(nid) {
> >               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> > @@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >               }
> >               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
>
> Here we do STORE operation, and since we want the assignment is visible
> for shrink_slab_memcg() under down_read(), we have to use down_write()
> in memcg_alloc_shrinker_maps().

I apologize for the late reply; these emails went to my spam folder again.

Before this patch it was not serialized by any lock either, right? Do
we have to serialize it? As Johannes mentioned, if shrinker_maps has
not been initialized yet, the memcg is a newborn and there should not
be a significant amount of reclaimable slab caches, so it is fine to
skip it. That point makes sense to me.

So, the read lock seems good enough.

>
> >       }
> > -     mutex_unlock(&memcg_shrinker_map_mutex);
> > +     up_read(&shrinker_rwsem);
> >
> >       return ret;
> >  }
> > @@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
> >       if (size <= old_size)
> >               return 0;
> >
> > -     mutex_lock(&memcg_shrinker_map_mutex);
> >       if (!root_mem_cgroup)
> > -             goto unlock;
> > +             goto out;
> >
> >       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >       do {
> > @@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
> >               ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
> >               if (ret) {
> >                       mem_cgroup_iter_break(NULL, memcg);
> > -                     goto unlock;
> > +                     goto out;
> >               }
> >       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> > -unlock:
> > +out:
> >       if (!ret)
> >               memcg_shrinker_map_size = size;
> > -     mutex_unlock(&memcg_shrinker_map_mutex);
> > +
> >       return ret;
> >  }
> >
> >
>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-11 17:08     ` Yang Shi
@ 2021-01-11 17:33       ` Kirill Tkhai
  2021-01-11 18:57         ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-11 17:33 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On 11.01.2021 20:08, Yang Shi wrote:
> On Wed, Jan 6, 2021 at 1:55 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 06.01.2021 01:58, Yang Shi wrote:
>>> Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
>>> exclusively, the read side can be protected by holding read lock, so it sounds
>>> superfluous to have a dedicated mutex.  This should not exacerbate the contention
>>> to shrinker_rwsem since just one read side critical section is added.
>>>
>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>> ---
>>>  mm/vmscan.c | 16 ++++++----------
>>>  1 file changed, 6 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9db7b4d6d0ae..ddb9f972f856 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
>>>  #ifdef CONFIG_MEMCG
>>>
>>>  static int memcg_shrinker_map_size;
>>> -static DEFINE_MUTEX(memcg_shrinker_map_mutex);
>>>
>>>  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
>>>  {
>>> @@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
>>>       struct memcg_shrinker_map *new, *old;
>>>       int nid;
>>>
>>> -     lockdep_assert_held(&memcg_shrinker_map_mutex);
>>> -
>>>       for_each_node(nid) {
>>>               old = rcu_dereference_protected(
>>>                       mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
>>> @@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>>>       if (mem_cgroup_is_root(memcg))
>>>               return 0;
>>>
>>> -     mutex_lock(&memcg_shrinker_map_mutex);
>>> +     down_read(&shrinker_rwsem);
>>>       size = memcg_shrinker_map_size;
>>>       for_each_node(nid) {
>>>               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
>>> @@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>>>               }
>>>               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
>>
>> Here we do STORE operation, and since we want the assignment is visible
>> for shrink_slab_memcg() under down_read(), we have to use down_write()
>> in memcg_alloc_shrinker_maps().
> 
> I apologize for the late reply, these emails went to my SPAM again.

This is the second time this problem has appeared. Just add my email address to
your allow list, and it won't happen again.
 
> Before this patch it was not serialized by any lock either, right? Do
> we have to serialize it? As Johannes mentioned if shrinker_maps has
> not been initialized yet, it means the memcg is a newborn, there
> should not be significant amount of reclaimable slab caches, so it is
> fine to skip it. The point makes some sense to me.
> 
> So, the read lock seems good enough.

No, this is not so.

Patch "[v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred" adds
new assignments:

+               info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
+               info->nr_deferred = (atomic_long_t *)((unsigned long)info +
+                                       sizeof(*info) + m_size);

info->map and info->nr_deferred are not guaranteed to be visible under the READ
lock in shrink_slab_memcg(), unless you take the WRITE lock in
memcg_alloc_shrinker_maps().

Nowhere in your patchset do you convert the READ lock to a WRITE lock in
memcg_alloc_shrinker_maps().

So just use the correct lock in this patch from the start.
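
To make the layout behind those two assignments concrete, here is a user-space
sketch of a single allocation that holds the header, the bitmap and the
per-shrinker counters back to back. It is an illustration only:
shrinker_info_sketch, info_alloc() and SKETCH_BITS_PER_LONG are hypothetical
names, calloc() stands in for kvzalloc(), and plain long replaces
atomic_long_t. The pointer stores at the end are the plain stores that readers
must be able to observe before they dereference info->map or
info->nr_deferred.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical mirror of the layout; field names follow the quoted patch. */
struct shrinker_info_sketch {
	unsigned long *map;   /* bitmap, one bit per shrinker id */
	long *nr_deferred;    /* one counter per shrinker id */
};

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static struct shrinker_info_sketch *info_alloc(size_t nr_ids)
{
	/* Bitmap size in bytes, rounded up to whole unsigned longs. */
	size_t m_size = (nr_ids + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG
			* sizeof(unsigned long);
	size_t d_size = nr_ids * sizeof(long);
	struct shrinker_info_sketch *info;

	info = calloc(1, sizeof(*info) + m_size + d_size);
	if (!info)
		return NULL;

	/* Both arrays live in the same allocation, right after the header. */
	info->map = (unsigned long *)((char *)info + sizeof(*info));
	info->nr_deferred = (long *)((char *)info + sizeof(*info) + m_size);
	return info;
}
```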
 
>>
>>>       }
>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
>>> +     up_read(&shrinker_rwsem);
>>>
>>>       return ret;
>>>  }
>>> @@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
>>>       if (size <= old_size)
>>>               return 0;
>>>
>>> -     mutex_lock(&memcg_shrinker_map_mutex);
>>>       if (!root_mem_cgroup)
>>> -             goto unlock;
>>> +             goto out;
>>>
>>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>>       do {
>>> @@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
>>>               ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
>>>               if (ret) {
>>>                       mem_cgroup_iter_break(NULL, memcg);
>>> -                     goto unlock;
>>> +                     goto out;
>>>               }
>>>       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>>> -unlock:
>>> +out:
>>>       if (!ret)
>>>               memcg_shrinker_map_size = size;
>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
>>> +
>>>       return ret;
>>>  }
>>>
>>>
>>
>>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size
  2021-01-06 10:15   ` Kirill Tkhai
@ 2021-01-11 17:44     ` Yang Shi
  2021-01-13 23:48     ` Yang Shi
  1 sibling, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-11 17:44 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 2:16 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Both memcg_shrinker_map_size and shrinker_nr_max is maintained, but actually the
> > map size can be calculated via shrinker_nr_max, so it seems unnecessary to keep both.
> > Remove memcg_shrinker_map_size since shrinker_nr_max is also used by iterating the
> > bit map.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 12 ++++--------
> >  1 file changed, 4 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ddb9f972f856..8da765a85569 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -185,8 +185,7 @@ static LIST_HEAD(shrinker_list);
> >  static DECLARE_RWSEM(shrinker_rwsem);
> >
> >  #ifdef CONFIG_MEMCG
> > -
> > -static int memcg_shrinker_map_size;
> > +static int shrinker_nr_max;
> >
> >  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> >  {
> > @@ -248,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >               return 0;
> >
> >       down_read(&shrinker_rwsem);
> > -     size = memcg_shrinker_map_size;
> > +     size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> >       for_each_node(nid) {
> >               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> >               if (!map) {
> > @@ -269,7 +268,7 @@ static int memcg_expand_shrinker_maps(int new_id)
> >       struct mem_cgroup *memcg;
> >
> >       size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> > -     old_size = memcg_shrinker_map_size;
> > +     old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> >       if (size <= old_size)
> >               return 0;
>
> These bunch of DIV_ROUND_UP() looks too complex. Since now all the shrinker maps allocation
> logic in the only file, can't we simplify this to look better? I mean something like below
> to merge in your patch:

Thanks for the suggestion. Will incorporate in v4.
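
The size arithmetic in the suggestion quoted below can be checked with a small
user-space sketch. The helper names (map_size_rounded(), map_size_simple(),
nr_max_after_expand()) and the SKETCH_ macros are hypothetical and only mirror
the expressions in the diff. The point it illustrates, as I read the
suggestion, is that shrinker_nr_max / BITS_PER_BYTE is exact only because
shrinker_nr_max is kept as a multiple of BITS_PER_LONG.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_BYTE CHAR_BIT
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap of nr bits, rounded up to whole unsigned longs. */
static size_t map_size_rounded(size_t nr)
{
	return (nr + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG
		* sizeof(unsigned long);
}

/* Shortcut form: exact only when nr is a multiple of BITS_PER_LONG. */
static size_t map_size_simple(size_t nr)
{
	return nr / SKETCH_BITS_PER_BYTE;
}

/* Capacity in bits after growing the map so that it can hold new_id. */
static size_t nr_max_after_expand(size_t new_id)
{
	return map_size_rounded(new_id + 1) * SKETCH_BITS_PER_BYTE;
}
```

With an unaligned count such as 1, the shortcut yields 0 bytes while the
rounded form yields sizeof(unsigned long), so the shortcut is only safe once
the maximum is stored in its rounded-up form.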

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b951c289ef3a..27b6371a1656 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -247,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>                 return 0;
>
>         down_read(&shrinker_rwsem);
> -       size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> +       size = shrinker_nr_max / BITS_PER_BYTE;
>         for_each_node(nid) {
>                 map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
>                 if (!map) {
> @@ -264,13 +264,11 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>
>  static int memcg_expand_shrinker_maps(int new_id)
>  {
> -       int size, old_size, ret = 0;
> +       int size, old_size, new_nr_max, ret = 0;
>         struct mem_cgroup *memcg;
>
>         size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> -       old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> -       if (size <= old_size)
> -               return 0;

BTW, it seems the above chunk needs to be kept.

> +       new_nr_max = size * BITS_PER_BYTE;
>
>         if (!root_mem_cgroup)
>                 goto out;
> @@ -287,6 +285,9 @@ static int memcg_expand_shrinker_maps(int new_id)
>         } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>
>  out:
> +       if (ret == 0)
> +               shrinker_nr_max = new_nr_max;
> +
>         return ret;
>  }
>
> @@ -334,8 +335,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>                         idr_remove(&shrinker_idr, id);
>                         goto unlock;
>                 }
> -
> -               shrinker_nr_max = id + 1;
>         }
>         shrinker->id = id;
>         ret = 0;
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered
  2021-01-06 10:21   ` Kirill Tkhai
@ 2021-01-11 18:17     ` Yang Shi
  2021-01-11 21:37       ` Kirill Tkhai
  0 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-11 18:17 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 2:22 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > This approach is fine with nr_deferred at the shrinker level, but the following
> > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> > from unregistering correctly.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/shrinker.h |  7 ++++---
> >  mm/vmscan.c              | 13 +++++++++----
> >  2 files changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 0f80123650e2..1eac79ce57d4 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -79,13 +79,14 @@ struct shrinker {
> >  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> >
> >  /* Flags */
> > -#define SHRINKER_NUMA_AWARE  (1 << 0)
> > -#define SHRINKER_MEMCG_AWARE (1 << 1)
> > +#define SHRINKER_REGISTERED  (1 << 0)
> > +#define SHRINKER_NUMA_AWARE  (1 << 1)
> > +#define SHRINKER_MEMCG_AWARE (1 << 2)
> >  /*
> >   * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> >   * non-MEMCG_AWARE shrinker should not have this flag set.
> >   */
> > -#define SHRINKER_NONSLAB     (1 << 2)
> > +#define SHRINKER_NONSLAB     (1 << 3)
> >
> >  extern int prealloc_shrinker(struct shrinker *shrinker);
> >  extern void register_shrinker_prepared(struct shrinker *shrinker);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8da765a85569..9761c7c27412 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -494,6 +494,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> >       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> >               idr_replace(&shrinker_idr, shrinker, shrinker->id);
> >  #endif
> > +     shrinker->flags |= SHRINKER_REGISTERED;
>
> If we introduce this new flag, we should kill the old SHRINKER_REGISTERING flag,
> which is not needed anymore (we should use the new flag instead of that).

The only thing that confuses me is the check in shrink_slab_memcg,
which does:

shrinker = idr_find(&shrinker_idr, i);
if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {

When the idr entry is allocated, the shrinker id is associated with
SHRINKER_REGISTERING. But shrink_slab_memcg acquires shrinker_rwsem
for read, and idr_alloc is called with shrinker_rwsem held for write,
so I suppose shrink_slab_memcg should never see a shrinker that is
still registering. If so, it seems easy to remove
SHRINKER_REGISTERING.

We just need to change that check to:
!shrinker || !(shrinker->flags & SHRINKER_REGISTERED)
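
A user-space sketch of that check, for illustration only: shrinker_sketch and
should_skip() are hypothetical names, and the flag values are copied from the
quoted patch. It assumes the idr slot holds either NULL or a pointer to a real
shrinker, never the SHRINKER_REGISTERING sentinel, which is exactly the
assumption stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values follow the quoted patch; the struct is a hypothetical stand-in. */
#define SHRINKER_REGISTERED  (1 << 0)
#define SHRINKER_NUMA_AWARE  (1 << 1)
#define SHRINKER_MEMCG_AWARE (1 << 2)

struct shrinker_sketch {
	unsigned int flags;
};

/* True when the slot should be skipped: empty, or present but not registered. */
static bool should_skip(const struct shrinker_sketch *shrinker)
{
	return !shrinker || !(shrinker->flags & SHRINKER_REGISTERED);
}
```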

> >       up_write(&shrinker_rwsem);
> >  }
> >
> > @@ -513,13 +514,17 @@ EXPORT_SYMBOL(register_shrinker);
> >   */
> >  void unregister_shrinker(struct shrinker *shrinker)
> >  {
> > -     if (!shrinker->nr_deferred)
> > -             return;
> > -     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > -             unregister_memcg_shrinker(shrinker);
> >       down_write(&shrinker_rwsem);
>
> I do not think there are some users which registration may race with unregistration.
> So, I think we should check SHRINKER_REGISTERED unlocked similar to we used to check
> shrinker->nr_deferred unlocked.

Yes, I agree.

>
> > +     if (!(shrinker->flags & SHRINKER_REGISTERED)) {
> > +             up_write(&shrinker_rwsem);
> > +             return;
> > +     }
> >       list_del(&shrinker->list);
> > +     shrinker->flags &= ~SHRINKER_REGISTERED;
> >       up_write(&shrinker_rwsem);
> > +
> > +     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > +             unregister_memcg_shrinker(shrinker);
> >       kfree(shrinker->nr_deferred);
> >       shrinker->nr_deferred = NULL;
> >  }
> >
>
>

* Re: [v3 PATCH 06/11] mm: memcontrol: rename shrinker_map to shrinker_info
  2021-01-06 11:38   ` Kirill Tkhai
@ 2021-01-11 18:19     ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-11 18:19 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 3:39 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > The following patch is going to add nr_deferred into shrinker_map, so shrinker_map
> > will no longer contain only the map; rename it to a more general
> > name.  This should make the patch adding nr_deferred cleaner and more readable, and make
> > review easier.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  8 ++---
> >  mm/memcontrol.c            |  6 ++--
> >  mm/vmscan.c                | 66 +++++++++++++++++++-------------------
> >  3 files changed, 40 insertions(+), 40 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index d128d2842f22..e05bbe8277cc 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -96,7 +96,7 @@ struct lruvec_stat {
> >   * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> >   * which have elements charged to this memcg.
> >   */
> > -struct memcg_shrinker_map {
> > +struct memcg_shrinker_info {
>
> Having reviewed your next patch, which actively uses the new fields in this structure,
> I strongly insist on renaming it to "struct shrinker_info" instead.
>
> Otherwise, lines of function declarations become too long.

Yes, agreed. Will incorporate in v4.

>
> >       struct rcu_head rcu;
> >       unsigned long map[];
> >  };
> > @@ -118,7 +118,7 @@ struct mem_cgroup_per_node {
> >
> >       struct mem_cgroup_reclaim_iter  iter;
> >
> > -     struct memcg_shrinker_map __rcu *shrinker_map;
> > +     struct memcg_shrinker_info __rcu        *shrinker_info;
> >
> >       struct rb_node          tree_node;      /* RB tree node */
> >       unsigned long           usage_in_excess;/* Set to the value by which */
> > @@ -1581,8 +1581,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
> >       return false;
> >  }
> >
> > -extern int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg);
> > -extern void memcg_free_shrinker_maps(struct mem_cgroup *memcg);
> > +extern int memcg_alloc_shrinker_info(struct mem_cgroup *memcg);
> > +extern void memcg_free_shrinker_info(struct mem_cgroup *memcg);
> >  extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> >                                  int nid, int shrinker_id);
> >  #else
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 817dde366258..126f1fd550c8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5248,11 +5248,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> >       struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> >
> >       /*
> > -      * A memcg must be visible for memcg_expand_shrinker_maps()
> > +      * A memcg must be visible for memcg_expand_shrinker_info()
> >        * by the time the maps are allocated. So, we allocate maps
> >        * here, when for_each_mem_cgroup() can't skip it.
> >        */
> > -     if (memcg_alloc_shrinker_maps(memcg)) {
> > +     if (memcg_alloc_shrinker_info(memcg)) {
> >               mem_cgroup_id_remove(memcg);
> >               return -ENOMEM;
> >       }
> > @@ -5316,7 +5316,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> >       vmpressure_cleanup(&memcg->vmpressure);
> >       cancel_work_sync(&memcg->high_work);
> >       mem_cgroup_remove_from_trees(memcg);
> > -     memcg_free_shrinker_maps(memcg);
> > +     memcg_free_shrinker_info(memcg);
> >       memcg_free_kmem(memcg);
> >       mem_cgroup_free(memcg);
> >  }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9761c7c27412..0033659abf9e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -187,20 +187,20 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  #ifdef CONFIG_MEMCG
> >  static int shrinker_nr_max;
> >
> > -static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> > +static void memcg_free_shrinker_info_rcu(struct rcu_head *head)
> >  {
> > -     kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > +     kvfree(container_of(head, struct memcg_shrinker_info, rcu));
> >  }
> >
> > -static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> > -                                      int size, int old_size)
> > +static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
> > +                                       int size, int old_size)
> >  {
> > -     struct memcg_shrinker_map *new, *old;
> > +     struct memcg_shrinker_info *new, *old;
> >       int nid;
> >
> >       for_each_node(nid) {
> >               old = rcu_dereference_protected(
> > -                     mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> > +                     mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
> >               /* Not yet online memcg */
> >               if (!old)
> >                       return 0;
> > @@ -213,17 +213,17 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> >               memset(new->map, (int)0xff, old_size);
> >               memset((void *)new->map + old_size, 0, size - old_size);
> >
> > -             rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> > -             call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
> > +             rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> > +             call_rcu(&old->rcu, memcg_free_shrinker_info_rcu);
> >       }
> >
> >       return 0;
> >  }
> >
> > -void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
> > +void memcg_free_shrinker_info(struct mem_cgroup *memcg)
> >  {
> >       struct mem_cgroup_per_node *pn;
> > -     struct memcg_shrinker_map *map;
> > +     struct memcg_shrinker_info *info;
> >       int nid;
> >
> >       if (mem_cgroup_is_root(memcg))
> > @@ -231,16 +231,16 @@ void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
> >
> >       for_each_node(nid) {
> >               pn = mem_cgroup_nodeinfo(memcg, nid);
> > -             map = rcu_dereference_protected(pn->shrinker_map, true);
> > -             if (map)
> > -                     kvfree(map);
> > -             rcu_assign_pointer(pn->shrinker_map, NULL);
> > +             info = rcu_dereference_protected(pn->shrinker_info, true);
> > +             if (info)
> > +                     kvfree(info);
> > +             rcu_assign_pointer(pn->shrinker_info, NULL);
> >       }
> >  }
> >
> > -int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> > +int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
> >  {
> > -     struct memcg_shrinker_map *map;
> > +     struct memcg_shrinker_info *info;
> >       int nid, size, ret = 0;
> >
> >       if (mem_cgroup_is_root(memcg))
> > @@ -249,20 +249,20 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >       down_read(&shrinker_rwsem);
> >       size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> >       for_each_node(nid) {
> > -             map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> > -             if (!map) {
> > -                     memcg_free_shrinker_maps(memcg);
> > +             info = kvzalloc(sizeof(*info) + size, GFP_KERNEL);
> > +             if (!info) {
> > +                     memcg_free_shrinker_info(memcg);
> >                       ret = -ENOMEM;
> >                       break;
> >               }
> > -             rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
> > +             rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> >       }
> >       up_read(&shrinker_rwsem);
> >
> >       return ret;
> >  }
> >
> > -static int memcg_expand_shrinker_maps(int new_id)
> > +static int memcg_expand_shrinker_info(int new_id)
> >  {
> >       int size, old_size, ret = 0;
> >       struct mem_cgroup *memcg;
> > @@ -279,7 +279,7 @@ static int memcg_expand_shrinker_maps(int new_id)
> >       do {
> >               if (mem_cgroup_is_root(memcg))
> >                       continue;
> > -             ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
> > +             ret = memcg_expand_one_shrinker_info(memcg, size, old_size);
> >               if (ret) {
> >                       mem_cgroup_iter_break(NULL, memcg);
> >                       goto out;
> > @@ -293,13 +293,13 @@ static int memcg_expand_shrinker_maps(int new_id)
> >  void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> >  {
> >       if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
> > -             struct memcg_shrinker_map *map;
> > +             struct memcg_shrinker_info *info;
> >
> >               rcu_read_lock();
> > -             map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
> > +             info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> >               /* Pairs with smp mb in shrink_slab() */
> >               smp_mb__before_atomic();
> > -             set_bit(shrinker_id, map->map);
> > +             set_bit(shrinker_id, info->map);
> >               rcu_read_unlock();
> >       }
> >  }
> > @@ -330,7 +330,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >               goto unlock;
> >
> >       if (id >= shrinker_nr_max) {
> > -             if (memcg_expand_shrinker_maps(id)) {
> > +             if (memcg_expand_shrinker_info(id)) {
> >                       idr_remove(&shrinker_idr, id);
> >                       goto unlock;
> >               }
> > @@ -666,7 +666,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >                       struct mem_cgroup *memcg, int priority)
> >  {
> > -     struct memcg_shrinker_map *map;
> > +     struct memcg_shrinker_info *info;
> >       unsigned long ret, freed = 0;
> >       int i;
> >
> > @@ -676,12 +676,12 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >       if (!down_read_trylock(&shrinker_rwsem))
> >               return 0;
> >
> > -     map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
> > -                                     true);
> > -     if (unlikely(!map))
> > +     info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> > +                                      true);
> > +     if (unlikely(!info))
> >               goto unlock;
> >
> > -     for_each_set_bit(i, map->map, shrinker_nr_max) {
> > +     for_each_set_bit(i, info->map, shrinker_nr_max) {
> >               struct shrink_control sc = {
> >                       .gfp_mask = gfp_mask,
> >                       .nid = nid,
> > @@ -692,7 +692,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >               shrinker = idr_find(&shrinker_idr, i);
> >               if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
> >                       if (!shrinker)
> > -                             clear_bit(i, map->map);
> > +                             clear_bit(i, info->map);
> >                       continue;
> >               }
> >
> > @@ -703,7 +703,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> >
> >               ret = do_shrink_slab(&sc, shrinker, priority);
> >               if (ret == SHRINK_EMPTY) {
> > -                     clear_bit(i, map->map);
> > +                     clear_bit(i, info->map);
> >                       /*
> >                        * After the shrinker reported that it had no objects to
> >                        * free, but before we cleared the corresponding bit in
> >
>
>

* Re: [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred
  2021-01-06 11:06   ` Kirill Tkhai
@ 2021-01-11 18:24     ` Yang Shi
  2021-01-13 23:30     ` Yang Shi
  1 sibling, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-11 18:24 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 3:07 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Currently the number of deferred objects is per shrinker, but some slabs, for example,
> > the vfs inode/dentry caches, are per memcg; this results in poor isolation among memcgs.
> >
> > The deferred objects are typically generated by GFP_NOFS allocations. One memcg with
> > excessive GFP_NOFS allocations may blow up the deferred objects, and then other innocent memcgs
> > may suffer from over-shrinking, excessive reclaim latency, etc.
> >
> > For example, two workloads run in memcgA and memcgB respectively, and the workload in B is a vfs
> > heavy workload.  The workload in A generates excessive deferred objects, then B's vfs cache
> > might be hit heavily (dropping half of the caches) by B's limit reclaim or global reclaim.
> >
> > We observed this hit in our production environment, which was running a vfs heavy workload,
> > as shown in the tracing log below:
> >
> > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > cache items 246404277 delta 31345 total_scan 123202138
> > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > last shrinker return val 123186855
> >
> > The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches were dropped.
> > This also resulted in a significant amount of page cache being dropped due to inode eviction.
> >
> > Making nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > better isolation.
> >
> > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  7 +++---
> >  mm/vmscan.c                | 49 +++++++++++++++++++++++++-------------
> >  2 files changed, 37 insertions(+), 19 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e05bbe8277cc..5599082df623 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -93,12 +93,13 @@ struct lruvec_stat {
> >  };
> >
> >  /*
> > - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> > - * which have elements charged to this memcg.
> > + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> > + * shrinkers, which have elements charged to this memcg.
> >   */
> >  struct memcg_shrinker_info {
> >       struct rcu_head rcu;
> > -     unsigned long map[];
> > +     unsigned long *map;
> > +     atomic_long_t *nr_deferred;
> >  };
> >
> >  /*
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0033659abf9e..72259253e414 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -193,10 +193,12 @@ static void memcg_free_shrinker_info_rcu(struct rcu_head *head)
> >  }
> >
> >  static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
> > -                                       int size, int old_size)
> > +                                       int m_size, int d_size,
> > +                                       int old_m_size, int old_d_size)
> >  {
> >       struct memcg_shrinker_info *new, *old;
> >       int nid;
> > +     int size = m_size + d_size;
> >
> >       for_each_node(nid) {
> >               old = rcu_dereference_protected(
> > @@ -209,9 +211,18 @@ static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
> >               if (!new)
> >                       return -ENOMEM;
> >
> > -             /* Set all old bits, clear all new bits */
> > -             memset(new->map, (int)0xff, old_size);
> > -             memset((void *)new->map + old_size, 0, size - old_size);
> > +             new->map = (unsigned long *)((unsigned long)new + sizeof(*new));
> > +             new->nr_deferred = (atomic_long_t *)((unsigned long)new +
> > +                                     sizeof(*new) + m_size);
>
> Can't we write this more compactly?
>
>                 new->map = (unsigned long *)(new + 1);
>                 new->nr_deferred = (atomic_long_t *)((void *)new->map + m_size);

Thanks for the suggestion, will incorporate it in v4.
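
For reference, the single-allocation layout under discussion (header,
then bitmap, then counters) can be sketched in userspace as follows.
struct info_model and info_model_alloc() are hypothetical names, plain
long stands in for atomic_long_t, and calloc() stands in for kvzalloc().

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_LONG		(CHAR_BIT * sizeof(unsigned long))
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

struct info_model {
	unsigned long *map;	/* one bit per shrinker id */
	long *nr_deferred;	/* one counter per shrinker id */
};

/* Allocate header, bitmap and counters as one zeroed block. */
static struct info_model *info_model_alloc(size_t nr_ids)
{
	size_t m_size = DIV_ROUND_UP(nr_ids, BITS_PER_LONG) * sizeof(unsigned long);
	size_t d_size = nr_ids * sizeof(long);
	struct info_model *info = calloc(1, sizeof(*info) + m_size + d_size);

	if (!info)
		return NULL;

	/* The bitmap starts right behind the header. */
	info->map = (unsigned long *)(info + 1);
	/* The counters start m_size bytes behind the start of the bitmap. */
	info->nr_deferred = (long *)((char *)info->map + m_size);
	return info;
}
```

Note that the counters must be offset by the full bitmap size in bytes,
not by a single element, since the bitmap may span several words.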

>
> > +
> > +             /* map: set all old bits, clear all new bits */
> > +             memset(new->map, (int)0xff, old_m_size);
> > +             memset((void *)new->map + old_m_size, 0, m_size - old_m_size);
> > +             /* nr_deferred: copy old values, clear all new values */
> > +             memcpy((void *)new->nr_deferred, (void *)old->nr_deferred,
> > +                    old_d_size);
>
> Why not
>                 memcpy(new->nr_deferred, old->nr_deferred, old_d_size);
> ?

Will fix in v4.
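
The copy semantics of the expand path (old bitmap bits all set, old
counters carried over, new space zeroed) can be sketched in userspace as
below. info_model_expand() is a hypothetical helper working on
caller-provided buffers, with plain long standing in for atomic_long_t;
sizes are in bytes, as in the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Fill the larger arrays from the smaller ones.  memcpy()/memset() take
 * void pointers, so object pointers convert implicitly and no casts are
 * needed except where byte arithmetic is done.
 */
static void info_model_expand(unsigned long *new_map, long *new_deferred,
			      const long *old_deferred,
			      size_t m_size, size_t d_size,
			      size_t old_m_size, size_t old_d_size)
{
	assert(m_size >= old_m_size && d_size >= old_d_size);

	/* map: set all old bits, clear all new bits */
	memset(new_map, 0xff, old_m_size);
	memset((char *)new_map + old_m_size, 0, m_size - old_m_size);

	/* nr_deferred: copy old values, clear all new values */
	memcpy(new_deferred, old_deferred, old_d_size);
	memset((char *)new_deferred + old_d_size, 0, d_size - old_d_size);
}
```

Setting every old bit is conservative: a set bit only means the shrinker
may have objects, and the shrink path clears it again when the shrinker
reports it is empty.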

>
> > +             memset((void *)new->nr_deferred + old_d_size, 0,
> > +                    d_size - old_d_size);
> >
> >               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> >               call_rcu(&old->rcu, memcg_free_shrinker_info_rcu);
> > @@ -226,9 +237,6 @@ void memcg_free_shrinker_info(struct mem_cgroup *memcg)
> >       struct memcg_shrinker_info *info;
> >       int nid;
> >
> > -     if (mem_cgroup_is_root(memcg))
> > -             return;
> > -
> >       for_each_node(nid) {
> >               pn = mem_cgroup_nodeinfo(memcg, nid);
> >               info = rcu_dereference_protected(pn->shrinker_info, true);
> > @@ -242,12 +250,13 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
> >  {
> >       struct memcg_shrinker_info *info;
> >       int nid, size, ret = 0;
> > -
> > -     if (mem_cgroup_is_root(memcg))
> > -             return 0;
> > +     int m_size, d_size = 0;
> >
> >       down_read(&shrinker_rwsem);
> > -     size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     d_size = shrinker_nr_max * sizeof(atomic_long_t);
> > +     size = m_size + d_size;
> > +
> >       for_each_node(nid) {
> >               info = kvzalloc(sizeof(*info) + size, GFP_KERNEL);
> >               if (!info) {
> > @@ -255,6 +264,9 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
> >                       ret = -ENOMEM;
> >                       break;
> >               }
> > +             info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
> > +             info->nr_deferred = (atomic_long_t *)((unsigned long)info +
> > +                                     sizeof(*info) + m_size);
> >               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> >       }
> >       up_read(&shrinker_rwsem);
> > @@ -265,10 +277,16 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
> >  static int memcg_expand_shrinker_info(int new_id)
> >  {
> >       int size, old_size, ret = 0;
> > +     int m_size, d_size = 0;
> > +     int old_m_size, old_d_size = 0;
> >       struct mem_cgroup *memcg;
> >
> > -     size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> > -     old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     m_size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> > +     d_size = (new_id + 1) * sizeof(atomic_long_t);
> > +     size = m_size + d_size;
> > +     old_m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     old_d_size = shrinker_nr_max * sizeof(atomic_long_t);
> > +     old_size = old_m_size + old_d_size;
> >       if (size <= old_size)
> >               return 0;
>
> This replication of patch [4/11] looks awkward. Please try to incorporate
> the same changes for nr_deferred as I requested for shrinker_map in [4/11].

Sure. Thanks.

>
> >
> > @@ -277,9 +295,8 @@ static int memcg_expand_shrinker_info(int new_id)
> >
> >       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >       do {
> > -             if (mem_cgroup_is_root(memcg))
> > -                     continue;
> > -             ret = memcg_expand_one_shrinker_info(memcg, size, old_size);
> > +             ret = memcg_expand_one_shrinker_info(memcg, m_size, d_size,
> > +                                                  old_m_size, old_d_size);
> >               if (ret) {
> >                       mem_cgroup_iter_break(NULL, memcg);
> >                       goto out;
> >
>
>

* Re: [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2021-01-06 11:15   ` Kirill Tkhai
@ 2021-01-11 18:40     ` Yang Shi
  2021-01-11 21:57       ` Kirill Tkhai
  0 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-11 18:40 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 3:16 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Now nr_deferred is available at the per memcg level for memcg aware shrinkers, so there is
> > no need to allocate shrinker->nr_deferred for such shrinkers anymore.
> >
> > The prealloc_memcg_shrinker() would return -ENOSYS if !CONFIG_MEMCG or memcg is disabled
> > by kernel command line, then shrinker's SHRINKER_MEMCG_AWARE flag would be cleared.
> > This makes the implementation of this patch simpler.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 33 ++++++++++++++++++---------------
> >  1 file changed, 18 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f20ed8e928c2..d9795fb0f1c5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -340,6 +340,9 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  {
> >       int id, ret = -ENOMEM;
> >
> > +     if (mem_cgroup_disabled())
> > +             return -ENOSYS;
> > +
> >       down_write(&shrinker_rwsem);
> >       /* This may call shrinker, so it must use down_read_trylock() */
> >       id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> > @@ -424,7 +427,7 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  #else
> >  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  {
> > -     return 0;
> > +     return -ENOSYS;
> >  }
> >
> >  static void unregister_memcg_shrinker(struct shrinker *shrinker)
> > @@ -535,8 +538,20 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
> >   */
> >  int prealloc_shrinker(struct shrinker *shrinker)
> >  {
> > -     unsigned int size = sizeof(*shrinker->nr_deferred);
> > +     unsigned int size;
> > +     int err;
> > +
> > +     if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> > +             err = prealloc_memcg_shrinker(shrinker);
> > +             if (!err)
> > +                     return 0;
> > +             if (err != -ENOSYS)
> > +                     return err;
> > +
> > +             shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
>
> This looks very confusing.
>
> If you want to disable the preallocation branch for the !MEMCG case,
> you should first consider something like the below:

Not only the !CONFIG_MEMCG case, but also the "cgroup_disable=memory" case.

>
> #ifdef CONFIG_MEMCG
> #define SHRINKER_MEMCG_AWARE    (1 << 2)
> #else
> #define SHRINKER_MEMCG_AWARE    0
> #endif

This could handle the !CONFIG_MEMCG case, but can't deal with the
"cgroup_disable=memory" case. We could consider checking
mem_cgroup_disabled() when initializing the shrinker, but this may result
in touching fs code like below:

--- a/fs/super.c
+++ b/fs/super.c
@@ -266,7 +266,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
        s->s_shrink.scan_objects = super_cache_scan;
        s->s_shrink.count_objects = super_cache_count;
        s->s_shrink.batch = 1024;
-       s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+       s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+       if (!mem_cgroup_disabled())
+               s->s_shrink.flags |= SHRINKER_MEMCG_AWARE;
        if (prealloc_shrinker(&s->s_shrink))
                goto fail;
        if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
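
For comparison, the -ENOSYS fallback in the patch under review can be
modeled in userspace roughly as follows. This is a sketch only:
memcg_disabled_model stands in for mem_cgroup_disabled(), the *_model
names are hypothetical, calloc() stands in for kzalloc(), and the flag
values are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define SHRINKER_NUMA_AWARE	(1u << 0)	/* illustrative values */
#define SHRINKER_MEMCG_AWARE	(1u << 2)

struct shrinker_model {
	unsigned int flags;
	long *nr_deferred;
};

static bool memcg_disabled_model;	/* models "cgroup_disable=memory" */
static size_t nr_node_ids_model = 2;

static int prealloc_memcg_model(struct shrinker_model *s)
{
	(void)s;
	return memcg_disabled_model ? -ENOSYS : 0;
}

static int prealloc_model(struct shrinker_model *s)
{
	size_t size;
	int err;

	if (s->flags & SHRINKER_MEMCG_AWARE) {
		err = prealloc_memcg_model(s);
		if (!err)
			return 0;	/* per-memcg counters are used instead */
		if (err != -ENOSYS)
			return err;
		/* memcg unavailable: fall back to a plain shrinker */
		s->flags &= ~SHRINKER_MEMCG_AWARE;
	}

	size = sizeof(*s->nr_deferred);
	if (s->flags & SHRINKER_NUMA_AWARE)
		size *= nr_node_ids_model;

	s->nr_deferred = calloc(1, size);
	return s->nr_deferred ? 0 : -ENOMEM;
}
```

The fallback keeps the decision inside the shrinker core, so callers such
as alloc_super() would not need to test mem_cgroup_disabled() themselves.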


>
> > +     }
> >
> > +     size = sizeof(*shrinker->nr_deferred);
> >       if (shrinker->flags & SHRINKER_NUMA_AWARE)
> >               size *= nr_node_ids;
> >
> > @@ -544,26 +559,14 @@ int prealloc_shrinker(struct shrinker *shrinker)
> >       if (!shrinker->nr_deferred)
> >               return -ENOMEM;
> >
> > -     if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> > -             if (prealloc_memcg_shrinker(shrinker))
> > -                     goto free_deferred;
> > -     }
> >
> >       return 0;
> > -
> > -free_deferred:
> > -     kfree(shrinker->nr_deferred);
> > -     shrinker->nr_deferred = NULL;
> > -     return -ENOMEM;
> >  }
> >
> >  void free_prealloced_shrinker(struct shrinker *shrinker)
> >  {
> > -     if (!shrinker->nr_deferred)
> > -             return;
> > -
> >       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > -             unregister_memcg_shrinker(shrinker);
> > +             return unregister_memcg_shrinker(shrinker);
> >
> >       kfree(shrinker->nr_deferred);
> >       shrinker->nr_deferred = NULL;
> >
>
>

* Re: [v3 PATCH 10/11] mm: memcontrol: reparent nr_deferred when memcg offline
  2021-01-06 11:34   ` Kirill Tkhai
@ 2021-01-11 18:43     ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-11 18:43 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 3:35 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add to parent's
> > corresponding nr_deferred when memcg offline.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  1 +
> >  mm/memcontrol.c            |  1 +
> >  mm/vmscan.c                | 29 +++++++++++++++++++++++++++++
> >  3 files changed, 31 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 5599082df623..d1e52e916cc2 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1586,6 +1586,7 @@ extern int memcg_alloc_shrinker_info(struct mem_cgroup *memcg);
> >  extern void memcg_free_shrinker_info(struct mem_cgroup *memcg);
> >  extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
> >                                  int nid, int shrinker_id);
> > +extern void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg);
> >  #else
> >  #define mem_cgroup_sockets_enabled 0
> >  static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 126f1fd550c8..19e555675582 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5284,6 +5284,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> >       page_counter_set_low(&memcg->memory, 0);
> >
> >       memcg_offline_kmem(memcg);
> > +     memcg_reparent_shrinker_deferred(memcg);
> >       wb_memcg_offline(memcg);
> >
> >       drain_all_stock(memcg);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d9795fb0f1c5..71056057d26d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -396,6 +396,35 @@ static long set_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> >       return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> >  }
> >
> > +void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
> > +{
> > +     int i, nid;
> > +     long nr;
> > +     struct mem_cgroup *parent;
> > +     struct memcg_shrinker_info *child_info, *parent_info;
> > +
> > +     parent = parent_mem_cgroup(memcg);
> > +     if (!parent)
> > +             parent = root_mem_cgroup;
> > +
> > +     /* Prevent from concurrent shrinker_info expand */
> > +     down_read(&shrinker_rwsem);
> > +     for_each_node(nid) {
> > +             child_info = rcu_dereference_protected(
> > +                                     memcg->nodeinfo[nid]->shrinker_info,
> > +                                     true);
> > +             parent_info = rcu_dereference_protected(
> > +                                     parent->nodeinfo[nid]->shrinker_info,
> > +                                     true);
>
> A simple assignment shouldn't take up this much space; we have to do something about that.
>
> The number of these
>
>         rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info, true)
>
> has become too big, and we can't allow each of them to take 3 lines.
>
> We should introduce a short helper to dereference this, so that we can give
> our attention to the really difficult logic instead of wasting it on parsing this.
>
>                 child_info = memcg_shrinker_info(memcg, nid);
> or
>                 child_info = memcg_shrinker_info_protected(memcg, nid);
>
> Both of them fit on a single line.
>
> struct memcg_shrinker_info *memcg_shrinker_info_protected(
>                                         struct mem_cgroup *memcg, int nid)
> {
>         return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
>                                          lockdep_is_held(&shrinker_rwsem));
> }

Thanks for the suggestion, it makes sense to me. Will incorporate it in v4.

>
>
> > +             for (i = 0; i < shrinker_nr_max; i++) {
> > +                     nr = atomic_long_read(&child_info->nr_deferred[i]);
> > +                     atomic_long_add(nr,
> > +                                     &parent_info->nr_deferred[i]);
>
> Why is there a new line here? If you merge it up, it will be even shorter than the previous line.

Just to keep it within 80 columns. We could relax it.
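
The reparenting loop from the quoted patch can be modeled in userspace
with C11 atomics as below. This is a sketch under simplifying
assumptions: fixed-size arrays replace the per-node shrinker_info, the
struct and function names are hypothetical, and no locking against
concurrent expansion is modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_NODES	2	/* illustrative sizes */
#define NR_SHRINKERS	4

struct memcg_model {
	struct memcg_model *parent;
	atomic_long nr_deferred[NR_NODES][NR_SHRINKERS];
};

/* Add every deferred count of an offlined memcg to its parent (or root). */
static void reparent_deferred_model(struct memcg_model *memcg,
				    struct memcg_model *root)
{
	struct memcg_model *parent = memcg->parent ? memcg->parent : root;
	int nid, i;

	for (nid = 0; nid < NR_NODES; nid++) {
		for (i = 0; i < NR_SHRINKERS; i++) {
			long nr = atomic_load(&memcg->nr_deferred[nid][i]);

			atomic_fetch_add(&parent->nr_deferred[nid][i], nr);
		}
	}
}
```

As in the patch, the child's counters are only read, not cleared, since
the child is going away.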

>
> > +             }
> > +     }
> > +     up_read(&shrinker_rwsem);
> > +}
> > +
> >  static bool cgroup_reclaim(struct scan_control *sc)
> >  {
> >       return sc->target_mem_cgroup;
> >
>
>

* Re: [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-11 17:33       ` Kirill Tkhai
@ 2021-01-11 18:57         ` Yang Shi
  2021-01-11 21:33           ` Kirill Tkhai
  0 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-11 18:57 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Jan 11, 2021 at 9:34 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 11.01.2021 20:08, Yang Shi wrote:
> > On Wed, Jan 6, 2021 at 1:55 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 06.01.2021 01:58, Yang Shi wrote:
> >>> Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
> >>> exclusively, the read side can be protected by holding the read lock, so it seems
> >>> superfluous to have a dedicated mutex.  This should not exacerbate contention
> >>> on shrinker_rwsem since just one read-side critical section is added.
> >>>
> >>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> >>> ---
> >>>  mm/vmscan.c | 16 ++++++----------
> >>>  1 file changed, 6 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 9db7b4d6d0ae..ddb9f972f856 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >>>  #ifdef CONFIG_MEMCG
> >>>
> >>>  static int memcg_shrinker_map_size;
> >>> -static DEFINE_MUTEX(memcg_shrinker_map_mutex);
> >>>
> >>>  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> >>>  {
> >>> @@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> >>>       struct memcg_shrinker_map *new, *old;
> >>>       int nid;
> >>>
> >>> -     lockdep_assert_held(&memcg_shrinker_map_mutex);
> >>> -
> >>>       for_each_node(nid) {
> >>>               old = rcu_dereference_protected(
> >>>                       mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> >>> @@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >>>       if (mem_cgroup_is_root(memcg))
> >>>               return 0;
> >>>
> >>> -     mutex_lock(&memcg_shrinker_map_mutex);
> >>> +     down_read(&shrinker_rwsem);
> >>>       size = memcg_shrinker_map_size;
> >>>       for_each_node(nid) {
> >>>               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> >>> @@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >>>               }
> >>>               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
> >>
> >> Here we do a STORE operation, and since we want the assignment to be visible
> >> to shrink_slab_memcg() under down_read(), we have to use down_write()
> >> in memcg_alloc_shrinker_maps().
> >
> > I apologize for the late reply, these emails went to my SPAM again.
>
> This is the second time the problem has appeared. Just add my email address to your allow list,
> and the problem won't happen again.

Yes, I thought clicking "not spam" would add your email address to the
allow list automatically. But it turns out that is not true.

>
> > Before this patch it was not serialized by any lock either, right? Do
> > we have to serialize it? As Johannes mentioned, if shrinker_maps has
> > not been initialized yet, it means the memcg is a newborn, so there
> > should not be a significant amount of reclaimable slab caches, and it is
> > fine to skip it. The point makes some sense to me.
> >
> > So, the read lock seems good enough.
>
> No, this is not so.
>
> Patch "[v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred" adds
> new assignments:
>
> +               info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
> +               info->nr_deferred = (atomic_long_t *)((unsigned long)info +
> +                                       sizeof(*info) + m_size);
>
> info->map and info->nr_deferred are not visible under READ lock in shrink_slab_memcg(),
> unless you use WRITE lock in memcg_alloc_shrinker_maps().

However, map and nr_deferred are assigned before
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new), and
shrink_slab_memcg() checks the shrinker_info pointer.
But that order might not be guaranteed, so it seems a memory barrier
before rcu_assign_pointer() should be good enough, right?
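
To make the ordering question concrete, here is a userspace sketch,
assuming C11 atomics as a model rather than the real kernel primitives.
The release store stands in for rcu_assign_pointer() (which, as far as I
know, is already implemented as a release store, so an extra barrier may
turn out to be unnecessary), and the acquire load stands in for the
reader's dereference. struct info_sketch, publish_info() and lookup_info()
are hypothetical names.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical model of the combined map + nr_deferred allocation. */
struct info_sketch {
	unsigned long *map;
	long *nr_deferred;
};

/* Models memcg->nodeinfo[nid]->shrinker_info for a single node. */
static _Atomic(struct info_sketch *) published;

/* Writer: fill in the embedded pointers first, publish last. */
static int publish_info(size_t map_words, size_t nr)
{
	struct info_sketch *info;

	info = calloc(1, sizeof(*info) +
			 map_words * sizeof(unsigned long) +
			 nr * sizeof(long));
	if (!info)
		return -1;

	info->map = (unsigned long *)(info + 1);
	info->nr_deferred = (long *)(info->map + map_words);

	/* Release: the two stores above are ordered before this one. */
	atomic_store_explicit(&published, info, memory_order_release);
	return 0;
}

/* Reader: a non-NULL result implies map and nr_deferred are set. */
static struct info_sketch *lookup_info(void)
{
	return atomic_load_explicit(&published, memory_order_acquire);
}
```

This only models visibility of the fields. It says nothing about whether
the pointer may be freed or replaced while a reader still uses it, which
is a separate question about the locking.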

>
> Nowhere in your patchset do you convert the READ lock to a WRITE lock in memcg_alloc_shrinker_maps().
>
> So, just use the correct lock in this patch from the start.
>
> >>
> >>>       }
> >>> -     mutex_unlock(&memcg_shrinker_map_mutex);
> >>> +     up_read(&shrinker_rwsem);
> >>>
> >>>       return ret;
> >>>  }
> >>> @@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
> >>>       if (size <= old_size)
> >>>               return 0;
> >>>
> >>> -     mutex_lock(&memcg_shrinker_map_mutex);
> >>>       if (!root_mem_cgroup)
> >>> -             goto unlock;
> >>> +             goto out;
> >>>
> >>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >>>       do {
> >>> @@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
> >>>               ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
> >>>               if (ret) {
> >>>                       mem_cgroup_iter_break(NULL, memcg);
> >>> -                     goto unlock;
> >>> +                     goto out;
> >>>               }
> >>>       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> >>> -unlock:
> >>> +out:
> >>>       if (!ret)
> >>>               memcg_shrinker_map_size = size;
> >>> -     mutex_unlock(&memcg_shrinker_map_mutex);
> >>> +
> >>>       return ret;
> >>>  }
> >>>
> >>>
> >>
> >>
>
>


* Re: [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code
  2021-01-07  0:13   ` Roman Gushchin
  2021-01-07 17:29     ` Yang Shi
@ 2021-01-11 19:00     ` Yang Shi
  2021-01-11 19:37       ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-11 19:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 4:14 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Jan 05, 2021 at 02:58:08PM -0800, Yang Shi wrote:
> > The shrinker map management is not really memcg specific, it's just allocation
>
> In the current form it doesn't look so, especially because each name
> has a memcg_ prefix and each function takes a memcg argument.
>
> It begs for some refactorings (Kirill suggested some) and renamings.

BTW, do you mean the suggestion about renaming memcg_shrinker_maps to
shrinker_maps? I just saw his email today since gmail filtered his
emails to SPAM :-(

>
> Thanks!


* Re: [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code
  2021-01-11 19:00     ` Yang Shi
@ 2021-01-11 19:37       ` Roman Gushchin
  2021-01-11 19:43         ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2021-01-11 19:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Jan 11, 2021 at 11:00:17AM -0800, Yang Shi wrote:
> On Wed, Jan 6, 2021 at 4:14 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Tue, Jan 05, 2021 at 02:58:08PM -0800, Yang Shi wrote:
> > > The shrinker map management is not really memcg specific, it's just allocation
> >
> > In the current form it doesn't look so, especially because each name
> > has a memcg_ prefix and each function takes a memcg argument.
> >
> > It begs for some refactorings (Kirill suggested some) and renamings.
> 
> BTW, do you mean the suggestion about renaming memcg_shrinker_maps to
> shrinker_maps? I just saw his email today since gmail filtered his
> emails to SPAM :-(

Yes.


* Re: [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code
  2021-01-11 19:37       ` Roman Gushchin
@ 2021-01-11 19:43         ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-11 19:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Jan 11, 2021 at 11:37 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Mon, Jan 11, 2021 at 11:00:17AM -0800, Yang Shi wrote:
> > On Wed, Jan 6, 2021 at 4:14 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Tue, Jan 05, 2021 at 02:58:08PM -0800, Yang Shi wrote:
> > > > The shrinker map management is not really memcg specific, it's just allocation
> > >
> > > In the current form it doesn't look so, especially because each name
> > > has a memcg_ prefix and each function takes a memcg argument.
> > >
> > > It begs for some refactorings (Kirill suggested some) and renamings.
> >
> > BTW, do you mean the suggestion about renaming memcg_shrinker_maps to
> > shrinker_maps? I just saw his email today since gmail filtered his
> > emails to SPAM :-(
>
> Yes.

Thanks for confirming, will do it in v4.


* Re: [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-11 18:57         ` Yang Shi
@ 2021-01-11 21:33           ` Kirill Tkhai
  2021-01-12 21:23             ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-11 21:33 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On 11.01.2021 21:57, Yang Shi wrote:
> On Mon, Jan 11, 2021 at 9:34 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 11.01.2021 20:08, Yang Shi wrote:
>>> On Wed, Jan 6, 2021 at 1:55 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> On 06.01.2021 01:58, Yang Shi wrote:
>>>>> Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
>>>>> exclusively, the read side can be protected by holding the read lock, so it seems
>>>>> superfluous to have a dedicated mutex.  This should not exacerbate contention
>>>>> on shrinker_rwsem since just one read-side critical section is added.
>>>>>
>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>>>> ---
>>>>>  mm/vmscan.c | 16 ++++++----------
>>>>>  1 file changed, 6 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 9db7b4d6d0ae..ddb9f972f856 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
>>>>>  #ifdef CONFIG_MEMCG
>>>>>
>>>>>  static int memcg_shrinker_map_size;
>>>>> -static DEFINE_MUTEX(memcg_shrinker_map_mutex);
>>>>>
>>>>>  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
>>>>>  {
>>>>> @@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
>>>>>       struct memcg_shrinker_map *new, *old;
>>>>>       int nid;
>>>>>
>>>>> -     lockdep_assert_held(&memcg_shrinker_map_mutex);
>>>>> -
>>>>>       for_each_node(nid) {
>>>>>               old = rcu_dereference_protected(
>>>>>                       mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
>>>>> @@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>>>>>       if (mem_cgroup_is_root(memcg))
>>>>>               return 0;
>>>>>
>>>>> -     mutex_lock(&memcg_shrinker_map_mutex);
>>>>> +     down_read(&shrinker_rwsem);
>>>>>       size = memcg_shrinker_map_size;
>>>>>       for_each_node(nid) {
>>>>>               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
>>>>> @@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>>>>>               }
>>>>>               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
>>>>
>>>> Here we do a STORE operation, and since we want the assignment to be visible
>>>> to shrink_slab_memcg() under down_read(), we have to use down_write()
>>>> in memcg_alloc_shrinker_maps().
>>>
>>> I apologize for the late reply, these emails went to my SPAM again.
>>
>> This is the second time the problem has appeared. Just add my email address to your allow list,
>> and the problem won't happen again.
> 
> Yes, I thought clicking "not spam" would add your email address to the
> allow list automatically. But it turns out that is not true.
> 
>>
>>> Before this patch it was not serialized by any lock either, right? Do
>>> we have to serialize it? As Johannes mentioned, if shrinker_maps has
>>> not been initialized yet, it means the memcg is a newborn, so there
>>> should not be a significant amount of reclaimable slab caches, and it is
>>> fine to skip it. The point makes some sense to me.
>>>
>>> So, the read lock seems good enough.
>>
>> No, this is not so.
>>
>> Patch "[v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred" adds
>> new assignments:
>>
>> +               info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
>> +               info->nr_deferred = (atomic_long_t *)((unsigned long)info +
>> +                                       sizeof(*info) + m_size);
>>
>> info->map and info->nr_deferred are not visible under READ lock in shrink_slab_memcg(),
>> unless you use WRITE lock in memcg_alloc_shrinker_maps().
> 
> However, map and nr_deferred are assigned before
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new), and
> shrink_slab_memcg() checks the shrinker_info pointer.
> But that order might not be guaranteed, so it seems a memory barrier
> before rcu_assign_pointer() should be good enough, right?

Yes, and here are some more:

1) shrink_slab_memcg() dereferences the pointer with rcu_dereference_protected(),
   but if we use the READ lock in memcg_alloc_shrinker_maps(), the dereference
   is not actually protected.

2) The READ lock makes memcg_alloc_shrinker_info() racy against memory allocation failure.
   memcg_alloc_shrinker_info()->memcg_free_shrinker_info() may free memory right
   after shrink_slab_memcg() has dereferenced it. You may say shrink_slab_memcg()->mem_cgroup_online()
   protects us from that? Yes, sure, but this is not something we want to have to remember
   in the future, since it breaks modularity.

Why don't we use the WRITE lock? Because it prohibits SLAB shrinking during memcg_alloc_shrinker_info()->kvzalloc()?
Yes, it does, but that is not a problem, since page cache is still shrinkable, and we are still able to
allocate memory. The WRITE lock means better modularity, and it lets us
avoid thinking about corner cases.
 
>>
>> Nowhere in your patchset do you convert the READ lock to a WRITE lock in memcg_alloc_shrinker_maps().
>>
>> So, just use the correct lock in this patch from the start.
>>
>>>>
>>>>>       }
>>>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
>>>>> +     up_read(&shrinker_rwsem);
>>>>>
>>>>>       return ret;
>>>>>  }
>>>>> @@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
>>>>>       if (size <= old_size)
>>>>>               return 0;
>>>>>
>>>>> -     mutex_lock(&memcg_shrinker_map_mutex);
>>>>>       if (!root_mem_cgroup)
>>>>> -             goto unlock;
>>>>> +             goto out;
>>>>>
>>>>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>>>>       do {
>>>>> @@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
>>>>>               ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
>>>>>               if (ret) {
>>>>>                       mem_cgroup_iter_break(NULL, memcg);
>>>>> -                     goto unlock;
>>>>> +                     goto out;
>>>>>               }
>>>>>       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>>>>> -unlock:
>>>>> +out:
>>>>>       if (!ret)
>>>>>               memcg_shrinker_map_size = size;
>>>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
>>>>> +
>>>>>       return ret;
>>>>>  }
>>>>>
>>>>>
>>>>
>>>>
>>
>>



* Re: [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered
  2021-01-11 18:17     ` Yang Shi
@ 2021-01-11 21:37       ` Kirill Tkhai
  2021-01-12 20:58         ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-11 21:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On 11.01.2021 21:17, Yang Shi wrote:
> On Wed, Jan 6, 2021 at 2:22 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 06.01.2021 01:58, Yang Shi wrote:
>>> Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
>>> This approach is fine with nr_deferred at the shrinker level, but the following
>>> patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
>>> shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
>>> from unregistering correctly.
>>>
>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>> ---
>>>  include/linux/shrinker.h |  7 ++++---
>>>  mm/vmscan.c              | 13 +++++++++----
>>>  2 files changed, 13 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>>> index 0f80123650e2..1eac79ce57d4 100644
>>> --- a/include/linux/shrinker.h
>>> +++ b/include/linux/shrinker.h
>>> @@ -79,13 +79,14 @@ struct shrinker {
>>>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>>>
>>>  /* Flags */
>>> -#define SHRINKER_NUMA_AWARE  (1 << 0)
>>> -#define SHRINKER_MEMCG_AWARE (1 << 1)
>>> +#define SHRINKER_REGISTERED  (1 << 0)
>>> +#define SHRINKER_NUMA_AWARE  (1 << 1)
>>> +#define SHRINKER_MEMCG_AWARE (1 << 2)
>>>  /*
>>>   * It just makes sense when the shrinker is also MEMCG_AWARE for now,
>>>   * non-MEMCG_AWARE shrinker should not have this flag set.
>>>   */
>>> -#define SHRINKER_NONSLAB     (1 << 2)
>>> +#define SHRINKER_NONSLAB     (1 << 3)
>>>
>>>  extern int prealloc_shrinker(struct shrinker *shrinker);
>>>  extern void register_shrinker_prepared(struct shrinker *shrinker);
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 8da765a85569..9761c7c27412 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -494,6 +494,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>>>       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>               idr_replace(&shrinker_idr, shrinker, shrinker->id);
>>>  #endif
>>> +     shrinker->flags |= SHRINKER_REGISTERED;
>>
>> If we introduce this new flag, we should kill the old SHRINKER_REGISTERING flag,
>> which is not needed anymore (we should use the new flag instead).
> 
> The only thing that I'm confused about is the check in
> shrink_slab_memcg, it does:
> 
> shrinker = idr_find(&shrinker_idr, i);
> if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
> 
> When allocating the idr, the shrinker is associated with
> SHRINKER_REGISTERING. But shrink_slab_memcg does acquire the read
> shrinker_rwsem, and idr_alloc is called while holding the write
> shrinker_rwsem, so I suppose shrink_slab_memcg should never see
> a shrinker that is still registering.

After prealloc_shrinker() the shrinker is visible to shrink_slab_memcg().
This is the moment shrink_slab_memcg() sees SHRINKER_REGISTERED.

> If so, it seems easy to remove
> SHRINKER_REGISTERING.
> 
> We just need to change that check to:
> !shrinker || !(shrinker->flags & SHRINKER_REGISTERED)
> 
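
The proposed check can be modelled in isolation as below. This is a
sketch, not the patch: struct shrinker_sketch and shrinker_should_skip()
are hypothetical, the flag values are the ones from the hunk quoted
above, and it assumes the idr slot always holds either NULL or a valid
shrinker pointer (a placeholder value that is not a real pointer could
not be dereferenced this way).

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values as renumbered by the quoted patch. */
#define SHRINKER_REGISTERED	(1 << 0)
#define SHRINKER_NUMA_AWARE	(1 << 1)
#define SHRINKER_MEMCG_AWARE	(1 << 2)

/* Hypothetical stand-in carrying only the field the check needs. */
struct shrinker_sketch {
	unsigned long flags;
};

/* True when reclaim should skip this idr slot. */
static bool shrinker_should_skip(const struct shrinker_sketch *shrinker)
{
	return !shrinker || !(shrinker->flags & SHRINKER_REGISTERED);
}
```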
>>>       up_write(&shrinker_rwsem);
>>>  }
>>>
>>> @@ -513,13 +514,17 @@ EXPORT_SYMBOL(register_shrinker);
>>>   */
>>>  void unregister_shrinker(struct shrinker *shrinker)
>>>  {
>>> -     if (!shrinker->nr_deferred)
>>> -             return;
>>> -     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>> -             unregister_memcg_shrinker(shrinker);
>>>       down_write(&shrinker_rwsem);
>>
>> I do not think there are any users whose registration may race with unregistration.
>> So, I think we should check SHRINKER_REGISTERED unlocked, similar to how we used to check
>> shrinker->nr_deferred unlocked.
> 
> Yes, I agree.
> 
>>
>>> +     if (!(shrinker->flags & SHRINKER_REGISTERED)) {
>>> +             up_write(&shrinker_rwsem);
>>> +             return;
>>> +     }
>>>       list_del(&shrinker->list);
>>> +     shrinker->flags &= ~SHRINKER_REGISTERED;
>>>       up_write(&shrinker_rwsem);
>>> +
>>> +     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>> +             unregister_memcg_shrinker(shrinker);
>>>       kfree(shrinker->nr_deferred);
>>>       shrinker->nr_deferred = NULL;
>>>  }
>>>
>>
>>



* Re: [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2021-01-11 18:40     ` Yang Shi
@ 2021-01-11 21:57       ` Kirill Tkhai
  0 siblings, 0 replies; 43+ messages in thread
From: Kirill Tkhai @ 2021-01-11 21:57 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On 11.01.2021 21:40, Yang Shi wrote:
> On Wed, Jan 6, 2021 at 3:16 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 06.01.2021 01:58, Yang Shi wrote:
>>> Now nr_deferred is available at the per-memcg level for memcg aware shrinkers, so there is no need to
>>> allocate shrinker->nr_deferred for such shrinkers anymore.
>>>
>>> The prealloc_memcg_shrinker() would return -ENOSYS if !CONFIG_MEMCG or memcg is disabled
>>> by kernel command line, then shrinker's SHRINKER_MEMCG_AWARE flag would be cleared.
>>> This makes the implementation of this patch simpler.
>>>
>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>> ---
>>>  mm/vmscan.c | 33 ++++++++++++++++++---------------
>>>  1 file changed, 18 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index f20ed8e928c2..d9795fb0f1c5 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -340,6 +340,9 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>>>  {
>>>       int id, ret = -ENOMEM;
>>>
>>> +     if (mem_cgroup_disabled())
>>> +             return -ENOSYS;
>>> +
>>>       down_write(&shrinker_rwsem);
>>>       /* This may call shrinker, so it must use down_read_trylock() */
>>>       id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
>>> @@ -424,7 +427,7 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>>>  #else
>>>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>>>  {
>>> -     return 0;
>>> +     return -ENOSYS;
>>>  }
>>>
>>>  static void unregister_memcg_shrinker(struct shrinker *shrinker)
>>> @@ -535,8 +538,20 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
>>>   */
>>>  int prealloc_shrinker(struct shrinker *shrinker)
>>>  {
>>> -     unsigned int size = sizeof(*shrinker->nr_deferred);
>>> +     unsigned int size;
>>> +     int err;
>>> +
>>> +     if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
>>> +             err = prealloc_memcg_shrinker(shrinker);
>>> +             if (!err)
>>> +                     return 0;
>>> +             if (err != -ENOSYS)
>>> +                     return err;
>>> +
>>> +             shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
>>
>> This looks very confusing.
>>
>> If you want to disable the preallocation branch for the !MEMCG case,
>> you should first consider something like the below:
> 
> Not only the !CONFIG_MEMCG case, but also the "cgroup_disable=memory" case.
> 
>>
>> #ifdef CONFIG_MEMCG
>> #define SHRINKER_MEMCG_AWARE    (1 << 2)
>> #else
>> #define SHRINKER_MEMCG_AWARE    0
>> #endif
> 
> This could handle the !CONFIG_MEMCG case, but can't deal with the
> "cgroup_disable=memory" case. We could consider checking
> mem_cgroup_disabled() when initializing the shrinker, but this may result
> in touching fs code like below:
> 
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -266,7 +266,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>         s->s_shrink.scan_objects = super_cache_scan;
>         s->s_shrink.count_objects = super_cache_count;
>         s->s_shrink.batch = 1024;
> -       s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
> +       s->s_shrink.flags = SHRINKER_NUMA_AWARE;
> +       if (!mem_cgroup_disabled())
> +               s->s_shrink.flags |= SHRINKER_MEMCG_AWARE;
>         if (prealloc_shrinker(&s->s_shrink))
>                 goto fail;
>         if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))

Oh. If so, then the initial variant was better.
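
For the record, the control flow of that initial variant can be modelled
in userspace roughly as follows. Everything here is a hypothetical
sketch: memcg_disabled_sketch stands in for mem_cgroup_disabled(), and
the *_sketch functions only imitate the shape of the quoted hunk.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define SHRINKER_NUMA_AWARE	(1 << 1)
#define SHRINKER_MEMCG_AWARE	(1 << 2)

struct shrinker_sketch {
	unsigned long flags;
	long *nr_deferred;
};

/* Models !CONFIG_MEMCG or "cgroup_disable=memory". */
static bool memcg_disabled_sketch;

static int prealloc_memcg_sketch(struct shrinker_sketch *s)
{
	(void)s;
	return memcg_disabled_sketch ? -ENOSYS : 0;
}

static int prealloc_sketch(struct shrinker_sketch *s, unsigned int nr_nodes)
{
	size_t n = 1;

	if (s->flags & SHRINKER_MEMCG_AWARE) {
		int err = prealloc_memcg_sketch(s);

		/* Success (0) or a real failure: nothing more to do. */
		if (err != -ENOSYS)
			return err;

		/* memcg unavailable: fall back to a plain shrinker. */
		s->flags &= ~SHRINKER_MEMCG_AWARE;
	}

	if (s->flags & SHRINKER_NUMA_AWARE)
		n = nr_nodes;

	s->nr_deferred = calloc(n, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -ENOMEM;
}
```

The point is that the caller (e.g. fs code) keeps setting
SHRINKER_MEMCG_AWARE unconditionally, and the fallback is handled in one
place.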

>>
>>> +     }
>>>
>>> +     size = sizeof(*shrinker->nr_deferred);
>>>       if (shrinker->flags & SHRINKER_NUMA_AWARE)
>>>               size *= nr_node_ids;
>>>
>>> @@ -544,26 +559,14 @@ int prealloc_shrinker(struct shrinker *shrinker)
>>>       if (!shrinker->nr_deferred)
>>>               return -ENOMEM;
>>>
>>> -     if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
>>> -             if (prealloc_memcg_shrinker(shrinker))
>>> -                     goto free_deferred;
>>> -     }
>>>
>>>       return 0;
>>> -
>>> -free_deferred:
>>> -     kfree(shrinker->nr_deferred);
>>> -     shrinker->nr_deferred = NULL;
>>> -     return -ENOMEM;
>>>  }
>>>
>>>  void free_prealloced_shrinker(struct shrinker *shrinker)
>>>  {
>>> -     if (!shrinker->nr_deferred)
>>> -             return;
>>> -
>>>       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>> -             unregister_memcg_shrinker(shrinker);
>>> +             return unregister_memcg_shrinker(shrinker);
>>>
>>>       kfree(shrinker->nr_deferred);
>>>       shrinker->nr_deferred = NULL;
>>>
>>
>>



* Re: [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered
  2021-01-11 21:37       ` Kirill Tkhai
@ 2021-01-12 20:58         ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-12 20:58 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Jan 11, 2021 at 1:38 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 11.01.2021 21:17, Yang Shi wrote:
> > On Wed, Jan 6, 2021 at 2:22 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 06.01.2021 01:58, Yang Shi wrote:
> >>> Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
> >>> This approach is fine with nr_deferred at the shrinker level, but the following
> >>> patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> >>> shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> >>> from unregistering correctly.
> >>>
> >>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> >>> ---
> >>>  include/linux/shrinker.h |  7 ++++---
> >>>  mm/vmscan.c              | 13 +++++++++----
> >>>  2 files changed, 13 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> >>> index 0f80123650e2..1eac79ce57d4 100644
> >>> --- a/include/linux/shrinker.h
> >>> +++ b/include/linux/shrinker.h
> >>> @@ -79,13 +79,14 @@ struct shrinker {
> >>>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> >>>
> >>>  /* Flags */
> >>> -#define SHRINKER_NUMA_AWARE  (1 << 0)
> >>> -#define SHRINKER_MEMCG_AWARE (1 << 1)
> >>> +#define SHRINKER_REGISTERED  (1 << 0)
> >>> +#define SHRINKER_NUMA_AWARE  (1 << 1)
> >>> +#define SHRINKER_MEMCG_AWARE (1 << 2)
> >>>  /*
> >>>   * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> >>>   * non-MEMCG_AWARE shrinker should not have this flag set.
> >>>   */
> >>> -#define SHRINKER_NONSLAB     (1 << 2)
> >>> +#define SHRINKER_NONSLAB     (1 << 3)
> >>>
> >>>  extern int prealloc_shrinker(struct shrinker *shrinker);
> >>>  extern void register_shrinker_prepared(struct shrinker *shrinker);
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 8da765a85569..9761c7c27412 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -494,6 +494,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> >>>       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> >>>               idr_replace(&shrinker_idr, shrinker, shrinker->id);
> >>>  #endif
> >>> +     shrinker->flags |= SHRINKER_REGISTERED;
> >>
> >> If we introduce this new flag, we should kill the old SHRINKER_REGISTERING flag,
> >> which is not needed anymore (we should use the new flag instead).
> >
> > The only thing that I'm confused about is the check in
> > shrink_slab_memcg, it does:
> >
> > shrinker = idr_find(&shrinker_idr, i);
> > if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
> >
> > When allocating the idr, the shrinker is associated with
> > SHRINKER_REGISTERING. But shrink_slab_memcg does acquire the read
> > shrinker_rwsem, and idr_alloc is called while holding the write
> > shrinker_rwsem, so I suppose shrink_slab_memcg should never see
> > a shrinker that is still registering.
>
> After prealloc_shrinker() the shrinker is visible to shrink_slab_memcg().
> This is the moment shrink_slab_memcg() sees SHRINKER_REGISTERED.

Yes, this is exactly what I supposed.

>
> > If so, it seems easy to remove
> > SHRINKER_REGISTERING.
> >
> > We just need to change that check to:
> > !shrinker || !(shrinker->flags & SHRINKER_REGISTERED)
> >
> >>>       up_write(&shrinker_rwsem);
> >>>  }
> >>>
> >>> @@ -513,13 +514,17 @@ EXPORT_SYMBOL(register_shrinker);
> >>>   */
> >>>  void unregister_shrinker(struct shrinker *shrinker)
> >>>  {
> >>> -     if (!shrinker->nr_deferred)
> >>> -             return;
> >>> -     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> >>> -             unregister_memcg_shrinker(shrinker);
> >>>       down_write(&shrinker_rwsem);
> >>
> >> I do not think there are any users whose registration may race with unregistration.
> >> So, I think we should check SHRINKER_REGISTERED unlocked, similar to how we used to check
> >> shrinker->nr_deferred unlocked.
> >
> > Yes, I agree.
> >
> >>
> >>> +     if (!(shrinker->flags & SHRINKER_REGISTERED)) {
> >>> +             up_write(&shrinker_rwsem);
> >>> +             return;
> >>> +     }
> >>>       list_del(&shrinker->list);
> >>> +     shrinker->flags &= ~SHRINKER_REGISTERED;
> >>>       up_write(&shrinker_rwsem);
> >>> +
> >>> +     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> >>> +             unregister_memcg_shrinker(shrinker);
> >>>       kfree(shrinker->nr_deferred);
> >>>       shrinker->nr_deferred = NULL;
> >>>  }
> >>>
> >>
> >>
>
>


* Re: [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-11 21:33           ` Kirill Tkhai
@ 2021-01-12 21:23             ` Yang Shi
  2021-01-13 18:16               ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2021-01-12 21:23 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Jan 11, 2021 at 1:34 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 11.01.2021 21:57, Yang Shi wrote:
> > On Mon, Jan 11, 2021 at 9:34 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 11.01.2021 20:08, Yang Shi wrote:
> >>> On Wed, Jan 6, 2021 at 1:55 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>>>
> >>>> On 06.01.2021 01:58, Yang Shi wrote:
> >>>>> Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem
> >>>>> exclusively, the read side can be protected by holding the read lock, so it seems
> >>>>> superfluous to have a dedicated mutex.  This should not exacerbate contention
> >>>>> on shrinker_rwsem since just one read-side critical section is added.
> >>>>>
> >>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> >>>>> ---
> >>>>>  mm/vmscan.c | 16 ++++++----------
> >>>>>  1 file changed, 6 insertions(+), 10 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>> index 9db7b4d6d0ae..ddb9f972f856 100644
> >>>>> --- a/mm/vmscan.c
> >>>>> +++ b/mm/vmscan.c
> >>>>> @@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >>>>>  #ifdef CONFIG_MEMCG
> >>>>>
> >>>>>  static int memcg_shrinker_map_size;
> >>>>> -static DEFINE_MUTEX(memcg_shrinker_map_mutex);
> >>>>>
> >>>>>  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> >>>>>  {
> >>>>> @@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> >>>>>       struct memcg_shrinker_map *new, *old;
> >>>>>       int nid;
> >>>>>
> >>>>> -     lockdep_assert_held(&memcg_shrinker_map_mutex);
> >>>>> -
> >>>>>       for_each_node(nid) {
> >>>>>               old = rcu_dereference_protected(
> >>>>>                       mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> >>>>> @@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >>>>>       if (mem_cgroup_is_root(memcg))
> >>>>>               return 0;
> >>>>>
> >>>>> -     mutex_lock(&memcg_shrinker_map_mutex);
> >>>>> +     down_read(&shrinker_rwsem);
> >>>>>       size = memcg_shrinker_map_size;
> >>>>>       for_each_node(nid) {
> >>>>>               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> >>>>> @@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >>>>>               }
> >>>>>               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
> >>>>
> >>>> Here we do STORE operation, and since we want the assignment is visible
> >>>> for shrink_slab_memcg() under down_read(), we have to use down_write()
> >>>> in memcg_alloc_shrinker_maps().
> >>>
> >>> I apologize for the late reply, these emails went to my SPAM again.
> >>
> >> This is the second time the problem appeared. Just add my email address to allow list,
> >> and there won't be this problem again.
> >
> > Yes, I thought clicking "not spam" would add your email address to the
> > allow list automatically. But it turns out not true.
> >
> >>
> >>> Before this patch it was not serialized by any lock either, right? Do
> >>> we have to serialize it? As Johannes mentioned if shrinker_maps has
> >>> not been initialized yet, it means the memcg is a newborn, there
> >>> should not be significant amount of reclaimable slab caches, so it is
> >>> fine to skip it. The point makes some sense to me.
> >>>
> >>> So, the read lock seems good enough.
> >>
> >> No, this is not so.
> >>
> >> Patch "[v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred" adds
> >> new assignments:
> >>
> >> +               info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
> >> +               info->nr_deferred = (atomic_long_t *)((unsigned long)info +
> >> +                                       sizeof(*info) + m_size);
> >>
> >> info->map and info->nr_deferred are not visible under READ lock in shrink_slab_memcg(),
> >> unless you use WRITE lock in memcg_alloc_shrinker_maps().
> >
> > However map and nr_deferred are assigned before
> > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new). The
> > shrink_slab_memcg() checks shrinker_info pointer.
> > But that order might be not guaranteed, so it seems a memory barrier
> > before rcu_assign_pointer should be good enough, right?
>
> Yes, and here are some more:
>
> 1)There is rcu_dereference_protected() dereferrencing in rcu_dereference_protected(),
>   but in case of we use READ lock in memcg_alloc_shrinker_maps(), the dereferrencing
>   is not actually protected.
>
> 2)READ lock makes memcg_alloc_shrinker_info() racy against memory allocation fail.
>   memcg_alloc_shrinker_info()->memcg_free_shrinker_info() may free memory right
>   after shrink_slab_memcg() dereferenced it. You may say shrink_slab_memcg()->mem_cgroup_online()
>   protects us from it?! Yes, sure, but this is not the thing we want to remember
>   in the future, since this spreads modularity.
>
> Why don't we use WRITE lock? It prohibits shrinking of SLAB during memcg_alloc_shrinker_info()->kvzalloc()?

Yes, it is the main concern.

> Yes, but it is not a problem, since page cache is still shrinkable, and we are able to
> allocate memory. WRITE lock means better modularity, and it gives us a possibility
> not to think about corner cases.

I do agree that using the write lock makes life easier. I'm just not sure
how bad the impact would be, particularly with a vfs metadata heavy
workload (where most memory is consumed by slab cache rather than page
cache). But I think I can design a simple test case that generates global
memory pressure with slab cache (e.g. negative dentry cache), then
creates a significant number of memcgs (e.g. 10k), and then checks whether
the memcg creation time is lengthened.

>
> >>
> >> Nowhere in your patchset you convert READ lock to WRITE lock in memcg_alloc_shrinker_maps().
> >>
> >> So, just use the true lock in this patch from the first time.
> >>
> >>>>
> >>>>>       }
> >>>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
> >>>>> +     up_read(&shrinker_rwsem);
> >>>>>
> >>>>>       return ret;
> >>>>>  }
> >>>>> @@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
> >>>>>       if (size <= old_size)
> >>>>>               return 0;
> >>>>>
> >>>>> -     mutex_lock(&memcg_shrinker_map_mutex);
> >>>>>       if (!root_mem_cgroup)
> >>>>> -             goto unlock;
> >>>>> +             goto out;
> >>>>>
> >>>>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >>>>>       do {
> >>>>> @@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
> >>>>>               ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
> >>>>>               if (ret) {
> >>>>>                       mem_cgroup_iter_break(NULL, memcg);
> >>>>> -                     goto unlock;
> >>>>> +                     goto out;
> >>>>>               }
> >>>>>       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> >>>>> -unlock:
> >>>>> +out:
> >>>>>       if (!ret)
> >>>>>               memcg_shrinker_map_size = size;
> >>>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
> >>>>> +
> >>>>>       return ret;
> >>>>>  }
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>

* Re: [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
  2021-01-12 21:23             ` Yang Shi
@ 2021-01-13 18:16               ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-13 18:16 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Tue, Jan 12, 2021 at 1:23 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Mon, Jan 11, 2021 at 1:34 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >
> > On 11.01.2021 21:57, Yang Shi wrote:
> > > On Mon, Jan 11, 2021 at 9:34 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> > >>
> > >> On 11.01.2021 20:08, Yang Shi wrote:
> > >>> On Wed, Jan 6, 2021 at 1:55 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> > >>>>
> > >>>> On 06.01.2021 01:58, Yang Shi wrote:
> > >>>>> Since memcg_shrinker_map_size just can be changd under holding shrinker_rwsem
> > >>>>> exclusively, the read side can be protected by holding read lock, so it sounds
> > >>>>> superfluous to have a dedicated mutex.  This should not exacerbate the contention
> > >>>>> to shrinker_rwsem since just one read side critical section is added.
> > >>>>>
> > >>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> > >>>>> ---
> > >>>>>  mm/vmscan.c | 16 ++++++----------
> > >>>>>  1 file changed, 6 insertions(+), 10 deletions(-)
> > >>>>>
> > >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >>>>> index 9db7b4d6d0ae..ddb9f972f856 100644
> > >>>>> --- a/mm/vmscan.c
> > >>>>> +++ b/mm/vmscan.c
> > >>>>> @@ -187,7 +187,6 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > >>>>>  #ifdef CONFIG_MEMCG
> > >>>>>
> > >>>>>  static int memcg_shrinker_map_size;
> > >>>>> -static DEFINE_MUTEX(memcg_shrinker_map_mutex);
> > >>>>>
> > >>>>>  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> > >>>>>  {
> > >>>>> @@ -200,8 +199,6 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
> > >>>>>       struct memcg_shrinker_map *new, *old;
> > >>>>>       int nid;
> > >>>>>
> > >>>>> -     lockdep_assert_held(&memcg_shrinker_map_mutex);
> > >>>>> -
> > >>>>>       for_each_node(nid) {
> > >>>>>               old = rcu_dereference_protected(
> > >>>>>                       mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
> > >>>>> @@ -250,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> > >>>>>       if (mem_cgroup_is_root(memcg))
> > >>>>>               return 0;
> > >>>>>
> > >>>>> -     mutex_lock(&memcg_shrinker_map_mutex);
> > >>>>> +     down_read(&shrinker_rwsem);
> > >>>>>       size = memcg_shrinker_map_size;
> > >>>>>       for_each_node(nid) {
> > >>>>>               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> > >>>>> @@ -261,7 +258,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> > >>>>>               }
> > >>>>>               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
> > >>>>
> > >>>> Here we do STORE operation, and since we want the assignment is visible
> > >>>> for shrink_slab_memcg() under down_read(), we have to use down_write()
> > >>>> in memcg_alloc_shrinker_maps().
> > >>>
> > >>> I apologize for the late reply, these emails went to my SPAM again.
> > >>
> > >> This is the second time the problem appeared. Just add my email address to allow list,
> > >> and there won't be this problem again.
> > >
> > > Yes, I thought clicking "not spam" would add your email address to the
> > > allow list automatically. But it turns out not true.
> > >
> > >>
> > >>> Before this patch it was not serialized by any lock either, right? Do
> > >>> we have to serialize it? As Johannes mentioned if shrinker_maps has
> > >>> not been initialized yet, it means the memcg is a newborn, there
> > >>> should not be significant amount of reclaimable slab caches, so it is
> > >>> fine to skip it. The point makes some sense to me.
> > >>>
> > >>> So, the read lock seems good enough.
> > >>
> > >> No, this is not so.
> > >>
> > >> Patch "[v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred" adds
> > >> new assignments:
> > >>
> > >> +               info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
> > >> +               info->nr_deferred = (atomic_long_t *)((unsigned long)info +
> > >> +                                       sizeof(*info) + m_size);
> > >>
> > >> info->map and info->nr_deferred are not visible under READ lock in shrink_slab_memcg(),
> > >> unless you use WRITE lock in memcg_alloc_shrinker_maps().
> > >
> > > However map and nr_deferred are assigned before
> > > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new). The
> > > shrink_slab_memcg() checks shrinker_info pointer.
> > > But that order might be not guaranteed, so it seems a memory barrier
> > > before rcu_assign_pointer should be good enough, right?
> >
> > Yes, and here are some more:
> >
> > 1)There is rcu_dereference_protected() dereferrencing in rcu_dereference_protected(),
> >   but in case of we use READ lock in memcg_alloc_shrinker_maps(), the dereferrencing
> >   is not actually protected.
> >
> > 2)READ lock makes memcg_alloc_shrinker_info() racy against memory allocation fail.
> >   memcg_alloc_shrinker_info()->memcg_free_shrinker_info() may free memory right
> >   after shrink_slab_memcg() dereferenced it. You may say shrink_slab_memcg()->mem_cgroup_online()
> >   protects us from it?! Yes, sure, but this is not the thing we want to remember
> >   in the future, since this spreads modularity.
> >
> > Why don't we use WRITE lock? It prohibits shrinking of SLAB during memcg_alloc_shrinker_info()->kvzalloc()?
>
> Yes, it is the main concern.
>
> > Yes, but it is not a problem, since page cache is still shrinkable, and we are able to
> > allocate memory. WRITE lock means better modularity, and it gives us a possibility
> > not to think about corner cases.
>
> I do agree that using the write lock makes life easier. I'm just not sure
> how bad the impact would be, particularly with a vfs metadata heavy
> workload (where most memory is consumed by slab cache rather than page
> cache). But I think I can design a simple test case that generates global
> memory pressure with slab cache (e.g. negative dentry cache), then
> creates a significant number of memcgs (e.g. 10k), and then checks whether
> the memcg creation time is lengthened.

I ran a test on a VM with two nodes (80 CPUs) and 16GB of memory. The test
first does the following:
* Generate negative dentry cache from all CPUs to fill up the memory
* Run a kernel build with 80 processes

Memory fills up and multiple reclaimers should be running in parallel
(at least 2 kswapd processes, at most 80 reclaimers). The test then
creates 10K memcgs (memcg creation needs to allocate shrinker_info
while holding shrinker_rwsem).

The result is:

Read lock
real    7m17.891s
user    0m28.061s
sys     2m33.170s

Write lock
real    7m5.431s
user    0m20.400s
sys     2m53.162s

The run with the write lock has a longer sys time. That should not be
caused by lock contention since the lock is a rwsem; it might instead be
spending more time reclaiming pages. It did have a slightly shorter wall
time, though, and no OOMs happened either.

So it seems using the write lock didn't have a noticeable impact.
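
For illustration only, the memcg-creation step of a test like this could be
driven by a small userspace helper along the following lines. This is a
sketch under assumptions, not the program that produced the numbers above:
the function names, the directory naming scheme, and the idea of timing the
mkdir loop on its own are all hypothetical. It relies only on the fact that
creating a directory in a cgroup filesystem creates a cgroup; the mount point
is passed in by the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/*
 * Create `count` directories named <base>/<prefix><i>.  On a cgroup
 * filesystem each mkdir creates one cgroup.  Returns the number of
 * directories that exist afterwards, stopping at the first hard failure.
 * (Hypothetical helper, not taken from the thread.)
 */
static int create_groups(const char *base, const char *prefix, int count)
{
	char path[4096];
	int i;

	assert(base && prefix && count >= 0);
	for (i = 0; i < count; i++) {
		snprintf(path, sizeof(path), "%s/%s%d", base, prefix, i);
		/* An already existing directory is tolerated so reruns work. */
		if (mkdir(path, 0755) != 0 && errno != EEXIST)
			break;
	}
	return i;
}

/* Wall-clock seconds spent creating the groups, or -1.0 on failure. */
static double time_group_creation(const char *base, const char *prefix,
				  int count)
{
	struct timespec start, end;

	clock_gettime(CLOCK_MONOTONIC, &start);
	if (create_groups(base, prefix, count) != count)
		return -1.0;
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_group_creation() with the memory cgroup mount point and a count
of 10000, once per kernel under test, would give the creation time to compare
between the read lock and write lock variants.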

>
> >
> > >>
> > >> Nowhere in your patchset you convert READ lock to WRITE lock in memcg_alloc_shrinker_maps().
> > >>
> > >> So, just use the true lock in this patch from the first time.
> > >>
> > >>>>
> > >>>>>       }
> > >>>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
> > >>>>> +     up_read(&shrinker_rwsem);
> > >>>>>
> > >>>>>       return ret;
> > >>>>>  }
> > >>>>> @@ -276,9 +273,8 @@ static int memcg_expand_shrinker_maps(int new_id)
> > >>>>>       if (size <= old_size)
> > >>>>>               return 0;
> > >>>>>
> > >>>>> -     mutex_lock(&memcg_shrinker_map_mutex);
> > >>>>>       if (!root_mem_cgroup)
> > >>>>> -             goto unlock;
> > >>>>> +             goto out;
> > >>>>>
> > >>>>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > >>>>>       do {
> > >>>>> @@ -287,13 +283,13 @@ static int memcg_expand_shrinker_maps(int new_id)
> > >>>>>               ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
> > >>>>>               if (ret) {
> > >>>>>                       mem_cgroup_iter_break(NULL, memcg);
> > >>>>> -                     goto unlock;
> > >>>>> +                     goto out;
> > >>>>>               }
> > >>>>>       } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> > >>>>> -unlock:
> > >>>>> +out:
> > >>>>>       if (!ret)
> > >>>>>               memcg_shrinker_map_size = size;
> > >>>>> -     mutex_unlock(&memcg_shrinker_map_mutex);
> > >>>>> +
> > >>>>>       return ret;
> > >>>>>  }
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>
> > >>
> >
> >

* Re: [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred
  2021-01-06 11:06   ` Kirill Tkhai
  2021-01-11 18:24     ` Yang Shi
@ 2021-01-13 23:30     ` Yang Shi
  1 sibling, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-13 23:30 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 3:07 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> >
> > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > may suffer from over shrink, excessive reclaim latency, etc.
> >
> > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> >
> > We observed this hit in our production environment which was running vfs heavy workload
> > shown as the below tracing log:
> >
> > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > cache items 246404277 delta 31345 total_scan 123202138
> > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > last shrinker return val 123186855
> >
> > The vfs cache and page cache ration was 10:1 on this machine, and half of caches were dropped.
> > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> >
> > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > better isolation.
> >
> > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  7 +++---
> >  mm/vmscan.c                | 49 +++++++++++++++++++++++++-------------
> >  2 files changed, 37 insertions(+), 19 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e05bbe8277cc..5599082df623 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -93,12 +93,13 @@ struct lruvec_stat {
> >  };
> >
> >  /*
> > - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> > - * which have elements charged to this memcg.
> > + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> > + * shrinkers, which have elements charged to this memcg.
> >   */
> >  struct memcg_shrinker_info {
> >       struct rcu_head rcu;
> > -     unsigned long map[];
> > +     unsigned long *map;
> > +     atomic_long_t *nr_deferred;
> >  };
> >
> >  /*
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0033659abf9e..72259253e414 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -193,10 +193,12 @@ static void memcg_free_shrinker_info_rcu(struct rcu_head *head)
> >  }
> >
> >  static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
> > -                                       int size, int old_size)
> > +                                       int m_size, int d_size,
> > +                                       int old_m_size, int old_d_size)
> >  {
> >       struct memcg_shrinker_info *new, *old;
> >       int nid;
> > +     int size = m_size + d_size;
> >
> >       for_each_node(nid) {
> >               old = rcu_dereference_protected(
> > @@ -209,9 +211,18 @@ static int memcg_expand_one_shrinker_info(struct mem_cgroup *memcg,
> >               if (!new)
> >                       return -ENOMEM;
> >
> > -             /* Set all old bits, clear all new bits */
> > -             memset(new->map, (int)0xff, old_size);
> > -             memset((void *)new->map + old_size, 0, size - old_size);
> > +             new->map = (unsigned long *)((unsigned long)new + sizeof(*new));
> > +             new->nr_deferred = (atomic_long_t *)((unsigned long)new +
> > +                                     sizeof(*new) + m_size);
>
> Can't we write this more compact?
>
>                 new->map = (unsigned long *)(new + 1);
>                 new->nr_deferred = (atomic_long_t)(new->map + 1);

Looking at this again, the second line looks wrong. The layout should be:

        -----------------------------
        | struct shrinker_info      |
        -----------------------------
        | map array                 |
        -----------------------------
        | nr_deferred array         |
        -----------------------------

new->map is the pointer to the map array and its type is "unsigned long *",
so "new->map + 1" points just one "unsigned long" past the start of the
map. But the map array may occupy more than one "unsigned long", so
nr_deferred would overlap the map and the two arrays would corrupt each
other.

I think we could use "new->map + (shrinker_nr_max / BITS_PER_LONG) + 1".
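
For illustration only, here is a userspace sketch of the layout described
above. It is not the kernel code: struct info_sketch, map_longs() and
alloc_info_sketch() are hypothetical names, the rcu_head is left out, and
plain long stands in for atomic_long_t. The offset of nr_deferred is
computed with DIV_ROUND_UP so that it matches the way m_size is computed in
the patch; whichever expression is used, it has to agree with the map size
that was actually allocated.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Userspace stand-in for struct memcg_shrinker_info (hypothetical). */
struct info_sketch {
	unsigned long *map;	/* bitmap, one bit per shrinker id */
	long *nr_deferred;	/* one counter per shrinker id */
};

/* Number of unsigned longs needed to hold nr_max bits. */
static size_t map_longs(size_t nr_max)
{
	return DIV_ROUND_UP(nr_max, BITS_PER_LONG);
}

/* One zeroed allocation: header, then map array, then nr_deferred array. */
static struct info_sketch *alloc_info_sketch(size_t nr_max)
{
	size_t m_size = map_longs(nr_max) * sizeof(unsigned long);
	size_t d_size = nr_max * sizeof(long);
	struct info_sketch *info;

	assert(nr_max > 0);
	info = calloc(1, sizeof(*info) + m_size + d_size);
	if (!info)
		return NULL;

	/* The map starts right after the header... */
	info->map = (unsigned long *)(info + 1);
	/* ...and nr_deferred starts after the whole map, not after one long. */
	info->nr_deferred = (long *)(info->map + map_longs(nr_max));
	return info;
}
```

With nr_max = 130 on a 64-bit build the map takes three unsigned longs, so
"map + 1" would land inside the map, while the offset above lands on the
first counter.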

>
> > +
> > +             /* map: set all old bits, clear all new bits */
> > +             memset(new->map, (int)0xff, old_m_size);
> > +             memset((void *)new->map + old_m_size, 0, m_size - old_m_size);
> > +             /* nr_deferred: copy old values, clear all new values */
> > +             memcpy((void *)new->nr_deferred, (void *)old->nr_deferred,
> > +                    old_d_size);
>
> Why not
>                 memcpy(new->nr_deferred, old->nr_deferred, old_d_size);
> ?
>
> > +             memset((void *)new->nr_deferred + old_d_size, 0,
> > +                    d_size - old_d_size);
> >
> >               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> >               call_rcu(&old->rcu, memcg_free_shrinker_info_rcu);
> > @@ -226,9 +237,6 @@ void memcg_free_shrinker_info(struct mem_cgroup *memcg)
> >       struct memcg_shrinker_info *info;
> >       int nid;
> >
> > -     if (mem_cgroup_is_root(memcg))
> > -             return;
> > -
> >       for_each_node(nid) {
> >               pn = mem_cgroup_nodeinfo(memcg, nid);
> >               info = rcu_dereference_protected(pn->shrinker_info, true);
> > @@ -242,12 +250,13 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
> >  {
> >       struct memcg_shrinker_info *info;
> >       int nid, size, ret = 0;
> > -
> > -     if (mem_cgroup_is_root(memcg))
> > -             return 0;
> > +     int m_size, d_size = 0;
> >
> >       down_read(&shrinker_rwsem);
> > -     size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     d_size = shrinker_nr_max * sizeof(atomic_long_t);
> > +     size = m_size + d_size;
> > +
> >       for_each_node(nid) {
> >               info = kvzalloc(sizeof(*info) + size, GFP_KERNEL);
> >               if (!info) {
> > @@ -255,6 +264,9 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
> >                       ret = -ENOMEM;
> >                       break;
> >               }
> > +             info->map = (unsigned long *)((unsigned long)info + sizeof(*info));
> > +             info->nr_deferred = (atomic_long_t *)((unsigned long)info +
> > +                                     sizeof(*info) + m_size);
> >               rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> >       }
> >       up_read(&shrinker_rwsem);
> > @@ -265,10 +277,16 @@ int memcg_alloc_shrinker_info(struct mem_cgroup *memcg)
> >  static int memcg_expand_shrinker_info(int new_id)
> >  {
> >       int size, old_size, ret = 0;
> > +     int m_size, d_size = 0;
> > +     int old_m_size, old_d_size = 0;
> >       struct mem_cgroup *memcg;
> >
> > -     size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> > -     old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     m_size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> > +     d_size = (new_id + 1) * sizeof(atomic_long_t);
> > +     size = m_size + d_size;
> > +     old_m_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> > +     old_d_size = shrinker_nr_max * sizeof(atomic_long_t);
> > +     old_size = old_m_size + old_d_size;
> >       if (size <= old_size)
> >               return 0;
>
> This replication of patch [4/11] looks awkwardly. Please, try to incorporate
> the same changes to nr_deferred as I requested for shrinker_map in [4/11].
>
> >
> > @@ -277,9 +295,8 @@ static int memcg_expand_shrinker_info(int new_id)
> >
> >       memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >       do {
> > -             if (mem_cgroup_is_root(memcg))
> > -                     continue;
> > -             ret = memcg_expand_one_shrinker_info(memcg, size, old_size);
> > +             ret = memcg_expand_one_shrinker_info(memcg, m_size, d_size,
> > +                                                  old_m_size, old_d_size);
> >               if (ret) {
> >                       mem_cgroup_iter_break(NULL, memcg);
> >                       goto out;
> >
>
>

* Re: [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size
  2021-01-06 10:15   ` Kirill Tkhai
  2021-01-11 17:44     ` Yang Shi
@ 2021-01-13 23:48     ` Yang Shi
  1 sibling, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-01-13 23:48 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Jan 6, 2021 at 2:16 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 06.01.2021 01:58, Yang Shi wrote:
> > Both memcg_shrinker_map_size and shrinker_nr_max is maintained, but actually the
> > map size can be calculated via shrinker_nr_max, so it seems unnecessary to keep both.
> > Remove memcg_shrinker_map_size since shrinker_nr_max is also used by iterating the
> > bit map.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 12 ++++--------
> >  1 file changed, 4 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ddb9f972f856..8da765a85569 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -185,8 +185,7 @@ static LIST_HEAD(shrinker_list);
> >  static DECLARE_RWSEM(shrinker_rwsem);
> >
> >  #ifdef CONFIG_MEMCG
> > -
> > -static int memcg_shrinker_map_size;
> > +static int shrinker_nr_max;
> >
> >  static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
> >  {
> > @@ -248,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
> >               return 0;
> >
> >       down_read(&shrinker_rwsem);
> > -     size = memcg_shrinker_map_size;
> > +     size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> >       for_each_node(nid) {
> >               map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
> >               if (!map) {
> > @@ -269,7 +268,7 @@ static int memcg_expand_shrinker_maps(int new_id)
> >       struct mem_cgroup *memcg;
> >
> >       size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> > -     old_size = memcg_shrinker_map_size;
> > +     old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> >       if (size <= old_size)
> >               return 0;
>
> These bunch of DIV_ROUND_UP() looks too complex. Since now all the shrinker maps allocation
> logic in the only file, can't we simplify this to look better? I mean something like below
> to merge in your patch:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b951c289ef3a..27b6371a1656 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -247,7 +247,7 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>                 return 0;
>
>         down_read(&shrinker_rwsem);
> -       size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> +       size = shrinker_nr_max / BITS_PER_BYTE;

The type of shrinker_maps->map is "unsigned long *", so I think we should
do "(shrinker_nr_max / BITS_PER_LONG + 1) * sizeof(unsigned long)".

And the "/ BITS_PER_BYTE" makes calculating the pointer to the nr_deferred
array harder in the following patch, since the length of the map array
may not be a multiple of sizeof(unsigned long). Without the nr_deferred
array, this change seems fine.
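
For illustration only, the three size expressions under discussion can be
compared in userspace. The function names below are hypothetical and the
macros are local stand-ins for the kernel ones. Note that "/ BITS_PER_BYTE"
is exact only when shrinker_nr_max is itself kept a multiple of
BITS_PER_LONG, which the quoted proposal appears to do by setting it to
size * BITS_PER_BYTE.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_BYTE	CHAR_BIT
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Bytes for the bitmap, rounded up to whole unsigned longs (DIV_ROUND_UP). */
static size_t map_size_round_up(size_t nr_max)
{
	return (nr_max + BITS_PER_LONG - 1) / BITS_PER_LONG *
	       sizeof(unsigned long);
}

/* Bytes using "(nr_max / BITS_PER_LONG + 1) * sizeof(unsigned long)". */
static size_t map_size_plus_one(size_t nr_max)
{
	return (nr_max / BITS_PER_LONG + 1) * sizeof(unsigned long);
}

/* Bytes using "nr_max / BITS_PER_BYTE"; truncates for arbitrary nr_max. */
static size_t map_size_bytes(size_t nr_max)
{
	return nr_max / BITS_PER_BYTE;
}

/* Round-up size with the two properties the nr_deferred offset relies on. */
static size_t checked_map_size(size_t nr_max)
{
	size_t size = map_size_round_up(nr_max);

	assert(size % sizeof(unsigned long) == 0);	/* whole longs */
	assert(size * BITS_PER_BYTE >= nr_max);		/* every bit fits */
	return size;
}
```

The "+ 1" form never undersizes the map but spends one extra unsigned long
whenever nr_max is an exact multiple of BITS_PER_LONG; the round-up form is
always a whole number of longs and never larger than needed.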

>         for_each_node(nid) {
>                 map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
>                 if (!map) {
> @@ -264,13 +264,11 @@ int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
>
>  static int memcg_expand_shrinker_maps(int new_id)
>  {
> -       int size, old_size, ret = 0;
> +       int size, old_size, new_nr_max, ret = 0;
>         struct mem_cgroup *memcg;
>
>         size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
> -       old_size = DIV_ROUND_UP(shrinker_nr_max, BITS_PER_LONG) * sizeof(unsigned long);
> -       if (size <= old_size)
> -               return 0;
> +       new_nr_max = size * BITS_PER_BYTE;
>
>         if (!root_mem_cgroup)
>                 goto out;
> @@ -287,6 +285,9 @@ static int memcg_expand_shrinker_maps(int new_id)
>         } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>
>  out:
> +       if (ret == 0)
> +               shrinker_nr_max = new_nr_max;
> +
>         return ret;
>  }
>
> @@ -334,8 +335,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>                         idr_remove(&shrinker_idr, id);
>                         goto unlock;
>                 }
> -
> -               shrinker_nr_max = id + 1;
>         }
>         shrinker->id = id;
>         ret = 0;
>

Thread overview: 43+ messages
2021-01-05 22:58 [RFC v3 PATCH 0/11] Make shrinker's nr_deferred memcg aware Yang Shi
2021-01-05 22:58 ` [v3 PATCH 01/11] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
2021-01-05 22:58 ` [v3 PATCH 02/11] mm: vmscan: consolidate shrinker_maps handling code Yang Shi
2021-01-07  0:13   ` Roman Gushchin
2021-01-07 17:29     ` Yang Shi
2021-01-11 19:00     ` Yang Shi
2021-01-11 19:37       ` Roman Gushchin
2021-01-11 19:43         ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 03/11] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation Yang Shi
2021-01-06  9:54   ` Kirill Tkhai
2021-01-11 17:08     ` Yang Shi
2021-01-11 17:33       ` Kirill Tkhai
2021-01-11 18:57         ` Yang Shi
2021-01-11 21:33           ` Kirill Tkhai
2021-01-12 21:23             ` Yang Shi
2021-01-13 18:16               ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 04/11] mm: vmscan: remove memcg_shrinker_map_size Yang Shi
2021-01-06 10:15   ` Kirill Tkhai
2021-01-11 17:44     ` Yang Shi
2021-01-13 23:48     ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 05/11] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
2021-01-06 10:21   ` Kirill Tkhai
2021-01-11 18:17     ` Yang Shi
2021-01-11 21:37       ` Kirill Tkhai
2021-01-12 20:58         ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 06/11] mm: memcontrol: rename shrinker_map to shrinker_info Yang Shi
2021-01-06 11:38   ` Kirill Tkhai
2021-01-11 18:19     ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 07/11] mm: vmscan: add per memcg shrinker nr_deferred Yang Shi
2021-01-06 11:06   ` Kirill Tkhai
2021-01-11 18:24     ` Yang Shi
2021-01-13 23:30     ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 08/11] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
2021-01-07  0:17   ` Roman Gushchin
2021-01-07 17:34     ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 09/11] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
2021-01-06 11:15   ` Kirill Tkhai
2021-01-11 18:40     ` Yang Shi
2021-01-11 21:57       ` Kirill Tkhai
2021-01-05 22:58 ` [v3 PATCH 10/11] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
2021-01-06 11:34   ` Kirill Tkhai
2021-01-11 18:43     ` Yang Shi
2021-01-05 22:58 ` [v3 PATCH 11/11] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
