* [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware
@ 2020-12-02 18:27 Yang Shi
  2020-12-02 18:27 ` [PATCH 1/9] mm: vmscan: simplify nr_deferred update code Yang Shi
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel


Recently a huge one-off slab cache drop was seen on some vfs-metadata-heavy workloads;
it turned out there was a huge number of accumulated nr_deferred objects seen by the
shrinker.

On our production machine, I saw an absurd nr_deferred value, shown in the below
tracing result:

<...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node.  Assuming all of them are
dentries (192 bytes per object), the total size of the deferred objects on one
node is ~480TB, which is definitely ridiculous.

I managed to reproduce this problem with a kernel build workload plus a negative
dentry generator.

First, run the below kernel build test script:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
        cgcreate -g memory:kern_build
        echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

        echo 3 > /proc/sys/vm/drop_caches
        cgexec -g memory:kern_build make clean > /dev/null 2>&1
        cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

        cgdelete -g memory:kern_build
done

Then run the below negative dentry generator script (each cat of a non-existent random
file name creates a negative dentry):

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
        while true; do
                FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                cat $FILE 2>/dev/null
        done &
done

Then kswapd will shrink half of the dentry cache in just one loop, as the below tracing
result shows:

	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There was a huge number of deferred objects before the shrinker was called.  The behavior
does match the code, but it may not be desirable from the user's standpoint.

The excessive nr_deferred might be accumulated for various reasons, for example (a
minimal sketch of the windup follows this list):
    * GFP_NOFS allocations
    * Many rounds of small scans (< scan_batch, 1024 for vfs metadata)
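
The sketch below is a minimal userspace illustration (not kernel code, and the numbers
are made up) of the deferral arithmetic in the current do_shrink_slab(): the clamps only
bound how much is scanned per call, not the deferred count itself, so whenever
->scan_objects() keeps bailing out (as the superblock shrinker does for GFP_NOFS
allocations) the counter just keeps growing:

#include <stdio.h>

int main(void)
{
	long nr_deferred = 0;		/* per shrinker, shared by every memcg */
	long freeable = 100000000;	/* objects reported by ->count_objects() */
	int priority = 12;		/* typical kswapd reclaim priority */
	int round;

	for (round = 0; round < 1000; round++) {
		/* delta as computed by do_shrink_slab() with DEFAULT_SEEKS (2) */
		long delta = 2 * (freeable >> priority);
		long total_scan = nr_deferred + delta;
		long scanned = 0;	/* GFP_NOFS: ->scan_objects() bails out */

		/* the unscanned work is pushed back for the next caller */
		nr_deferred = total_scan - scanned;
	}

	printf("nr_deferred after 1000 no-progress rounds: %ld\n", nr_deferred);
	return 0;
}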

However, the LRUs of slabs are per memcg (for memcg-aware shrinkers) while the deferred
objects are per shrinker; this may have some bad effects:
    * Poor isolation among memcgs.  Some memcgs which happen to have frequent limit
      reclaim may get nr_deferred accumulated to a huge number, then other innocent
      memcgs may take the fall.  In our case the main workload was hit.
    * Unbounded deferred objects.  There is no cap for deferred objects; they can grow
      ridiculously large, as the tracing result showed.
    * Easy to get out of control.  Although shrinkers take deferred objects into account,
      the count can go out of control easily.  One misconfigured memcg could incur an
      absurd amount of deferred objects in a short period of time.
    * Assorted reclaim problems, e.g. over-reclaim, long reclaim latency, etc.  There may
      be hundreds of GB of slab caches for a vfs-metadata-heavy workload; shrinking half
      of them may take minutes.  We observed latency spikes due to the prolonged reclaim.

These issues have also been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
This patchset is the outcome of that discussion.

So this patchset makes nr_deferred per memcg to tackle the problem (a condensed
illustration of the resulting lookup follows this list).  It does:
    * Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map
      does.  Unlike shrinker_map, it is an atomic_long_t array; each element represents
      one shrinker even if the shrinker is not memcg aware, which simplifies the
      implementation.  For memcg-aware shrinkers, the deferred objects are accumulated
      in their own memcg, and the shrinkers see only the nr_deferred of that memcg.
      Non-memcg-aware shrinkers still use the global nr_deferred in struct shrinker.
    * Once a memcg is offlined, its nr_deferred will be reparented to its parent along
      with the LRUs.
    * The root memcg has a memcg_shrinker_deferred array too.  It simplifies the handling
      of reparenting to the root memcg.
    * Cap nr_deferred to 2x the length of the LRU.  The idea is borrowed from Dave
      Chinner's series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
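
For reference, the resulting per-memcg lookup boils down to roughly the following
(condensed from patches 5 and 6 below; the nid selection, the !memcg fallback and error
handling are omitted):

/* shrinker::id indexed, one array per memcg per node (patch 5) */
struct memcg_shrinker_deferred {
	struct rcu_head rcu;
	atomic_long_t nr_deferred[];
};

/* what count_nr_deferred() does for a memcg aware shrinker (patch 6) */
deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred, true);
nr = atomic_long_xchg(&deferred->nr_deferred[shrinker->id], 0);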

The downside is that each memcg has to allocate extra memory to store the nr_deferred
array.  In our production environment there are typically around 40 shrinkers, so each
memcg needs ~320 bytes, and 10K memcgs would need ~3.2MB of memory.  That seems fine.
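
As a rough check of that estimate (assuming 8-byte atomic_long_t entries and counting a
single node):

    40 shrinkers  * 8 bytes   ~= 320 bytes per memcg
    10,000 memcgs * 320 bytes ~= 3.2 MB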

We have been running the patched kernel on some hosts of our fleet (test and production)
for months, and it works very well.  The monitoring data shows the working set is
sustained as expected.

Yang Shi (9):
      mm: vmscan: simplify nr_deferred update code
      mm: vmscan: use nid from shrink_control for tracepoint
      mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex
      mm: vmscan: use a new flag to indicate shrinker is registered
      mm: memcontrol: add per memcg shrinker nr_deferred
      mm: vmscan: use per memcg nr_deferred of shrinker
      mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
      mm: memcontrol: reparent nr_deferred when memcg offline
      mm: vmscan: shrink deferred objects proportional to priority

 include/linux/memcontrol.h |   9 +++++
 include/linux/shrinker.h   |   8 ++++
 mm/memcontrol.c            | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c                | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
 4 files changed, 274 insertions(+), 74 deletions(-)



* [PATCH 1/9] mm: vmscan: simplify nr_deferred update code
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-03  2:56   ` Roman Gushchin
  2020-12-02 18:27 ` [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently, if (next_deferred - scanned) == 0, the code just reads the current
nr_deferred, otherwise it adds the delta back.  Both need an atomic operation anyway,
and there seems to be little gain in distinguishing the two cases, so just add the
delta back even when the delta is 0.  This also simplifies the code for the following
patches.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b4e31eac2cf..7d6186a07daf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -528,14 +528,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		next_deferred = 0;
 	/*
 	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates. If we exhausted the
-	 * scan, there is no need to do an update.
+	 * manner that handles concurrent updates.
 	 */
-	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
-						&shrinker->nr_deferred[nid]);
-	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+	new_nr = atomic_long_add_return(next_deferred,
+					&shrinker->nr_deferred[nid]);
 
 	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
 	return freed;
-- 
2.26.2



* [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
  2020-12-02 18:27 ` [PATCH 1/9] mm: vmscan: simplify nr_deferred update code Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-03  3:13   ` Xiaqing (A)
  2020-12-02 18:27 ` [PATCH 3/9] mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex Yang Shi
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The tracepoint's nid should show which node the shrink happens on.  The start tracepoint
uses the nid from shrinkctl, but the nid might be set to 0 before the end tracepoint if
the shrinker is not NUMA aware, so the tracing log may show the shrink starting on one
node but ending up on another.  This is confusing.  And since the following patch will
remove using nid directly in do_shrink_slab(), this patch also helps clean up the code.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7d6186a07daf..457ce04eebf2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -533,7 +533,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	new_nr = atomic_long_add_return(next_deferred,
 					&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
 	return freed;
 }
 
-- 
2.26.2



* [PATCH 3/9] mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
  2020-12-02 18:27 ` [PATCH 1/9] mm: vmscan: simplify nr_deferred update code Yang Shi
  2020-12-02 18:27 ` [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-02 18:27 ` [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The following patch will add memcg_shrinker_deferred, which can be protected by the
same mutex, so rename the mutex to a more generic name.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memcontrol.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 29459a6ce1c7..19e41684c96b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -395,7 +395,7 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
 static int memcg_shrinker_map_size;
-static DEFINE_MUTEX(memcg_shrinker_map_mutex);
+static DEFINE_MUTEX(memcg_shrinker_mutex);
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 {
@@ -408,7 +408,7 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 	struct memcg_shrinker_map *new, *old;
 	int nid;
 
-	lockdep_assert_held(&memcg_shrinker_map_mutex);
+	lockdep_assert_held(&memcg_shrinker_mutex);
 
 	for_each_node(nid) {
 		old = rcu_dereference_protected(
@@ -458,7 +458,7 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	if (mem_cgroup_is_root(memcg))
 		return 0;
 
-	mutex_lock(&memcg_shrinker_map_mutex);
+	mutex_lock(&memcg_shrinker_mutex);
 	size = memcg_shrinker_map_size;
 	for_each_node(nid) {
 		map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
@@ -469,7 +469,7 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 		}
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
 	}
-	mutex_unlock(&memcg_shrinker_map_mutex);
+	mutex_unlock(&memcg_shrinker_mutex);
 
 	return ret;
 }
@@ -484,7 +484,7 @@ int memcg_expand_shrinker_maps(int new_id)
 	if (size <= old_size)
 		return 0;
 
-	mutex_lock(&memcg_shrinker_map_mutex);
+	mutex_lock(&memcg_shrinker_mutex);
 	if (!root_mem_cgroup)
 		goto unlock;
 
@@ -500,7 +500,7 @@ int memcg_expand_shrinker_maps(int new_id)
 unlock:
 	if (!ret)
 		memcg_shrinker_map_size = size;
-	mutex_unlock(&memcg_shrinker_map_mutex);
+	mutex_unlock(&memcg_shrinker_mutex);
 	return ret;
 }
 
-- 
2.26.2



* [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (2 preceding siblings ...)
  2020-12-02 18:27 ` [PATCH 3/9] mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-03  3:01   ` Roman Gushchin
  2020-12-02 18:27 ` [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
This approach is fine with nr_deferred at the shrinker level, but the following
patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
from unregistering correctly.

Introduce a new "state" field to indicate whether the shrinker is registered or not.
We could use the highest bit of flags, but it may be a little complicated to
extract that bit, and flags is accessed frequently by vmscan (every time a shrinker
is called).  So add a new field in "struct shrinker"; we may waste a little memory,
but it should be very little since there should not be too many registered
shrinkers on a normal system.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/shrinker.h |  4 ++++
 mm/vmscan.c              | 13 +++++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..0bb5be88e41d 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -35,6 +35,9 @@ struct shrink_control {
 
 #define SHRINK_STOP (~0UL)
 #define SHRINK_EMPTY (~0UL - 1)
+
+#define SHRINKER_REGISTERED	0x1
+
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
@@ -66,6 +69,7 @@ struct shrinker {
 	long batch;	/* reclaim batch size, 0 = default */
 	int seeks;	/* seeks to recreate an obj */
 	unsigned flags;
+	unsigned state;
 
 	/* These are for internal use */
 	struct list_head list;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 457ce04eebf2..0d628299e55c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -378,6 +378,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		idr_replace(&shrinker_idr, shrinker, shrinker->id);
 #endif
+	shrinker->state |= SHRINKER_REGISTERED;
 	up_write(&shrinker_rwsem);
 }
 
@@ -397,13 +398,17 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
-		return;
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-		unregister_memcg_shrinker(shrinker);
 	down_write(&shrinker_rwsem);
+	if (!shrinker->state) {
+		up_write(&shrinker_rwsem);
+		return;
+	}
 	list_del(&shrinker->list);
+	shrinker->state &= ~SHRINKER_REGISTERED;
 	up_write(&shrinker_rwsem);
+
+	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+		unregister_memcg_shrinker(shrinker);
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
 }
-- 
2.26.2



* [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (3 preceding siblings ...)
  2020-12-02 18:27 ` [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-03  3:06   ` Roman Gushchin
  2020-12-10 15:33   ` Johannes Weiner
  2020-12-02 18:27 ` [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently the number of deferred objects is per shrinker, but some slabs, for example
the vfs inode/dentry caches, are per memcg; this results in poor isolation among memcgs.

The deferred objects typically are generated by __GFP_NOFS allocations; one memcg with
excessive __GFP_NOFS allocations may blow up the deferred objects, then other innocent
memcgs may suffer from over-shrink, excessive reclaim latency, etc.

For example, two workloads run in memcgA and memcgB respectively, and the workload in B
is vfs heavy.  The workload in A generates excessive deferred objects, then B's vfs cache
might be hit heavily (half of the caches dropped) by B's limit reclaim or global reclaim.

We observed this hit in our production environment, which was running a vfs heavy
workload, shown in the below tracing log:

<...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138
<...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
last shrinker return val 123186855

The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches were
dropped.  This also resulted in a significant amount of page cache being dropped due to
inode eviction.

Making nr_deferred per memcg for memcg aware shrinkers solves the unfairness and brings
better isolation.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
is used.  And non-memcg-aware shrinkers use the shrinker's nr_deferred all the time.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |   9 +++
 mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   4 ++
 3 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 922a7f600465..1b343b268359 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,13 @@ struct lruvec_stat {
 	long count[NR_VM_NODE_STAT_ITEMS];
 };
 
+
+/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
+struct memcg_shrinker_deferred {
+	struct rcu_head rcu;
+	atomic_long_t nr_deferred[];
+};
+
 /*
  * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
  * which have elements charged to this memcg.
@@ -119,6 +126,7 @@ struct mem_cgroup_per_node {
 	struct mem_cgroup_reclaim_iter	iter;
 
 	struct memcg_shrinker_map __rcu	*shrinker_map;
+	struct memcg_shrinker_deferred __rcu	*shrinker_deferred;
 
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long		usage_in_excess;/* Set to the value by which */
@@ -1489,6 +1497,7 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 }
 
 extern int memcg_expand_shrinker_maps(int new_id);
+extern int memcg_expand_shrinker_deferred(int new_id);
 
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19e41684c96b..d3d5c88db179 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -395,6 +395,8 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
 static int memcg_shrinker_map_size;
+static int memcg_shrinker_deferred_size;
+
 static DEFINE_MUTEX(memcg_shrinker_mutex);
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
@@ -402,6 +404,11 @@ static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
 }
 
+static void memcg_free_shrinker_deferred_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct memcg_shrinker_deferred, rcu));
+}
+
 static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 					 int size, int old_size)
 {
@@ -432,6 +439,36 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static int memcg_expand_one_shrinker_deferred(struct mem_cgroup *memcg,
+					      int size, int old_size)
+{
+	struct memcg_shrinker_deferred *new, *old;
+	int nid;
+
+	lockdep_assert_held(&memcg_shrinker_mutex);
+
+	for_each_node(nid) {
+		old = rcu_dereference_protected(
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_deferred, true);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if (!new)
+			return -ENOMEM;
+
+		/* Copy all old values, and clear all new ones */
+		memcpy((void *)new->nr_deferred, (void *)old->nr_deferred, old_size);
+		memset((void *)new->nr_deferred + old_size, 0, size - old_size);
+
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, new);
+		call_rcu(&old->rcu, memcg_free_shrinker_deferred_rcu);
+	}
+
+	return 0;
+}
+
 static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_node *pn;
@@ -450,6 +487,21 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 	}
 }
 
+static void memcg_free_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct memcg_shrinker_deferred *deferred;
+	int nid;
+
+	for_each_node(nid) {
+		pn = mem_cgroup_nodeinfo(memcg, nid);
+		deferred = rcu_dereference_protected(pn->shrinker_deferred, true);
+		if (deferred)
+			kvfree(deferred);
+		rcu_assign_pointer(pn->shrinker_deferred, NULL);
+	}
+}
+
 static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct memcg_shrinker_map *map;
@@ -474,6 +526,27 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static int memcg_alloc_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_deferred *deferred;
+	int nid, size, ret = 0;
+
+	mutex_lock(&memcg_shrinker_mutex);
+	size = memcg_shrinker_deferred_size;
+	for_each_node(nid) {
+		deferred = kvzalloc_node(sizeof(*deferred) + size, GFP_KERNEL, nid);
+		if (!deferred) {
+			memcg_free_shrinker_deferred(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, deferred);
+	}
+	mutex_unlock(&memcg_shrinker_mutex);
+
+	return ret;
+}
+
 int memcg_expand_shrinker_maps(int new_id)
 {
 	int size, old_size, ret = 0;
@@ -504,6 +577,34 @@ int memcg_expand_shrinker_maps(int new_id)
 	return ret;
 }
 
+int memcg_expand_shrinker_deferred(int new_id)
+{
+	int size, old_size, ret = 0;
+	struct mem_cgroup *memcg;
+
+	size = (new_id + 1) * sizeof(atomic_long_t);
+	old_size = memcg_shrinker_deferred_size;
+	if (size <= old_size)
+		return 0;
+
+	mutex_lock(&memcg_shrinker_mutex);
+	if (!root_mem_cgroup)
+		goto unlock;
+
+	for_each_mem_cgroup(memcg) {
+		ret = memcg_expand_one_shrinker_deferred(memcg, size, old_size);
+		if (ret) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto unlock;
+		}
+	}
+unlock:
+	if (!ret)
+		memcg_shrinker_deferred_size = size;
+	mutex_unlock(&memcg_shrinker_mutex);
+	return ret;
+}
+
 void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
@@ -5400,8 +5501,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	/*
-	 * A memcg must be visible for memcg_expand_shrinker_maps()
-	 * by the time the maps are allocated. So, we allocate maps
+	 * A memcg must be visible for memcg_expand_shrinker_{maps|deferred}()
+	 * by the time the maps are allocated. So, we allocate maps and deferred
 	 * here, when for_each_mem_cgroup() can't skip it.
 	 */
 	if (memcg_alloc_shrinker_maps(memcg)) {
@@ -5409,6 +5510,12 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		return -ENOMEM;
 	}
 
+	if (memcg_alloc_shrinker_deferred(memcg)) {
+		memcg_free_shrinker_maps(memcg);
+		mem_cgroup_id_remove(memcg);
+		return -ENOMEM;
+	}
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
@@ -5469,6 +5576,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
 	memcg_free_shrinker_maps(memcg);
+	memcg_free_shrinker_deferred(memcg);
 	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d628299e55c..cba0bc8d4661 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -219,6 +219,10 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 			goto unlock;
 		}
 
+		if (memcg_expand_shrinker_deferred(id)) {
+			idr_remove(&shrinker_idr, id);
+			goto unlock;
+		}
 		shrinker_nr_max = id + 1;
 	}
 	shrinker->id = id;
-- 
2.26.2



* [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (4 preceding siblings ...)
  2020-12-02 18:27 ` [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-03  3:08   ` Roman Gushchin
  2020-12-03 11:40   ` Kirill Tkhai
  2020-12-02 18:27 ` [PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Use the per-memcg nr_deferred for memcg aware shrinkers.  The shrinker's own nr_deferred
will still be used in the following cases:
    1. Non memcg aware shrinkers
    2. !CONFIG_MEMCG
    3. memcg is disabled by boot parameter

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cba0bc8d4661..d569fdcaba79 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
 static DEFINE_IDR(shrinker_idr);
 static int shrinker_nr_max;
 
+static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
+{
+	return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
+		!mem_cgroup_disabled();
+}
+
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
 	int id, ret = -ENOMEM;
@@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 #endif
 	return false;
 }
+
+static inline long count_nr_deferred(struct shrinker *shrinker,
+				     struct shrink_control *sc)
+{
+	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
+	struct memcg_shrinker_deferred *deferred;
+	struct mem_cgroup *memcg = sc->memcg;
+	int nid = sc->nid;
+	int id = shrinker->id;
+	long nr;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (per_memcg_deferred) {
+		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
+						     true);
+		nr = atomic_long_xchg(&deferred->nr_deferred[id], 0);
+	} else
+		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+
+	return nr;
+}
+
+static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
+				   struct shrink_control *sc)
+{
+	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
+	struct memcg_shrinker_deferred *deferred;
+	struct mem_cgroup *memcg = sc->memcg;
+	int nid = sc->nid;
+	int id = shrinker->id;
+	long new_nr;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (per_memcg_deferred) {
+		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
+						     true);
+		new_nr = atomic_long_add_return(nr, &deferred->nr_deferred[id]);
+	} else
+		new_nr = atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
+
+	return new_nr;
+}
 #else
+static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
+{
+	return false;
+}
+
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
 	return 0;
@@ -290,6 +347,29 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
+
+static inline long count_nr_deferred(struct shrinker *shrinker,
+				     struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+}
+
+static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
+				   struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	return atomic_long_add_return(nr,
+				      &shrinker->nr_deferred[nid]);
+}
 #endif
 
 /*
@@ -429,13 +509,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	long freeable;
 	long nr;
 	long new_nr;
-	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
 	long scanned = 0, next_deferred;
 
-	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-		nid = 0;
 
 	freeable = shrinker->count_objects(shrinker, shrinkctl);
 	if (freeable == 0 || freeable == SHRINK_EMPTY)
@@ -446,7 +523,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * and zero it so that other concurrent shrinker invocations
 	 * don't also do this scanning work.
 	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+	nr = count_nr_deferred(shrinker, shrinkctl);
 
 	total_scan = nr;
 	if (shrinker->seeks) {
@@ -539,8 +616,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates.
 	 */
-	new_nr = atomic_long_add_return(next_deferred,
-					&shrinker->nr_deferred[nid]);
+	new_nr = set_nr_deferred(next_deferred, shrinker, shrinkctl);
 
 	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
 	return freed;
-- 
2.26.2



* [PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (5 preceding siblings ...)
  2020-12-02 18:27 ` [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-02 18:27 ` [PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Now that nr_deferred is available at the per-memcg level for memcg aware shrinkers,
there is no need to allocate shrinker->nr_deferred for such shrinkers anymore.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d569fdcaba79..5fd57060fafd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -420,7 +420,15 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
  */
 int prealloc_shrinker(struct shrinker *shrinker)
 {
-	unsigned int size = sizeof(*shrinker->nr_deferred);
+	unsigned int size;
+
+	if (is_deferred_memcg_aware(shrinker)) {
+		if (prealloc_memcg_shrinker(shrinker))
+			return -ENOMEM;
+		return 0;
+	}
+
+	size = sizeof(*shrinker->nr_deferred);
 
 	if (shrinker->flags & SHRINKER_NUMA_AWARE)
 		size *= nr_node_ids;
@@ -429,26 +437,18 @@ int prealloc_shrinker(struct shrinker *shrinker)
 	if (!shrinker->nr_deferred)
 		return -ENOMEM;
 
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		if (prealloc_memcg_shrinker(shrinker))
-			goto free_deferred;
-	}
-
 	return 0;
-
-free_deferred:
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
-	return -ENOMEM;
 }
 
 void free_prealloced_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
+	if (is_deferred_memcg_aware(shrinker)) {
+		unregister_memcg_shrinker(shrinker);
 		return;
+	}
 
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-		unregister_memcg_shrinker(shrinker);
+	if (!shrinker->nr_deferred)
+		return;
 
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
-- 
2.26.2



* [PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (6 preceding siblings ...)
  2020-12-02 18:27 ` [PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-02 18:27 ` [PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
  2020-12-03  2:52 ` [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Roman Gushchin
  9 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Now that the shrinker's nr_deferred is per memcg for memcg aware shrinkers, add the
child memcg's nr_deferred to the parent's corresponding nr_deferred when the memcg
goes offline.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/shrinker.h |  4 ++++
 mm/memcontrol.c          | 24 ++++++++++++++++++++++++
 mm/vmscan.c              |  2 +-
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0bb5be88e41d..8f095e424799 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -82,6 +82,10 @@ struct shrinker {
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
+#ifdef CONFIG_MEMCG
+extern int shrinker_nr_max;
+#endif
+
 /* Flags */
 #define SHRINKER_NUMA_AWARE	(1 << 0)
 #define SHRINKER_MEMCG_AWARE	(1 << 1)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d3d5c88db179..df128cab900f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -59,6 +59,7 @@
 #include <linux/tracehook.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/shrinker.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -619,6 +620,28 @@ void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 	}
 }
 
+static void memcg_reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	int i, nid;
+	long nr;
+	struct mem_cgroup *parent;
+	struct memcg_shrinker_deferred *child_deferred, *parent_deferred;
+
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+
+	for_each_node(nid) {
+		child_deferred = memcg->nodeinfo[nid]->shrinker_deferred;
+		parent_deferred = parent->nodeinfo[nid]->shrinker_deferred;
+		for (i = 0; i < shrinker_nr_max; i ++) {
+			nr = atomic_long_read(&child_deferred->nr_deferred[i]);
+			atomic_long_add(nr,
+				&parent_deferred->nr_deferred[i]);
+		}
+	}
+}
+
 /**
  * mem_cgroup_css_from_page - css of the memcg associated with a page
  * @page: page of interest
@@ -5543,6 +5566,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	page_counter_set_low(&memcg->memory, 0);
 
 	memcg_offline_kmem(memcg);
+	memcg_reparent_shrinker_deferred(memcg);
 	wb_memcg_offline(memcg);
 
 	drain_all_stock(memcg);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5fd57060fafd..7dc8075c371b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -201,7 +201,7 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
 
 static DEFINE_IDR(shrinker_idr);
-static int shrinker_nr_max;
+int shrinker_nr_max;
 
 static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
 {
-- 
2.26.2



* [PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (7 preceding siblings ...)
  2020-12-02 18:27 ` [PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
@ 2020-12-02 18:27 ` Yang Shi
  2020-12-03  2:52 ` [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Roman Gushchin
  9 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-02 18:27 UTC (permalink / raw)
  To: guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The number of deferred objects might wind up to an absurd number, which results in the
slab caches being clamped down hard.  That is undesirable for sustaining the working set.

So shrink the deferred objects in proportion to the reclaim priority and cap nr_deferred
to twice the number of cache items.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 40 +++++-----------------------------------
 1 file changed, 5 insertions(+), 35 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7dc8075c371b..9d2a6485e982 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -525,7 +525,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 */
 	nr = count_nr_deferred(shrinker, shrinkctl);
 
-	total_scan = nr;
 	if (shrinker->seeks) {
 		delta = freeable >> priority;
 		delta *= 4;
@@ -539,37 +538,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		delta = freeable / 2;
 	}
 
+	total_scan = nr >> priority;
 	total_scan += delta;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = freeable;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
-
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
-
-	/*
-	 * Avoid risking looping forever due to too large nr value:
-	 * never try to free more than twice the estimate number of
-	 * freeable entries.
-	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	total_scan = min(total_scan, (2 * freeable));
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
@@ -608,10 +579,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
-	else
-		next_deferred = 0;
+	next_deferred = max_t(long, (nr - scanned), 0) + total_scan;
+	next_deferred = min(next_deferred, (2 * freeable));
+
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates.
-- 
2.26.2



* Re: [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware
  2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
                   ` (8 preceding siblings ...)
  2020-12-02 18:27 ` [PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
@ 2020-12-03  2:52 ` Roman Gushchin
  2020-12-03 17:52   ` Yang Shi
  9 siblings, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03  2:52 UTC (permalink / raw)
  To: Yang Shi
  Cc: ktkhai, shakeelb, david, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Dec 02, 2020 at 10:27:16AM -0800, Yang Shi wrote:
> 
> Recently huge amount one-off slab drop was seen on some vfs metadata heavy workloads,
> it turned out there were huge amount accumulated nr_deferred objects seen by the
> shrinker.
> 
> On our production machine, I saw absurd number of nr_deferred shown as the below
> tracing result: 
> 
> <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> 9300 cache items 1667 delta 11 total_scan 833
> 
> There are 2.5 trillion deferred objects on one node, assuming all of them
> are dentry (192 bytes per object), so the total size of deferred on
> one node is ~480TB. It is definitely ridiculous.
> 
> I managed to reproduce this problem with kernel build workload plus negative dentry
> generator.
> 
> First step, run the below kernel build test script:
> 
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> 
> cd /root/Buildarea/linux-stable
> 
> for i in `seq 1500`; do
>         cgcreate -g memory:kern_build
>         echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
> 
>         echo 3 > /proc/sys/vm/drop_caches
>         cgexec -g memory:kern_build make clean > /dev/null 2>&1
>         cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
> 
>         cgdelete -g memory:kern_build
> done
> 
> Then run the below negative dentry generator script:
> 
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> 
> mkdir /sys/fs/cgroup/memory/test
> echo $$ > /sys/fs/cgroup/memory/test/tasks
> 
> for i in `seq $NR_CPUS`; do
>         while true; do
>                 FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
>                 cat $FILE 2>/dev/null
>         done &
> done
> 
> Then kswapd will shrink half of dentry cache in just one loop as the below tracing result
> showed:
> 
> 	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> 	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
> 
> There were huge number of deferred objects before the shrinker was called, the behavior
> does match the code but it might be not desirable from the user's stand of point.
> 
> The excessive amount of nr_deferred might be accumulated due to various reasons, for example:
>     * GFP_NOFS allocation
>     * Significant times of small amount scan (< scan_batch, 1024 for vfs metadata)
> 
> However the LRUs of slabs are per memcg (memcg-aware shrinkers) but the deferred objects
> is per shrinker, this may have some bad effects:
>     * Poor isolation among memcgs. Some memcgs which happen to have frequent limit
>       reclaim may get nr_deferred accumulated to a huge number, then other innocent
>       memcgs may take the fall. In our case the main workload was hit.
>     * Unbounded deferred objects. There is no cap for deferred objects, it can outgrow
>       ridiculously as the tracing result showed.
>     * Easy to get out of control. Although shrinkers take into account deferred objects,
>       but it can go out of control easily. One misconfigured memcg could incur absurd 
>       amount of deferred objects in a period of time.
>     * Sort of reclaim problems, i.e. over reclaim, long reclaim latency, etc. There may be
>       hundred GB slab caches for vfe metadata heavy workload, shrink half of them may take
>       minutes. We observed latency spike due to the prolonged reclaim.
> 
> These issues also have been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
> The patchset is the outcome of that discussion.
> 
> So this patchset makes nr_deferred per-memcg to tackle the problem. It does:
>     * Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map
>       does. Instead it is an atomic_long_t array, each element represent one shrinker
>       even though the shrinker is not memcg aware, this simplifies the implementation.
>       For memcg aware shrinkers, the deferred objects are just accumulated to its own
>       memcg. The shrinkers just see nr_deferred from its own memcg. Non memcg aware
>       shrinkers still use global nr_deferred from struct shrinker.
>     * Once the memcg is offlined, its nr_deferred will be reparented to its parent along
>       with LRUs.
>     * The root memcg has memcg_shrinker_deferred array too. It simplifies the handling of
>       reparenting to root memcg.
>     * Cap nr_deferred to 2x of the length of lru. The idea is borrowed from Dave Chinner's
>       series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
> 
> The downside is each memcg has to allocate extra memory to store the nr_deferred array.
> On our production environment, there are typically around 40 shrinkers, so each memcg
> needs ~320 bytes. 10K memcgs would need ~3.2MB memory. It seems fine.
> 
> We have been running the patched kernel on some hosts of our fleet (test and production) for
> months, it works very well. The monitor data shows the working set is sustained as expected.

Hello Yang!

The rationale is very well described and makes perfect sense to me.
I fully support the idea to make nr_deferred per-memcg.
Thank you for such a detailed description!

More comments in individual patches.

Thanks!

> 
> Yang Shi (9):
>       mm: vmscan: simplify nr_deferred update code
>       mm: vmscan: use nid from shrink_control for tracepoint
>       mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex
>       mm: vmscan: use a new flag to indicate shrinker is registered
>       mm: memcontrol: add per memcg shrinker nr_deferred
>       mm: vmscan: use per memcg nr_deferred of shrinker
>       mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
>       mm: memcontrol: reparent nr_deferred when memcg offline
>       mm: vmscan: shrink deferred objects proportional to priority
> 
>  include/linux/memcontrol.h |   9 +++++
>  include/linux/shrinker.h   |   8 ++++
>  mm/memcontrol.c            | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  mm/vmscan.c                | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
>  4 files changed, 274 insertions(+), 74 deletions(-)
> 


* Re: [PATCH 1/9] mm: vmscan: simplify nr_deferred update code
  2020-12-02 18:27 ` [PATCH 1/9] mm: vmscan: simplify nr_deferred update code Yang Shi
@ 2020-12-03  2:56   ` Roman Gushchin
  0 siblings, 0 replies; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03  2:56 UTC (permalink / raw)
  To: Yang Shi
  Cc: ktkhai, shakeelb, david, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Dec 02, 2020 at 10:27:17AM -0800, Yang Shi wrote:
> Currently if (next_deferred - scanned) = 0, the code would just read the current
> nr_deferred otherwise add the delta back.  Both needs atomic operation anyway,

But atomic_read() is usually way cheaper than any atomic write.

> it
> seems there is not too much gain by distinguishing the two cases, so just add the
> delta back even though the delta is 0.  This would simply the code for the following
> patches too.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b4e31eac2cf..7d6186a07daf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -528,14 +528,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		next_deferred = 0;
>  	/*
>  	 * move the unused scan count back into the shrinker in a
> -	 * manner that handles concurrent updates. If we exhausted the
> -	 * scan, there is no need to do an update.
> +	 * manner that handles concurrent updates.
>  	 */
> -	if (next_deferred > 0)
> -		new_nr = atomic_long_add_return(next_deferred,
> -						&shrinker->nr_deferred[nid]);
> -	else
> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> +	new_nr = atomic_long_add_return(next_deferred,
> +					&shrinker->nr_deferred[nid]);

So looking at this patch standalone, it's a bit hard to buy in.  Maybe it's better to
merge the change into another patch, if that will make it more obvious why this change
is required.  Or just leave things as they are.


* Re: [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-02 18:27 ` [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
@ 2020-12-03  3:01   ` Roman Gushchin
  2020-12-03  4:59     ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03  3:01 UTC (permalink / raw)
  To: Yang Shi
  Cc: ktkhai, shakeelb, david, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Dec 02, 2020 at 10:27:20AM -0800, Yang Shi wrote:
> Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> This approach is fine with nr_deferred atthe shrinker level, but the following
> patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> from unregistering correctly.
> 
> Introduce a new "state" field to indicate if shrinker is registered or not.
> We could use the highest bit of flags, but it may be a little bit complicated to
> extract that bit and the flags is accessed frequently by vmscan (every time shrinker
> is called).  So add a new field in "struct shrinker", we may waster a little bit
> memory, but it should be very few since there should be not too many registered
> shrinkers on a normal system.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/shrinker.h |  4 ++++
>  mm/vmscan.c              | 13 +++++++++----
>  2 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 0f80123650e2..0bb5be88e41d 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -35,6 +35,9 @@ struct shrink_control {
>  
>  #define SHRINK_STOP (~0UL)
>  #define SHRINK_EMPTY (~0UL - 1)
> +
> +#define SHRINKER_REGISTERED	0x1
> +
>  /*
>   * A callback you can register to apply pressure to ageable caches.
>   *
> @@ -66,6 +69,7 @@ struct shrinker {
>  	long batch;	/* reclaim batch size, 0 = default */
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
> +	unsigned state;

Hm, can't it be another flag? It seems like we have plenty of free bits.

>  
>  	/* These are for internal use */
>  	struct list_head list;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 457ce04eebf2..0d628299e55c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -378,6 +378,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		idr_replace(&shrinker_idr, shrinker, shrinker->id);
>  #endif
> +	shrinker->state |= SHRINKER_REGISTERED;
>  	up_write(&shrinker_rwsem);
>  }
>  
> @@ -397,13 +398,17 @@ EXPORT_SYMBOL(register_shrinker);
>   */
>  void unregister_shrinker(struct shrinker *shrinker)
>  {
> -	if (!shrinker->nr_deferred)
> -		return;
> -	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> -		unregister_memcg_shrinker(shrinker);
>  	down_write(&shrinker_rwsem);
> +	if (!shrinker->state) {
> +		up_write(&shrinker_rwsem);
> +		return;
> +	}
>  	list_del(&shrinker->list);
> +	shrinker->state &= ~SHRINKER_REGISTERED;
>  	up_write(&shrinker_rwsem);
> +
> +	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> +		unregister_memcg_shrinker(shrinker);
>  	kfree(shrinker->nr_deferred);
>  	shrinker->nr_deferred = NULL;
>  }
> -- 
> 2.26.2
> 


* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-02 18:27 ` [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
@ 2020-12-03  3:06   ` Roman Gushchin
  2020-12-03  4:54     ` Yang Shi
  2020-12-10 15:33   ` Johannes Weiner
  1 sibling, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03  3:06 UTC (permalink / raw)
  To: Yang Shi
  Cc: ktkhai, shakeelb, david, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> Currently the number of deferred objects are per shrinker, but some slabs, for example,
> vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> 
> The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> may suffer from over shrink, excessive reclaim latency, etc.
> 
> For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> 
> We observed this hit in our production environment which was running vfs heavy workload
> shown as the below tracing log:
> 
> <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> last shrinker return val 123186855
> 
> The vfs cache and page cache ration was 10:1 on this machine, and half of caches were dropped.
> This also resulted in significant amount of page caches were dropped due to inodes eviction.
> 
> Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> better isolation.
> 
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  include/linux/memcontrol.h |   9 +++
>  mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
>  mm/vmscan.c                |   4 ++
>  3 files changed, 123 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 922a7f600465..1b343b268359 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -92,6 +92,13 @@ struct lruvec_stat {
>  	long count[NR_VM_NODE_STAT_ITEMS];
>  };
>  
> +
> +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> +struct memcg_shrinker_deferred {
> +	struct rcu_head rcu;
> +	atomic_long_t nr_deferred[];
> +};

The idea makes total sense to me. But I wonder if we can add nr_deferred to
struct list_lru_one, instead of adding another per-memcg per-shrinker entity?
I guess it can simplify the code quite a lot. What do you think?


* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-02 18:27 ` [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
@ 2020-12-03  3:08   ` Roman Gushchin
  2020-12-03  5:01     ` Yang Shi
  2020-12-03 11:40   ` Kirill Tkhai
  1 sibling, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03  3:08 UTC (permalink / raw)
  To: Yang Shi
  Cc: ktkhai, shakeelb, david, hannes, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Dec 02, 2020 at 10:27:22AM -0800, Yang Shi wrote:
> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> will be used in the following cases:
>     1. Non memcg aware shrinkers
>     2. !CONFIG_MEMCG
>     3. memcg is disabled by boot parameter
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cba0bc8d4661..d569fdcaba79 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  static DEFINE_IDR(shrinker_idr);
>  static int shrinker_nr_max;
>  
> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> +{
> +	return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> +		!mem_cgroup_disabled();
> +}
> +
>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
>  	int id, ret = -ENOMEM;
> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  #endif
>  	return false;
>  }
> +
> +static inline long count_nr_deferred(struct shrinker *shrinker,
> +				     struct shrink_control *sc)
> +{
> +	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> +	struct memcg_shrinker_deferred *deferred;
> +	struct mem_cgroup *memcg = sc->memcg;
> +	int nid = sc->nid;
> +	int id = shrinker->id;
> +	long nr;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	if (per_memcg_deferred) {
> +		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> +						     true);
> +		nr = atomic_long_xchg(&deferred->nr_deferred[id], 0);
> +	} else
> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +
> +	return nr;
> +}
> +
> +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> +				   struct shrink_control *sc)
> +{
> +	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> +	struct memcg_shrinker_deferred *deferred;
> +	struct mem_cgroup *memcg = sc->memcg;
> +	int nid = sc->nid;
> +	int id = shrinker->id;
> +	long new_nr;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	if (per_memcg_deferred) {
> +		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> +						     true);
> +		new_nr = atomic_long_add_return(nr, &deferred->nr_deferred[id]);
> +	} else
> +		new_nr = atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> +
> +	return new_nr;
> +}
>  #else
> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> +{
> +	return false;
> +}
> +
>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
>  	return 0;
> @@ -290,6 +347,29 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  {
>  	return true;
>  }
> +
> +static inline long count_nr_deferred(struct shrinker *shrinker,
> +				     struct shrink_control *sc)
> +{
> +	int nid = sc->nid;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +}
> +
> +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> +				   struct shrink_control *sc)
> +{
> +	int nid = sc->nid;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	return atomic_long_add_return(nr,
> +				      &shrinker->nr_deferred[nid]);
> +}
>  #endif
>  
>  /*
> @@ -429,13 +509,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	long freeable;
>  	long nr;
>  	long new_nr;
> -	int nid = shrinkctl->nid;
>  	long batch_size = shrinker->batch ? shrinker->batch
>  					  : SHRINK_BATCH;
>  	long scanned = 0, next_deferred;
>  
> -	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> -		nid = 0;
>  
>  	freeable = shrinker->count_objects(shrinker, shrinkctl);
>  	if (freeable == 0 || freeable == SHRINK_EMPTY)
> @@ -446,7 +523,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * and zero it so that other concurrent shrinker invocations
>  	 * don't also do this scanning work.
>  	 */
> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +	nr = count_nr_deferred(shrinker, shrinkctl);
>  
>  	total_scan = nr;
>  	if (shrinker->seeks) {
> @@ -539,8 +616,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * move the unused scan count back into the shrinker in a
>  	 * manner that handles concurrent updates.
>  	 */
> -	new_nr = atomic_long_add_return(next_deferred,
> -					&shrinker->nr_deferred[nid]);
> +	new_nr = set_nr_deferred(next_deferred, shrinker, shrinkctl);

Ok, I think patch (1) can be just merged into this and then it would make total sense.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint
  2020-12-02 18:27 ` [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
@ 2020-12-03  3:13   ` Xiaqing (A)
  2020-12-11 19:20     ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Xiaqing (A) @ 2020-12-03  3:13 UTC (permalink / raw)
  To: Yang Shi, guro, ktkhai, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel, Liu Yi



On 2020/12/3 2:27, Yang Shi wrote:
> The tracepoint's nid should show which node the shrink happens on; the start tracepoint
> uses the nid from shrinkctl, but the nid might be set to 0 before the end tracepoint if the
> shrinker is not NUMA aware, so the tracing log may show the shrink starting on one
> node but ending up on another.  It seems confusing.  And since the following patch
> will remove using nid directly in do_shrink_slab(), this patch also helps clean up
> the code.
>
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>   mm/vmscan.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7d6186a07daf..457ce04eebf2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -533,7 +533,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>   	new_nr = atomic_long_add_return(next_deferred,
>   					&shrinker->nr_deferred[nid]);
>   
> -	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
> +	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);

Hi, Yang

When I read this patch, I wondered why you made this change until I read patch 6. Could you merge
this patch into patch 6?

Thanks!

>   	return freed;
>   }
>   


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-03  3:06   ` Roman Gushchin
@ 2020-12-03  4:54     ` Yang Shi
  2020-12-03 18:03       ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-03  4:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 2, 2020 at 7:06 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> >
> > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > may suffer from over shrink, excessive reclaim latency, etc.
> >
> > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> >
> > We observed this hit in our production environment which was running vfs heavy workload
> > shown as the below tracing log:
> >
> > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > cache items 246404277 delta 31345 total_scan 123202138
> > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > last shrinker return val 123186855
> >
> > The vfs cache and page cache ratio was 10:1 on this machine, and half of caches were dropped.
> > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> >
> > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > better isolation.
> >
> > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/memcontrol.h |   9 +++
> >  mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
> >  mm/vmscan.c                |   4 ++
> >  3 files changed, 123 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 922a7f600465..1b343b268359 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -92,6 +92,13 @@ struct lruvec_stat {
> >       long count[NR_VM_NODE_STAT_ITEMS];
> >  };
> >
> > +
> > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > +struct memcg_shrinker_deferred {
> > +     struct rcu_head rcu;
> > +     atomic_long_t nr_deferred[];
> > +};
>
> The idea makes total sense to me. But I wonder if we can add nr_deferred to
> struct list_lru_one, instead of adding another per-memcg per-shrinker entity?
> I guess it can simplify the code quite a lot. What do you think?

Aha, actually this is exactly what I did in the first place. But Dave
NAK'ed this approach. You can find the discussion at:
https://lore.kernel.org/linux-mm/20200930073152.GH12096@dread.disaster.area/.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-03  3:01   ` Roman Gushchin
@ 2020-12-03  4:59     ` Yang Shi
  2020-12-03 20:08       ` Roman Gushchin
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-03  4:59 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 2, 2020 at 7:01 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Dec 02, 2020 at 10:27:20AM -0800, Yang Shi wrote:
> > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > This approach is fine with nr_deferred at the shrinker level, but the following
> > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> > from unregistering correctly.
> >
> > Introduce a new "state" field to indicate if shrinker is registered or not.
> > We could use the highest bit of flags, but it may be a little bit complicated to
> > extract that bit, and the flags field is accessed frequently by vmscan (every time a shrinker
> > is called).  So add a new field in "struct shrinker"; we may waste a little bit of
> > memory, but it should be very little since there should not be too many registered
> > shrinkers on a normal system.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/shrinker.h |  4 ++++
> >  mm/vmscan.c              | 13 +++++++++----
> >  2 files changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 0f80123650e2..0bb5be88e41d 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -35,6 +35,9 @@ struct shrink_control {
> >
> >  #define SHRINK_STOP (~0UL)
> >  #define SHRINK_EMPTY (~0UL - 1)
> > +
> > +#define SHRINKER_REGISTERED  0x1
> > +
> >  /*
> >   * A callback you can register to apply pressure to ageable caches.
> >   *
> > @@ -66,6 +69,7 @@ struct shrinker {
> >       long batch;     /* reclaim batch size, 0 = default */
> >       int seeks;      /* seeks to recreate an obj */
> >       unsigned flags;
> > +     unsigned state;
>
> Hm, can't it be another flag? It seems like we have a plenty of free bits.

I thought about this too. But I was not convinced that mixing flags
with state is good practice. We may add more flags in the future, so
we may end up with something like:

flag
flag
flag
state
flag
flag
...

Maybe we could use the highest bit for state?
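
Something like the following, just to illustrate the idea (a sketch only; the
helper is made up and not part of the series):

/* Hypothetical: reuse the top bit of ->flags for the registered state. */
#define SHRINKER_REGISTERED	(1U << 31)

static inline bool shrinker_is_registered(struct shrinker *shrinker)
{
	return shrinker->flags & SHRINKER_REGISTERED;
}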

>
> >
> >       /* These are for internal use */
> >       struct list_head list;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 457ce04eebf2..0d628299e55c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -378,6 +378,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> >       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> >               idr_replace(&shrinker_idr, shrinker, shrinker->id);
> >  #endif
> > +     shrinker->state |= SHRINKER_REGISTERED;
> >       up_write(&shrinker_rwsem);
> >  }
> >
> > @@ -397,13 +398,17 @@ EXPORT_SYMBOL(register_shrinker);
> >   */
> >  void unregister_shrinker(struct shrinker *shrinker)
> >  {
> > -     if (!shrinker->nr_deferred)
> > -             return;
> > -     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > -             unregister_memcg_shrinker(shrinker);
> >       down_write(&shrinker_rwsem);
> > +     if (!shrinker->state) {
> > +             up_write(&shrinker_rwsem);
> > +             return;
> > +     }
> >       list_del(&shrinker->list);
> > +     shrinker->state &= ~SHRINKER_REGISTERED;
> >       up_write(&shrinker_rwsem);
> > +
> > +     if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > +             unregister_memcg_shrinker(shrinker);
> >       kfree(shrinker->nr_deferred);
> >       shrinker->nr_deferred = NULL;
> >  }
> > --
> > 2.26.2
> >

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-03  3:08   ` Roman Gushchin
@ 2020-12-03  5:01     ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-03  5:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 2, 2020 at 7:08 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Dec 02, 2020 at 10:27:22AM -0800, Yang Shi wrote:
> > Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> > will be used in the following cases:
> >     1. Non memcg aware shrinkers
> >     2. !CONFIG_MEMCG
> >     3. memcg is disabled by boot parameter
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 82 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cba0bc8d4661..d569fdcaba79 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  static DEFINE_IDR(shrinker_idr);
> >  static int shrinker_nr_max;
> >
> > +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> > +{
> > +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> > +             !mem_cgroup_disabled();
> > +}
> > +
> >  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  {
> >       int id, ret = -ENOMEM;
> > @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  #endif
> >       return false;
> >  }
> > +
> > +static inline long count_nr_deferred(struct shrinker *shrinker,
> > +                                  struct shrink_control *sc)
> > +{
> > +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> > +     struct memcg_shrinker_deferred *deferred;
> > +     struct mem_cgroup *memcg = sc->memcg;
> > +     int nid = sc->nid;
> > +     int id = shrinker->id;
> > +     long nr;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     if (per_memcg_deferred) {
> > +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> > +                                                  true);
> > +             nr = atomic_long_xchg(&deferred->nr_deferred[id], 0);
> > +     } else
> > +             nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +
> > +     return nr;
> > +}
> > +
> > +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> > +                                struct shrink_control *sc)
> > +{
> > +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> > +     struct memcg_shrinker_deferred *deferred;
> > +     struct mem_cgroup *memcg = sc->memcg;
> > +     int nid = sc->nid;
> > +     int id = shrinker->id;
> > +     long new_nr;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     if (per_memcg_deferred) {
> > +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> > +                                                  true);
> > +             new_nr = atomic_long_add_return(nr, &deferred->nr_deferred[id]);
> > +     } else
> > +             new_nr = atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
> > +
> > +     return new_nr;
> > +}
> >  #else
> > +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> > +{
> > +     return false;
> > +}
> > +
> >  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  {
> >       return 0;
> > @@ -290,6 +347,29 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  {
> >       return true;
> >  }
> > +
> > +static inline long count_nr_deferred(struct shrinker *shrinker,
> > +                                  struct shrink_control *sc)
> > +{
> > +     int nid = sc->nid;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +}
> > +
> > +static inline long set_nr_deferred(long nr, struct shrinker *shrinker,
> > +                                struct shrink_control *sc)
> > +{
> > +     int nid = sc->nid;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     return atomic_long_add_return(nr,
> > +                                   &shrinker->nr_deferred[nid]);
> > +}
> >  #endif
> >
> >  /*
> > @@ -429,13 +509,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >       long freeable;
> >       long nr;
> >       long new_nr;
> > -     int nid = shrinkctl->nid;
> >       long batch_size = shrinker->batch ? shrinker->batch
> >                                         : SHRINK_BATCH;
> >       long scanned = 0, next_deferred;
> >
> > -     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > -             nid = 0;
> >
> >       freeable = shrinker->count_objects(shrinker, shrinkctl);
> >       if (freeable == 0 || freeable == SHRINK_EMPTY)
> > @@ -446,7 +523,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >        * and zero it so that other concurrent shrinker invocations
> >        * don't also do this scanning work.
> >        */
> > -     nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> > +     nr = count_nr_deferred(shrinker, shrinkctl);
> >
> >       total_scan = nr;
> >       if (shrinker->seeks) {
> > @@ -539,8 +616,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >        * move the unused scan count back into the shrinker in a
> >        * manner that handles concurrent updates.
> >        */
> > -     new_nr = atomic_long_add_return(next_deferred,
> > -                                     &shrinker->nr_deferred[nid]);
> > +     new_nr = set_nr_deferred(next_deferred, shrinker, shrinkctl);
>
> Ok, I think patch (1) can be just merged into this and then it would make total sense.

Sure. Makes sense to me.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-02 18:27 ` [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
  2020-12-03  3:08   ` Roman Gushchin
@ 2020-12-03 11:40   ` Kirill Tkhai
  2020-12-08 17:13     ` Yang Shi
  1 sibling, 1 reply; 41+ messages in thread
From: Kirill Tkhai @ 2020-12-03 11:40 UTC (permalink / raw)
  To: Yang Shi, guro, shakeelb, david, hannes, mhocko, akpm
  Cc: linux-mm, linux-fsdevel, linux-kernel

On 02.12.2020 21:27, Yang Shi wrote:
> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> will be used in the following cases:
>     1. Non memcg aware shrinkers
>     2. !CONFIG_MEMCG
>     3. memcg is disabled by boot parameter
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cba0bc8d4661..d569fdcaba79 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  static DEFINE_IDR(shrinker_idr);
>  static int shrinker_nr_max;
>  
> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> +{
> +	return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> +		!mem_cgroup_disabled();
> +}
> +
>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
>  	int id, ret = -ENOMEM;
> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  #endif
>  	return false;
>  }
> +
> +static inline long count_nr_deferred(struct shrinker *shrinker,
> +				     struct shrink_control *sc)
> +{
> +	bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> +	struct memcg_shrinker_deferred *deferred;
> +	struct mem_cgroup *memcg = sc->memcg;
> +	int nid = sc->nid;
> +	int id = shrinker->id;
> +	long nr;
> +
> +	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> +		nid = 0;
> +
> +	if (per_memcg_deferred) {
> +		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> +						     true);

My comment applies to both patches 5/9 and 6/9.

shrink_slab_memcg() races with mem_cgroup_css_online(). Visibility of the CSS_ONLINE flag
in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that you will see
memcg->nodeinfo[nid]->shrinker_deferred != NULL in count_nr_deferred(). This may occur
because of processor reordering on !x86 (there is no common lock or memory barrier here).

Regarding shrinker_map, this is not a problem due to the map check in shrink_slab_memcg():
the map can't be NULL there.

Regarding shrinker_deferred, you should either prove that this is not a problem too,
add proper synchronization (maybe based on barriers), or add a similar check
(maybe in shrink_slab_memcg() too).
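
For example, the read side could add such a NULL check (a sketch only, to
illustrate the option; the helper name is made up):

/*
 * Sketch: rcu_assign_pointer() on the allocation side publishes the array,
 * so a NULL check after the dereference lets the reader simply skip a memcg
 * whose shrinker_deferred is not visible yet.
 */
static long memcg_xchg_nr_deferred(struct mem_cgroup *memcg, int nid, int id)
{
	struct memcg_shrinker_deferred *deferred;

	deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
					     true);
	if (unlikely(!deferred))
		return 0;	/* not published yet, nothing deferred to pick up */

	return atomic_long_xchg(&deferred->nr_deferred[id], 0);
}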

Kirill

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware
  2020-12-03  2:52 ` [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Roman Gushchin
@ 2020-12-03 17:52   ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-03 17:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 2, 2020 at 6:52 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Dec 02, 2020 at 10:27:16AM -0800, Yang Shi wrote:
> >
> > Recently huge amount one-off slab drop was seen on some vfs metadata heavy workloads,
> > it turned out there were huge amount accumulated nr_deferred objects seen by the
> > shrinker.
> >
> > On our production machine, I saw absurd number of nr_deferred shown as the below
> > tracing result:
> >
> > <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> > super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> > 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> > 9300 cache items 1667 delta 11 total_scan 833
> >
> > There are 2.5 trillion deferred objects on one node, assuming all of them
> > are dentry (192 bytes per object), so the total size of deferred on
> > one node is ~480TB. It is definitely ridiculous.
> >
> > I managed to reproduce this problem with kernel build workload plus negative dentry
> > generator.
> >
> > First step, run the below kernel build test script:
> >
> > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> >
> > cd /root/Buildarea/linux-stable
> >
> > for i in `seq 1500`; do
> >         cgcreate -g memory:kern_build
> >         echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
> >
> >         echo 3 > /proc/sys/vm/drop_caches
> >         cgexec -g memory:kern_build make clean > /dev/null 2>&1
> >         cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
> >
> >         cgdelete -g memory:kern_build
> > done
> >
> > Then run the below negative dentry generator script:
> >
> > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> >
> > mkdir /sys/fs/cgroup/memory/test
> > echo $$ > /sys/fs/cgroup/memory/test/tasks
> >
> > for i in `seq $NR_CPUS`; do
> >         while true; do
> >                 FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
> >                 cat $FILE 2>/dev/null
> >         done &
> > done
> >
> > Then kswapd will shrink half of dentry cache in just one loop as the below tracing result
> > showed:
> >
> >       kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> > objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> >       kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> > scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
> >
> > There were huge number of deferred objects before the shrinker was called, the behavior
> > does match the code but it might be not desirable from the user's stand of point.
> >
> > The excessive amount of nr_deferred might be accumulated due to various reasons, for example:
> >     * GFP_NOFS allocation
> >     * Significant times of small amount scan (< scan_batch, 1024 for vfs metadata)
> >
> > However the LRUs of slabs are per memcg (memcg-aware shrinkers) but the deferred objects
> > is per shrinker, this may have some bad effects:
> >     * Poor isolation among memcgs. Some memcgs which happen to have frequent limit
> >       reclaim may get nr_deferred accumulated to a huge number, then other innocent
> >       memcgs may take the fall. In our case the main workload was hit.
> >     * Unbounded deferred objects. There is no cap for deferred objects, it can outgrow
> >       ridiculously as the tracing result showed.
> >     * Easy to get out of control. Although shrinkers take into account deferred objects,
> >       but it can go out of control easily. One misconfigured memcg could incur absurd
> >       amount of deferred objects in a period of time.
> >     * Sort of reclaim problems, i.e. over reclaim, long reclaim latency, etc. There may be
> >       hundred GB slab caches for vfe metadata heavy workload, shrink half of them may take
> >       minutes. We observed latency spike due to the prolonged reclaim.
> >
> > These issues also have been discussed in https://lore.kernel.org/linux-mm/20200916185823.5347-1-shy828301@gmail.com/.
> > The patchset is the outcome of that discussion.
> >
> > So this patchset makes nr_deferred per-memcg to tackle the problem. It does:
> >     * Have memcg_shrinker_deferred per memcg per node, just like what shrinker_map
> >       does. Instead it is an atomic_long_t array, each element represent one shrinker
> >       even though the shrinker is not memcg aware, this simplifies the implementation.
> >       For memcg aware shrinkers, the deferred objects are just accumulated to its own
> >       memcg. The shrinkers just see nr_deferred from its own memcg. Non memcg aware
> >       shrinkers still use global nr_deferred from struct shrinker.
> >     * Once the memcg is offlined, its nr_deferred will be reparented to its parent along
> >       with LRUs.
> >     * The root memcg has memcg_shrinker_deferred array too. It simplifies the handling of
> >       reparenting to root memcg.
> >     * Cap nr_deferred to 2x of the length of lru. The idea is borrowed from Dave Chinner's
> >       series (https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/)
> >
> > The downside is each memcg has to allocate extra memory to store the nr_deferred array.
> > On our production environment, there are typically around 40 shrinkers, so each memcg
> > needs ~320 bytes. 10K memcgs would need ~3.2MB memory. It seems fine.
> >
> > We have been running the patched kernel on some hosts of our fleet (test and production) for
> > months, it works very well. The monitor data shows the working set is sustained as expected.
>
> Hello Yang!
>
> The rationale is very well described and makes perfect sense to me.
> I fully support the idea to make nr_deferred per-memcg.
> Thank you for such a detailed description!
>
> More comments in individual patches.

Thank you very much.

>
> Thanks!
>
> >
> > Yang Shi (9):
> >       mm: vmscan: simplify nr_deferred update code
> >       mm: vmscan: use nid from shrink_control for tracepoint
> >       mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex
> >       mm: vmscan: use a new flag to indicate shrinker is registered
> >       mm: memcontrol: add per memcg shrinker nr_deferred
> >       mm: vmscan: use per memcg nr_deferred of shrinker
> >       mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
> >       mm: memcontrol: reparent nr_deferred when memcg offline
> >       mm: vmscan: shrink deferred objects proportional to priority
> >
> >  include/linux/memcontrol.h |   9 +++++
> >  include/linux/shrinker.h   |   8 ++++
> >  mm/memcontrol.c            | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  mm/vmscan.c                | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
> >  4 files changed, 274 insertions(+), 74 deletions(-)
> >

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-03  4:54     ` Yang Shi
@ 2020-12-03 18:03       ` Yang Shi
  2020-12-03 20:07         ` Roman Gushchin
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-03 18:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 2, 2020 at 8:54 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Dec 2, 2020 at 7:06 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> > >
> > > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > > may suffer from over shrink, excessive reclaim latency, etc.
> > >
> > > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> > >
> > > We observed this hit in our production environment which was running vfs heavy workload
> > > shown as the below tracing log:
> > >
> > > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > cache items 246404277 delta 31345 total_scan 123202138
> > > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > > last shrinker return val 123186855
> > >
> > > The vfs cache and page cache ratio was 10:1 on this machine, and half of caches were dropped.
> > > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> > >
> > > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > > better isolation.
> > >
> > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  include/linux/memcontrol.h |   9 +++
> > >  mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
> > >  mm/vmscan.c                |   4 ++
> > >  3 files changed, 123 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 922a7f600465..1b343b268359 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -92,6 +92,13 @@ struct lruvec_stat {
> > >       long count[NR_VM_NODE_STAT_ITEMS];
> > >  };
> > >
> > > +
> > > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > > +struct memcg_shrinker_deferred {
> > > +     struct rcu_head rcu;
> > > +     atomic_long_t nr_deferred[];
> > > +};
> >
> > The idea makes total sense to me. But I wonder if we can add nr_deferred to
> > struct list_lru_one, instead of adding another per-memcg per-shrinker entity?
> > I guess it can simplify the code quite a lot. What do you think?
>
> Aha, actually this exactly was what I did at the first place. But Dave
> NAK'ed this approach. You can find the discussion at:
> https://lore.kernel.org/linux-mm/20200930073152.GH12096@dread.disaster.area/.

I did prototypes for both approaches (moving nr_deferred to list_lru or
to memcg). I preferred the list_lru approach in the first place, but
Dave's opinion does make perfect sense to me, so I dropped the
list_lru one. That email elaborates on why moving nr_deferred to
list_lru is not appropriate.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-03 18:03       ` Yang Shi
@ 2020-12-03 20:07         ` Roman Gushchin
  2020-12-03 22:49           ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03 20:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 03, 2020 at 10:03:44AM -0800, Yang Shi wrote:
> On Wed, Dec 2, 2020 at 8:54 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Dec 2, 2020 at 7:06 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > > > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > > > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> > > >
> > > > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > > > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > > > may suffer from over shrink, excessive reclaim latency, etc.
> > > >
> > > > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > > > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > > > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> > > >
> > > > We observed this hit in our production environment which was running vfs heavy workload
> > > > shown as the below tracing log:
> > > >
> > > > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > > cache items 246404277 delta 31345 total_scan 123202138
> > > > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > > > last shrinker return val 123186855
> > > >
> > > > The vfs cache and page cache ratio was 10:1 on this machine, and half of caches were dropped.
> > > > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> > > >
> > > > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > > > better isolation.
> > > >
> > > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > > > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> > > >
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > > >  include/linux/memcontrol.h |   9 +++
> > > >  mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
> > > >  mm/vmscan.c                |   4 ++
> > > >  3 files changed, 123 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index 922a7f600465..1b343b268359 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -92,6 +92,13 @@ struct lruvec_stat {
> > > >       long count[NR_VM_NODE_STAT_ITEMS];
> > > >  };
> > > >
> > > > +
> > > > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > > > +struct memcg_shrinker_deferred {
> > > > +     struct rcu_head rcu;
> > > > +     atomic_long_t nr_deferred[];
> > > > +};
> > >
> > > The idea makes total sense to me. But I wonder if we can add nr_deferred to
> > > struct list_lru_one, instead of adding another per-memcg per-shrinker entity?
> > > I guess it can simplify the code quite a lot. What do you think?
> >
> > Aha, actually this exactly was what I did at the first place. But Dave
> > NAK'ed this approach. You can find the discussion at:
> > https://lore.kernel.org/linux-mm/20200930073152.GH12096@dread.disaster.area/.

Yes, this makes sense to me. Thank you for the link!

> 
> I did prototypes for both approaches (move nr_deferred to list_lru or
> to memcg). I preferred the list_lru approach at the first place. But
> Dave's opinion does make perfect sense to me. So I dropped that
> list_lru one. That email elaborated why moving nr_deferred to list_lru
> is not appropriate.

Hm, shouldn't we move list_lru to memcg then? It's not directly related
to your patchset, but maybe it's something we should consider in the future.

What worries me is that with your patchset we'll have 3 separate
per-memcg (per-node) per-shrinker entities, each with a slightly different
approach to allocate/extend/reparent/release. So it begs for some
unification. I don't think it's a showstopper for your work though, it
can be done later.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-03  4:59     ` Yang Shi
@ 2020-12-03 20:08       ` Roman Gushchin
  2020-12-03 22:25         ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03 20:08 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 02, 2020 at 08:59:40PM -0800, Yang Shi wrote:
> On Wed, Dec 2, 2020 at 7:01 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Dec 02, 2020 at 10:27:20AM -0800, Yang Shi wrote:
> > > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > > This approach is fine with nr_deferred at the shrinker level, but the following
> > > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > > shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> > > from unregistering correctly.
> > >
> > > Introduce a new "state" field to indicate if shrinker is registered or not.
> > > We could use the highest bit of flags, but it may be a little bit complicated to
> > > extract that bit and the flags is accessed frequently by vmscan (every time shrinker
> > > is called).  So add a new field in "struct shrinker", we may waster a little bit
> > > memory, but it should be very few since there should be not too many registered
> > > shrinkers on a normal system.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  include/linux/shrinker.h |  4 ++++
> > >  mm/vmscan.c              | 13 +++++++++----
> > >  2 files changed, 13 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > index 0f80123650e2..0bb5be88e41d 100644
> > > --- a/include/linux/shrinker.h
> > > +++ b/include/linux/shrinker.h
> > > @@ -35,6 +35,9 @@ struct shrink_control {
> > >
> > >  #define SHRINK_STOP (~0UL)
> > >  #define SHRINK_EMPTY (~0UL - 1)
> > > +
> > > +#define SHRINKER_REGISTERED  0x1
> > > +
> > >  /*
> > >   * A callback you can register to apply pressure to ageable caches.
> > >   *
> > > @@ -66,6 +69,7 @@ struct shrinker {
> > >       long batch;     /* reclaim batch size, 0 = default */
> > >       int seeks;      /* seeks to recreate an obj */
> > >       unsigned flags;
> > > +     unsigned state;
> >
> > Hm, can't it be another flag? It seems like we have a plenty of free bits.
> 
> I thought about this too. But I was not convinced by myself that
> messing flags with state is a good practice. We may add more flags in
> the future, so we may end up having something like:
> 
> flag
> flag
> flag
> state
> flag
> flag
> ...
> 
> Maybe we could use the highest bit for state?

Or just
state
flag
flag
flag
flag
flag
...

?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-03 20:08       ` Roman Gushchin
@ 2020-12-03 22:25         ` Yang Shi
  2020-12-04 18:52           ` Johannes Weiner
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-03 22:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 3, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Dec 02, 2020 at 08:59:40PM -0800, Yang Shi wrote:
> > On Wed, Dec 2, 2020 at 7:01 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Wed, Dec 02, 2020 at 10:27:20AM -0800, Yang Shi wrote:
> > > > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > > > This approach is fine with nr_deferred at the shrinker level, but the following
> > > > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > > > shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> > > > from unregistering correctly.
> > > >
> > > > Introduce a new "state" field to indicate if shrinker is registered or not.
> > > > We could use the highest bit of flags, but it may be a little bit complicated to
> > > > extract that bit and the flags is accessed frequently by vmscan (every time shrinker
> > > > is called).  So add a new field in "struct shrinker", we may waster a little bit
> > > > memory, but it should be very few since there should be not too many registered
> > > > shrinkers on a normal system.
> > > >
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > > >  include/linux/shrinker.h |  4 ++++
> > > >  mm/vmscan.c              | 13 +++++++++----
> > > >  2 files changed, 13 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > > index 0f80123650e2..0bb5be88e41d 100644
> > > > --- a/include/linux/shrinker.h
> > > > +++ b/include/linux/shrinker.h
> > > > @@ -35,6 +35,9 @@ struct shrink_control {
> > > >
> > > >  #define SHRINK_STOP (~0UL)
> > > >  #define SHRINK_EMPTY (~0UL - 1)
> > > > +
> > > > +#define SHRINKER_REGISTERED  0x1
> > > > +
> > > >  /*
> > > >   * A callback you can register to apply pressure to ageable caches.
> > > >   *
> > > > @@ -66,6 +69,7 @@ struct shrinker {
> > > >       long batch;     /* reclaim batch size, 0 = default */
> > > >       int seeks;      /* seeks to recreate an obj */
> > > >       unsigned flags;
> > > > +     unsigned state;
> > >
> > > Hm, can't it be another flag? It seems like we have a plenty of free bits.
> >
> > I thought about this too. But I was not convinced by myself that
> > messing flags with state is a good practice. We may add more flags in
> > the future, so we may end up having something like:
> >
> > flag
> > flag
> > flag
> > state
> > flag
> > flag
> > ...
> >
> > Maybe we could use the highest bit for state?
>
> Or just
> state
> flag
> flag
> flag
> flag
> flag
> ...
>
> ?

That is fine too. We should not need to add more states in the foreseeable future.
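
For example, the layout could look like this (illustrative only; it would mean
renumbering the existing flag bits):

/* Hypothetical bit layout with the registered state first. */
#define SHRINKER_REGISTERED	(1U << 0)
#define SHRINKER_NUMA_AWARE	(1U << 1)
#define SHRINKER_MEMCG_AWARE	(1U << 2)
#define SHRINKER_NONSLAB	(1U << 3)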

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-03 20:07         ` Roman Gushchin
@ 2020-12-03 22:49           ` Yang Shi
  2020-12-03 23:30             ` Roman Gushchin
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-03 22:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 3, 2020 at 12:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Dec 03, 2020 at 10:03:44AM -0800, Yang Shi wrote:
> > On Wed, Dec 2, 2020 at 8:54 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Wed, Dec 2, 2020 at 7:06 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > > > > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > > > > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> > > > >
> > > > > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > > > > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > > > > may suffer from over shrink, excessive reclaim latency, etc.
> > > > >
> > > > > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > > > > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > > > > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> > > > >
> > > > > We observed this hit in our production environment which was running vfs heavy workload
> > > > > shown as the below tracing log:
> > > > >
> > > > > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > > > cache items 246404277 delta 31345 total_scan 123202138
> > > > > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > > > > last shrinker return val 123186855
> > > > >
> > > > > The vfs cache and page cache ratio was 10:1 on this machine, and half of caches were dropped.
> > > > > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> > > > >
> > > > > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > > > > better isolation.
> > > > >
> > > > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > > > > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> > > > >
> > > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > > ---
> > > > >  include/linux/memcontrol.h |   9 +++
> > > > >  mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
> > > > >  mm/vmscan.c                |   4 ++
> > > > >  3 files changed, 123 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > index 922a7f600465..1b343b268359 100644
> > > > > --- a/include/linux/memcontrol.h
> > > > > +++ b/include/linux/memcontrol.h
> > > > > @@ -92,6 +92,13 @@ struct lruvec_stat {
> > > > >       long count[NR_VM_NODE_STAT_ITEMS];
> > > > >  };
> > > > >
> > > > > +
> > > > > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > > > > +struct memcg_shrinker_deferred {
> > > > > +     struct rcu_head rcu;
> > > > > +     atomic_long_t nr_deferred[];
> > > > > +};
> > > >
> > > > The idea makes total sense to me. But I wonder if we can add nr_deferred to
> > > > struct list_lru_one, instead of adding another per-memcg per-shrinker entity?
> > > > I guess it can simplify the code quite a lot. What do you think?
> > >
> > > Aha, actually this exactly was what I did at the first place. But Dave
> > > NAK'ed this approach. You can find the discussion at:
> > > https://lore.kernel.org/linux-mm/20200930073152.GH12096@dread.disaster.area/.
>
> Yes, this makes sense for me. Thank you for the link!
>
> >
> > I did prototypes for both approaches (move nr_deferred to list_lru or
> > to memcg). I preferred the list_lru approach at the first place. But
> > Dave's opinion does make perfect sense to me. So I dropped that
> > list_lru one. That email elaborated why moving nr_deferred to list_lru
> > is not appropriate.
>
> Hm, shouldn't we move list_lru to memcg then? It's not directly related
> to your patchset, but maybe it's something we should consider in the future.

I haven't thought about this yet. I agree we could look into it
further later on.

>
> What worries me is that with your patchset we'll have 3 separate
> per-memcg (per-node) per-shrinker entity, each with slightly different
> approach to allocate/extend/reparent/release. So it begs for some
> unification. I don't think it's a showstopper for your work though, it
> can be done later.

Off the top of my head, we may be able to have a shrinker_info struct;
it could look like:

struct shrinker_info {
    atomic_long_t nr_deferred;
    /* Just one bit is used now */
    u8 map:1;
}

struct memcg_shrinker_info {
    struct rcu_head rcu;
    /* Indexed by shrinker ID */
    struct shrinker_info info[];
}

Then in struct mem_cgroup_per_node, we could have:

struct mem_cgroup_per_node {
    ....
    struct memcg_shrinker_info __rcu *shrinker_info;
    ....
}

In this way shrinker_info would be allocated for all memcgs, including
root, but the shrinker could ignore root's map bit. We may waste a little
bit of memory, but we get unification.

Would that work?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-03 22:49           ` Yang Shi
@ 2020-12-03 23:30             ` Roman Gushchin
  2020-12-04  0:22               ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Roman Gushchin @ 2020-12-03 23:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 03, 2020 at 02:49:00PM -0800, Yang Shi wrote:
> On Thu, Dec 3, 2020 at 12:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Thu, Dec 03, 2020 at 10:03:44AM -0800, Yang Shi wrote:
> > > On Wed, Dec 2, 2020 at 8:54 PM Yang Shi <shy828301@gmail.com> wrote:
> > > >
> > > > On Wed, Dec 2, 2020 at 7:06 PM Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > > > > > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > > > > > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> > > > > >
> > > > > > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > > > > > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > > > > > may suffer from over shrink, excessive reclaim latency, etc.
> > > > > >
> > > > > > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > > > > > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > > > > > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> > > > > >
> > > > > > We observed this hit in our production environment which was running vfs heavy workload
> > > > > > shown as the below tracing log:
> > > > > >
> > > > > > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > > > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > > > > cache items 246404277 delta 31345 total_scan 123202138
> > > > > > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > > > > > last shrinker return val 123186855
> > > > > >
> > > > > > The vfs cache and page cache ratio was 10:1 on this machine, and half of caches were dropped.
> > > > > > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> > > > > >
> > > > > > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > > > > > better isolation.
> > > > > >
> > > > > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > > > > > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> > > > > >
> > > > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > > > ---
> > > > > >  include/linux/memcontrol.h |   9 +++
> > > > > >  mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
> > > > > >  mm/vmscan.c                |   4 ++
> > > > > >  3 files changed, 123 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > index 922a7f600465..1b343b268359 100644
> > > > > > --- a/include/linux/memcontrol.h
> > > > > > +++ b/include/linux/memcontrol.h
> > > > > > @@ -92,6 +92,13 @@ struct lruvec_stat {
> > > > > >       long count[NR_VM_NODE_STAT_ITEMS];
> > > > > >  };
> > > > > >
> > > > > > +
> > > > > > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > > > > > +struct memcg_shrinker_deferred {
> > > > > > +     struct rcu_head rcu;
> > > > > > +     atomic_long_t nr_deferred[];
> > > > > > +};
> > > > >
> > > > > The idea makes total sense to me. But I wonder if we can add nr_deferred to
> > > > > struct list_lru_one, instead of adding another per-memcg per-shrinker entity?
> > > > > I guess it can simplify the code quite a lot. What do you think?
> > > >
> > > > Aha, actually this exactly was what I did at the first place. But Dave
> > > > NAK'ed this approach. You can find the discussion at:
> > > > https://lore.kernel.org/linux-mm/20200930073152.GH12096@dread.disaster.area/.
> >
> > Yes, this makes sense for me. Thank you for the link!
> >
> > >
> > > I did prototypes for both approaches (move nr_deferred to list_lru or
> > > to memcg). I preferred the list_lru approach at the first place. But
> > > Dave's opinion does make perfect sense to me. So I dropped that
> > > list_lru one. That email elaborated why moving nr_deferred to list_lru
> > > is not appropriate.
> >
> > Hm, shouldn't we move list_lru to memcg then? It's not directly related
> > to your patchset, but maybe it's something we should consider in the future.
> 
> I haven't thought about this yet. I agree we could look into it
> further later on.
> 
> >
> > What worries me is that with your patchset we'll have 3 separate
> > per-memcg (per-node) per-shrinker entity, each with slightly different
> > approach to allocate/extend/reparent/release. So it begs for some
> > unification. I don't think it's a showstopper for your work though, it
> > can be done later.
> 
> Off the top of my head, we may be able to have shrinker_info struct,
> it should look like:
> 
> struct shrinker_info {
>     atomic_long_t nr_deferred;
>     /* Just one bit is used now */
>     u8 map:1;
> }
> 
> struct memcg_shrinker_info {
>     struct rcu_head rcu;
>     /* Indexed by shrinker ID */
>     struct shrinker_info info[];
> }
> 
> Then in struct mem_cgroup_per_node, we could have:
> 
> struct mem_cgroup_per_node {
>     ....
>     struct memcg_shrinker_info __rcu *shrinker_info;
>     ....
> }
> 
> In this way shrinker_info should be allocated to all memcgs, including
> root. But shrinker could ignore root's map bit. We may waste a little
> bit memory, but we get unification.
> 
> Would that work?

Hm, not exactly, then you'll lose the ability to iterate with
	for_each_set_bit(i, map->map, shrinker_nr_max)...
But you can probably do something like:

struct shrinker_info {
   atomic_long_t nr_deferred;

   struct list_lru_one[]; /* optional, depends on the shrinker implementation */
};

struct memcg_shrinker_info {
    /* Indexed by shrinker ID */
    unsigned long *map[];
    struct shrinker_info *shrinker_info[];
}

Then you'll be able to allocate individual shrinker_info structures on-demand.

But, please, take this all with a grain of salt, I didn't check if it's all really
possible or there are some obstacles.

Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-03 23:30             ` Roman Gushchin
@ 2020-12-04  0:22               ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-04  0:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kirill Tkhai, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 3, 2020 at 3:31 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Dec 03, 2020 at 02:49:00PM -0800, Yang Shi wrote:
> > On Thu, Dec 3, 2020 at 12:07 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Thu, Dec 03, 2020 at 10:03:44AM -0800, Yang Shi wrote:
> > > > On Wed, Dec 2, 2020 at 8:54 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Wed, Dec 2, 2020 at 7:06 PM Roman Gushchin <guro@fb.com> wrote:
> > > > > >
> > > > > > On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > > > > > > Currently the number of deferred objects are per shrinker, but some slabs, for example,
> > > > > > > vfs inode/dentry cache are per memcg, this would result in poor isolation among memcgs.
> > > > > > >
> > > > > > > The deferred objects typically are generated by __GFP_NOFS allocations, one memcg with
> > > > > > > excessive __GFP_NOFS allocations may blow up deferred objects, then other innocent memcgs
> > > > > > > may suffer from over shrink, excessive reclaim latency, etc.
> > > > > > >
> > > > > > > For example, two workloads run in memcgA and memcgB respectively, workload in B is vfs
> > > > > > > heavy workload.  Workload in A generates excessive deferred objects, then B's vfs cache
> > > > > > > might be hit heavily (drop half of caches) by B's limit reclaim or global reclaim.
> > > > > > >
> > > > > > > We observed this hit in our production environment which was running vfs heavy workload
> > > > > > > shown as the below tracing log:
> > > > > > >
> > > > > > > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > > > > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > > > > > cache items 246404277 delta 31345 total_scan 123202138
> > > > > > > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > > > > > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > > > > > > last shrinker return val 123186855
> > > > > > >
> > > > > > > The vfs cache and page cache ration was 10:1 on this machine, and half of caches were dropped.
> > > > > > > This also resulted in significant amount of page caches were dropped due to inodes eviction.
> > > > > > >
> > > > > > > Make nr_deferred per memcg for memcg aware shrinkers would solve the unfairness and bring
> > > > > > > better isolation.
> > > > > > >
> > > > > > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred
> > > > > > > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all the time.
> > > > > > >
> > > > > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > > > > ---
> > > > > > >  include/linux/memcontrol.h |   9 +++
> > > > > > >  mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
> > > > > > >  mm/vmscan.c                |   4 ++
> > > > > > >  3 files changed, 123 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > > index 922a7f600465..1b343b268359 100644
> > > > > > > --- a/include/linux/memcontrol.h
> > > > > > > +++ b/include/linux/memcontrol.h
> > > > > > > @@ -92,6 +92,13 @@ struct lruvec_stat {
> > > > > > >       long count[NR_VM_NODE_STAT_ITEMS];
> > > > > > >  };
> > > > > > >
> > > > > > > +
> > > > > > > +/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
> > > > > > > +struct memcg_shrinker_deferred {
> > > > > > > +     struct rcu_head rcu;
> > > > > > > +     atomic_long_t nr_deferred[];
> > > > > > > +};
> > > > > >
> > > > > > The idea makes total sense to me. But I wonder if we can add nr_deferred to
> > > > > > struct list_lru_one, instead of adding another per-memcg per-shrinker entity?
> > > > > > I guess it can simplify the code quite a lot. What do you think?
> > > > >
> > > > > Aha, actually this exactly was what I did at the first place. But Dave
> > > > > NAK'ed this approach. You can find the discussion at:
> > > > > https://lore.kernel.org/linux-mm/20200930073152.GH12096@dread.disaster.area/.
> > >
> > > Yes, this makes sense for me. Thank you for the link!
> > >
> > > >
> > > > I did prototypes for both approaches (move nr_deferred to list_lru or
> > > > to memcg). I preferred the list_lru approach at the first place. But
> > > > Dave's opinion does make perfect sense to me. So I dropped that
> > > > list_lru one. That email elaborated why moving nr_deferred to list_lru
> > > > is not appropriate.
> > >
> > > Hm, shouldn't we move list_lru to memcg then? It's not directly related
> > > to your patchset, but maybe it's something we should consider in the future.
> >
> > I haven't thought about this yet. I agree we could look into it
> > further later on.
> >
> > >
> > > What worries me is that with your patchset we'll have 3 separate
> > > per-memcg (per-node) per-shrinker entity, each with slightly different
> > > approach to allocate/extend/reparent/release. So it begs for some
> > > unification. I don't think it's a showstopper for your work though, it
> > > can be done later.
> >
> > Off the top of my head, we may be able to have shrinker_info struct,
> > it should look like:
> >
> > struct shrinker_info {
> >     atomic_long_t nr_deferred;
> >     /* Just one bit is used now */
> >     u8 map:1;
> > }
> >
> > struct memcg_shrinker_info {
> >     struct rcu_head rcu;
> >     /* Indexed by shrinker ID */
> >     struct shrinker_info info[];
> > }
> >
> > Then in struct mem_cgroup_per_node, we could have:
> >
> > struct mem_cgroup_per_node {
> >     ....
> >     struct memcg_shrinker_info __rcu *shrinker_info;
> >     ....
> > }
> >
> > In this way shrinker_info should be allocated to all memcgs, including
> > root. But shrinker could ignore root's map bit. We may waste a little
> > bit memory, but we get unification.
> >
> > Would that work?
>
> Hm, not exactly, then you'll an ability to iterate with
>         for_each_set_bit(i, map->map, shrinker_nr_max)...

Instead we could just iterate over each shrinker_info struct and check
whether its map is set.
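
For illustration, with the shrinker_info layout proposed earlier in this
thread, that iteration would look roughly like this (hypothetical sketch,
none of these structs exist yet):

	struct memcg_shrinker_info *info =
		rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info, true);
	int i;

	for (i = 0; i < shrinker_nr_max; i++) {
		/* the per-entry map bit replaces the separate bitmap */
		if (!info->info[i].map)
			continue;
		/* ... run do_shrink_slab() for shrinker id i ... */
	}

The trade-off versus for_each_set_bit() is touching one struct per
registered shrinker instead of scanning a packed bitmap.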

> But you can probably do something like:
>
> struct shrinker_info {
>    atomic_long_t nr_deferred;
>
>    struct list_lru_one[]; /* optional, depends on the shrinker implementation */
> };
>
> struct memcg_shrinker_info {
>     /* Indexed by shrinker ID */
>     unsigned long *map[];
>     struct shrinker_info *shrinker_info[];

Both map and shrinker_info have to be extendable, so they have to be
structs with rcu_head. So actually it is the same as having separate
shrinker_map and shrinker_deferred, just under one struct. I actually
tried this in my prototype, but gave it up since it didn't simplify
the code IMHO.

> }
>
> Then you'll be able to allocate individual shrinker_info structures on-demand.
>
> But, please, take this all with a grain of salt, I didn't check if it's all really
> possible or there are some obstacles.

Thanks a lot for all the kind suggestions. I agree with you that we could
revisit the unification later on. It seems there is no simple and
straightforward way to unify them at first glance. There might be
more devils in the details.
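
To make that concrete, a combined layout would still need to be an
RCU-replaced object per memcg/node, roughly like this (hypothetical
sketch, not part of this series):

struct memcg_shrinker_info {
	struct rcu_head rcu;
	/* both arrays are sized by shrinker_nr_max and resized together */
	atomic_long_t *nr_deferred;
	unsigned long *map;
};

Expanding it still means allocating a new object, copying both the map
and the deferred counters, and replacing the pointer under RCU, which is
essentially the same amount of code as handling shrinker_map and
shrinker_deferred separately.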

>
> Thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-03 22:25         ` Yang Shi
@ 2020-12-04 18:52           ` Johannes Weiner
  2020-12-04 21:24             ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2020-12-04 18:52 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 03, 2020 at 02:25:20PM -0800, Yang Shi wrote:
> On Thu, Dec 3, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Dec 02, 2020 at 08:59:40PM -0800, Yang Shi wrote:
> > > On Wed, Dec 2, 2020 at 7:01 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Wed, Dec 02, 2020 at 10:27:20AM -0800, Yang Shi wrote:
> > > > > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > > > > This approach is fine with nr_deferred atthe shrinker level, but the following
> > > > > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > > > > shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> > > > > from unregistering correctly.
> > > > >
> > > > > Introduce a new "state" field to indicate if shrinker is registered or not.
> > > > > We could use the highest bit of flags, but it may be a little bit complicated to
> > > > > extract that bit and the flags is accessed frequently by vmscan (every time shrinker
> > > > > is called).  So add a new field in "struct shrinker", we may waster a little bit
> > > > > memory, but it should be very few since there should be not too many registered
> > > > > shrinkers on a normal system.
> > > > >
> > > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > > ---
> > > > >  include/linux/shrinker.h |  4 ++++
> > > > >  mm/vmscan.c              | 13 +++++++++----
> > > > >  2 files changed, 13 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > > > index 0f80123650e2..0bb5be88e41d 100644
> > > > > --- a/include/linux/shrinker.h
> > > > > +++ b/include/linux/shrinker.h
> > > > > @@ -35,6 +35,9 @@ struct shrink_control {
> > > > >
> > > > >  #define SHRINK_STOP (~0UL)
> > > > >  #define SHRINK_EMPTY (~0UL - 1)
> > > > > +
> > > > > +#define SHRINKER_REGISTERED  0x1
> > > > > +
> > > > >  /*
> > > > >   * A callback you can register to apply pressure to ageable caches.
> > > > >   *
> > > > > @@ -66,6 +69,7 @@ struct shrinker {
> > > > >       long batch;     /* reclaim batch size, 0 = default */
> > > > >       int seeks;      /* seeks to recreate an obj */
> > > > >       unsigned flags;
> > > > > +     unsigned state;
> > > >
> > > > Hm, can't it be another flag? It seems like we have a plenty of free bits.
> > >
> > > I thought about this too. But I was not convinced by myself that
> > > messing flags with state is a good practice. We may add more flags in
> > > the future, so we may end up having something like:
> > >
> > > flag
> > > flag
> > > flag
> > > state
> > > flag
> > > flag
> > > ...
> > >
> > > Maybe we could use the highest bit for state?
> >
> > Or just
> > state
> > flag
> > flag
> > flag
> > flag
> > flag
> > ...
> >
> > ?
> 
> It is fine too. We should not add more states in foreseeable future.

It's always possible to shuffle things around for cleanup later on,
too. We don't have to provide binary compatibility for existing flags,
and changing a couple of adjacent bits isn't a big deal to keep things
neat. Or am I missing something?
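
For reference, the flag-bit variant being discussed would just be
something like this (a sketch, not the posted patch):

#define SHRINKER_REGISTERED	(1 << 3)	/* whichever bit is free; can be reshuffled later */

	/* set under shrinker_rwsem(write) in register_shrinker_prepared() */
	shrinker->flags |= SHRINKER_REGISTERED;

	/* and tested on unregister */
	if (!(shrinker->flags & SHRINKER_REGISTERED))
		return;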

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered
  2020-12-04 18:52           ` Johannes Weiner
@ 2020-12-04 21:24             ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-04 21:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Fri, Dec 4, 2020 at 10:54 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Dec 03, 2020 at 02:25:20PM -0800, Yang Shi wrote:
> > On Thu, Dec 3, 2020 at 12:09 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Wed, Dec 02, 2020 at 08:59:40PM -0800, Yang Shi wrote:
> > > > On Wed, Dec 2, 2020 at 7:01 PM Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > On Wed, Dec 02, 2020 at 10:27:20AM -0800, Yang Shi wrote:
> > > > > > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > > > > > This approach is fine with nr_deferred atthe shrinker level, but the following
> > > > > > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > > > > > shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> > > > > > from unregistering correctly.
> > > > > >
> > > > > > Introduce a new "state" field to indicate if shrinker is registered or not.
> > > > > > We could use the highest bit of flags, but it may be a little bit complicated to
> > > > > > extract that bit and the flags is accessed frequently by vmscan (every time shrinker
> > > > > > is called).  So add a new field in "struct shrinker", we may waster a little bit
> > > > > > memory, but it should be very few since there should be not too many registered
> > > > > > shrinkers on a normal system.
> > > > > >
> > > > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > > > ---
> > > > > >  include/linux/shrinker.h |  4 ++++
> > > > > >  mm/vmscan.c              | 13 +++++++++----
> > > > > >  2 files changed, 13 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > > > > index 0f80123650e2..0bb5be88e41d 100644
> > > > > > --- a/include/linux/shrinker.h
> > > > > > +++ b/include/linux/shrinker.h
> > > > > > @@ -35,6 +35,9 @@ struct shrink_control {
> > > > > >
> > > > > >  #define SHRINK_STOP (~0UL)
> > > > > >  #define SHRINK_EMPTY (~0UL - 1)
> > > > > > +
> > > > > > +#define SHRINKER_REGISTERED  0x1
> > > > > > +
> > > > > >  /*
> > > > > >   * A callback you can register to apply pressure to ageable caches.
> > > > > >   *
> > > > > > @@ -66,6 +69,7 @@ struct shrinker {
> > > > > >       long batch;     /* reclaim batch size, 0 = default */
> > > > > >       int seeks;      /* seeks to recreate an obj */
> > > > > >       unsigned flags;
> > > > > > +     unsigned state;
> > > > >
> > > > > Hm, can't it be another flag? It seems like we have a plenty of free bits.
> > > >
> > > > I thought about this too. But I was not convinced by myself that
> > > > messing flags with state is a good practice. We may add more flags in
> > > > the future, so we may end up having something like:
> > > >
> > > > flag
> > > > flag
> > > > flag
> > > > state
> > > > flag
> > > > flag
> > > > ...
> > > >
> > > > Maybe we could use the highest bit for state?
> > >
> > > Or just
> > > state
> > > flag
> > > flag
> > > flag
> > > flag
> > > flag
> > > ...
> > >
> > > ?
> >
> > It is fine too. We should not add more states in foreseeable future.
>
> It's always possible to shuffle things around for cleanup later on,
> too. We don't have to provide binary compatibility for existing flags,
> and changing a couple of adjacent bits isn't a big deal to keep things
> neat. Or am I missing something?

No. It is definitely not a big deal.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-03 11:40   ` Kirill Tkhai
@ 2020-12-08 17:13     ` Yang Shi
  2020-12-09 15:41       ` Kirill Tkhai
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-08 17:13 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 3, 2020 at 3:40 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 02.12.2020 21:27, Yang Shi wrote:
> > Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> > will be used in the following cases:
> >     1. Non memcg aware shrinkers
> >     2. !CONFIG_MEMCG
> >     3. memcg is disabled by boot parameter
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 82 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cba0bc8d4661..d569fdcaba79 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  static DEFINE_IDR(shrinker_idr);
> >  static int shrinker_nr_max;
> >
> > +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> > +{
> > +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> > +             !mem_cgroup_disabled();
> > +}
> > +
> >  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  {
> >       int id, ret = -ENOMEM;
> > @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  #endif
> >       return false;
> >  }
> > +
> > +static inline long count_nr_deferred(struct shrinker *shrinker,
> > +                                  struct shrink_control *sc)
> > +{
> > +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> > +     struct memcg_shrinker_deferred *deferred;
> > +     struct mem_cgroup *memcg = sc->memcg;
> > +     int nid = sc->nid;
> > +     int id = shrinker->id;
> > +     long nr;
> > +
> > +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > +             nid = 0;
> > +
> > +     if (per_memcg_deferred) {
> > +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> > +                                                  true);
>
> My comment is about both 5/9 and 6/9 patches.

Sorry for the late reply, I don't know why Gmail filtered this into spam.

>
> shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that you will see
> memcg->nodeinfo[nid]->shrinker_deferred != NULL in count_nr_deferred(). This may occur
> because of processor reordering on !x86 (there is no a common lock or memory barriers).
>
> Regarding to shrinker_map this is not a problem due to map check in shrink_slab_memcg().
> The map can't be NULL there.
>
> Regarding to shrinker_deferred you should prove either this is not a problem too,
> or to add proper synchronization (maybe, based on barriers) or to add some similar check
> (maybe, in shrink_slab_memcg() too).

It seems shrink_slab_memcg() might see shrinker_deferred as NULL too,
due to the same reason. I don't think there is a guarantee it
won't happen.

We just need to guarantee that CSS_ONLINE is seen after shrinker_maps and
shrinker_deferred are allocated, so I suppose a barrier before
"css->flags |= CSS_ONLINE" should work.

So the below patch may be ok:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df128cab900f..9f7fb0450d69 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5539,6 +5539,12 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
                return -ENOMEM;
        }


+       /*
+        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees shrinker_maps
+        * and shrinker_deferred before CSS_ONLINE.
+        */
+       smp_mb();
+
        /* Online state pins memcg ID, memcg ID pins CSS */
        refcount_set(&memcg->id.ref, 1);
        css_get(css);


Or adding one more check for shrinker_deferred sounds acceptable as well,
something like the sketch below.
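
That check would amount to something like this in count_nr_deferred()
(just a sketch of the idea, not a tested patch):

	if (per_memcg_deferred) {
		deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
						     true);
		/*
		 * Racing with memcg online: shrinker_deferred may not be
		 * visible yet, so fall back to the shrinker's own counter.
		 */
		if (unlikely(!deferred))
			per_memcg_deferred = false;
	}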

> Kirill
>

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-08 17:13     ` Yang Shi
@ 2020-12-09 15:41       ` Kirill Tkhai
  2020-12-09 17:32         ` Yang Shi
  0 siblings, 1 reply; 41+ messages in thread
From: Kirill Tkhai @ 2020-12-09 15:41 UTC (permalink / raw)
  To: Yang Shi
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On 08.12.2020 20:13, Yang Shi wrote:
> On Thu, Dec 3, 2020 at 3:40 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 02.12.2020 21:27, Yang Shi wrote:
>>> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
>>> will be used in the following cases:
>>>     1. Non memcg aware shrinkers
>>>     2. !CONFIG_MEMCG
>>>     3. memcg is disabled by boot parameter
>>>
>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>> ---
>>>  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
>>>  1 file changed, 82 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index cba0bc8d4661..d569fdcaba79 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
>>>  static DEFINE_IDR(shrinker_idr);
>>>  static int shrinker_nr_max;
>>>
>>> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
>>> +{
>>> +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
>>> +             !mem_cgroup_disabled();
>>> +}
>>> +
>>>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>>>  {
>>>       int id, ret = -ENOMEM;
>>> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>>>  #endif
>>>       return false;
>>>  }
>>> +
>>> +static inline long count_nr_deferred(struct shrinker *shrinker,
>>> +                                  struct shrink_control *sc)
>>> +{
>>> +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
>>> +     struct memcg_shrinker_deferred *deferred;
>>> +     struct mem_cgroup *memcg = sc->memcg;
>>> +     int nid = sc->nid;
>>> +     int id = shrinker->id;
>>> +     long nr;
>>> +
>>> +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>>> +             nid = 0;
>>> +
>>> +     if (per_memcg_deferred) {
>>> +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
>>> +                                                  true);
>>
>> My comment is about both 5/9 and 6/9 patches.
> 
> Sorry for the late reply, I don't know why Gmail filtered this out to spam.
> 
>>
>> shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
>> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that you will see
>> memcg->nodeinfo[nid]->shrinker_deferred != NULL in count_nr_deferred(). This may occur
>> because of processor reordering on !x86 (there is no a common lock or memory barriers).
>>
>> Regarding to shrinker_map this is not a problem due to map check in shrink_slab_memcg().
>> The map can't be NULL there.
>>
>> Regarding to shrinker_deferred you should prove either this is not a problem too,
>> or to add proper synchronization (maybe, based on barriers) or to add some similar check
>> (maybe, in shrink_slab_memcg() too).
> 
> It seems shrink_slab_memcg() might see shrinker_deferred as NULL
> either due to the same reason. I don't think there is a guarantee it
> won't happen.
> 
> We just need guarantee CSS_ONLINE is seen after shrinker_maps and
> shrinker_deferred are allocated, so I'm supposed barriers before
> "css->flags |= CSS_ONLINE" should work.
> 
> So the below patch may be ok:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index df128cab900f..9f7fb0450d69 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5539,6 +5539,12 @@ static int mem_cgroup_css_online(struct
> cgroup_subsys_state *css)
>                 return -ENOMEM;
>         }
> 
> 
> +       /*
> +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
> shirnker_maps
> +        * and shrinker_deferred before CSS_ONLINE.
> +        */
> +       smp_mb();
> +
>         /* Online state pins memcg ID, memcg ID pins CSS */
>         refcount_set(&memcg->id.ref, 1);
>         css_get(css);

SMP barriers synchronize data access from different CPUs, and they should go in pairs.
In case you add an smp barrier to mem_cgroup_css_online(), we should also
add one more smp barrier in the other place that we want to synchronize with it.
Also, every place should contain a comment referring to its pair: "Pairs with...".

Kirill

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-09 15:41       ` Kirill Tkhai
@ 2020-12-09 17:32         ` Yang Shi
  2020-12-10 15:13           ` Johannes Weiner
  0 siblings, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-09 17:32 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Johannes Weiner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 9, 2020 at 7:42 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 08.12.2020 20:13, Yang Shi wrote:
> > On Thu, Dec 3, 2020 at 3:40 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >> On 02.12.2020 21:27, Yang Shi wrote:
> >>> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> >>> will be used in the following cases:
> >>>     1. Non memcg aware shrinkers
> >>>     2. !CONFIG_MEMCG
> >>>     3. memcg is disabled by boot parameter
> >>>
> >>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> >>> ---
> >>>  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
> >>>  1 file changed, 82 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index cba0bc8d4661..d569fdcaba79 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >>>  static DEFINE_IDR(shrinker_idr);
> >>>  static int shrinker_nr_max;
> >>>
> >>> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> >>> +{
> >>> +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> >>> +             !mem_cgroup_disabled();
> >>> +}
> >>> +
> >>>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >>>  {
> >>>       int id, ret = -ENOMEM;
> >>> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >>>  #endif
> >>>       return false;
> >>>  }
> >>> +
> >>> +static inline long count_nr_deferred(struct shrinker *shrinker,
> >>> +                                  struct shrink_control *sc)
> >>> +{
> >>> +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> >>> +     struct memcg_shrinker_deferred *deferred;
> >>> +     struct mem_cgroup *memcg = sc->memcg;
> >>> +     int nid = sc->nid;
> >>> +     int id = shrinker->id;
> >>> +     long nr;
> >>> +
> >>> +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> >>> +             nid = 0;
> >>> +
> >>> +     if (per_memcg_deferred) {
> >>> +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> >>> +                                                  true);
> >>
> >> My comment is about both 5/9 and 6/9 patches.
> >
> > Sorry for the late reply, I don't know why Gmail filtered this out to spam.
> >
> >>
> >> shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> >> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that you will see
> >> memcg->nodeinfo[nid]->shrinker_deferred != NULL in count_nr_deferred(). This may occur
> >> because of processor reordering on !x86 (there is no a common lock or memory barriers).
> >>
> >> Regarding to shrinker_map this is not a problem due to map check in shrink_slab_memcg().
> >> The map can't be NULL there.
> >>
> >> Regarding to shrinker_deferred you should prove either this is not a problem too,
> >> or to add proper synchronization (maybe, based on barriers) or to add some similar check
> >> (maybe, in shrink_slab_memcg() too).
> >
> > It seems shrink_slab_memcg() might see shrinker_deferred as NULL
> > either due to the same reason. I don't think there is a guarantee it
> > won't happen.
> >
> > We just need guarantee CSS_ONLINE is seen after shrinker_maps and
> > shrinker_deferred are allocated, so I'm supposed barriers before
> > "css->flags |= CSS_ONLINE" should work.
> >
> > So the below patch may be ok:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index df128cab900f..9f7fb0450d69 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5539,6 +5539,12 @@ static int mem_cgroup_css_online(struct
> > cgroup_subsys_state *css)
> >                 return -ENOMEM;
> >         }
> >
> >
> > +       /*
> > +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
> > shirnker_maps
> > +        * and shrinker_deferred before CSS_ONLINE.
> > +        */
> > +       smp_mb();
> > +
> >         /* Online state pins memcg ID, memcg ID pins CSS */
> >         refcount_set(&memcg->id.ref, 1);
> >         css_get(css);
>
> smp barriers synchronize data access from different cpus. They should go in a pair.
> In case of you add the smp barrier into mem_cgroup_css_online(), we should also
> add one more smp barrier in another place, which we want to synchonize with this.
> Also, every place should contain a comment referring to its pair: "Pairs with...".

Thanks, I think you are correct. Looking into it further, it seems the
race pattern is:

CPU A                                                                  CPU B
store shrinker_maps pointer                      load CSS_ONLINE
store CSS_ONLINE                                   load shrinker_maps pointer

By checking the memory-barriers document, it seems we need a write
barrier/read barrier pair, as below:

CPU A                                                                  CPU B
store shrinker_maps pointer                       load CSS_ONLINE
<write barrier>                                             <read barrier>
store CSS_ONLINE                                    load shrinker_maps pointer


So, the patch should look like:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df128cab900f..489c0a84f82b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5539,6 +5539,13 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
                return -ENOMEM;
        }

+       /*
+        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees shrinker_maps
+        * and shrinker_deferred before CSS_ONLINE. It pairs with the read barrier
+        * in shrink_slab_memcg().
+        */
+       smp_wmb();
+
        /* Online state pins memcg ID, memcg ID pins CSS */
        refcount_set(&memcg->id.ref, 1);
        css_get(css);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9d2a6485e982..fc9bda576d98 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -603,13 +603,15 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
        if (!mem_cgroup_online(memcg))
                return 0;

+       /* Pairs with write barrier in mem_cgroup_css_online */
+       smp_rmb();
+
        if (!down_read_trylock(&shrinker_rwsem))
                return 0;

+       /* Once memcg is online it can't be NULL */
        map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
                                        true);
-       if (unlikely(!map))
-               goto unlock;

        for_each_set_bit(i, map->map, shrinker_nr_max) {
                struct shrink_control sc = {


Does this seem correct?

>
> Kirill
>

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-09 17:32         ` Yang Shi
@ 2020-12-10 15:13           ` Johannes Weiner
  2020-12-10 15:17             ` Kirill Tkhai
  0 siblings, 1 reply; 41+ messages in thread
From: Johannes Weiner @ 2020-12-10 15:13 UTC (permalink / raw)
  To: Yang Shi
  Cc: Kirill Tkhai, Roman Gushchin, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Wed, Dec 09, 2020 at 09:32:37AM -0800, Yang Shi wrote:
> On Wed, Dec 9, 2020 at 7:42 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >
> > On 08.12.2020 20:13, Yang Shi wrote:
> > > On Thu, Dec 3, 2020 at 3:40 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> > >>
> > >> On 02.12.2020 21:27, Yang Shi wrote:
> > >>> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> > >>> will be used in the following cases:
> > >>>     1. Non memcg aware shrinkers
> > >>>     2. !CONFIG_MEMCG
> > >>>     3. memcg is disabled by boot parameter
> > >>>
> > >>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> > >>> ---
> > >>>  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
> > >>>  1 file changed, 82 insertions(+), 6 deletions(-)
> > >>>
> > >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >>> index cba0bc8d4661..d569fdcaba79 100644
> > >>> --- a/mm/vmscan.c
> > >>> +++ b/mm/vmscan.c
> > >>> @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > >>>  static DEFINE_IDR(shrinker_idr);
> > >>>  static int shrinker_nr_max;
> > >>>
> > >>> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> > >>> +{
> > >>> +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> > >>> +             !mem_cgroup_disabled();
> > >>> +}
> > >>> +
> > >>>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> > >>>  {
> > >>>       int id, ret = -ENOMEM;
> > >>> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> > >>>  #endif
> > >>>       return false;
> > >>>  }
> > >>> +
> > >>> +static inline long count_nr_deferred(struct shrinker *shrinker,
> > >>> +                                  struct shrink_control *sc)
> > >>> +{
> > >>> +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> > >>> +     struct memcg_shrinker_deferred *deferred;
> > >>> +     struct mem_cgroup *memcg = sc->memcg;
> > >>> +     int nid = sc->nid;
> > >>> +     int id = shrinker->id;
> > >>> +     long nr;
> > >>> +
> > >>> +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> > >>> +             nid = 0;
> > >>> +
> > >>> +     if (per_memcg_deferred) {
> > >>> +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> > >>> +                                                  true);
> > >>
> > >> My comment is about both 5/9 and 6/9 patches.
> > >
> > > Sorry for the late reply, I don't know why Gmail filtered this out to spam.
> > >
> > >>
> > >> shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> > >> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that you will see
> > >> memcg->nodeinfo[nid]->shrinker_deferred != NULL in count_nr_deferred(). This may occur
> > >> because of processor reordering on !x86 (there is no a common lock or memory barriers).
> > >>
> > >> Regarding to shrinker_map this is not a problem due to map check in shrink_slab_memcg().
> > >> The map can't be NULL there.
> > >>
> > >> Regarding to shrinker_deferred you should prove either this is not a problem too,
> > >> or to add proper synchronization (maybe, based on barriers) or to add some similar check
> > >> (maybe, in shrink_slab_memcg() too).
> > >
> > > It seems shrink_slab_memcg() might see shrinker_deferred as NULL
> > > either due to the same reason. I don't think there is a guarantee it
> > > won't happen.
> > >
> > > We just need guarantee CSS_ONLINE is seen after shrinker_maps and
> > > shrinker_deferred are allocated, so I'm supposed barriers before
> > > "css->flags |= CSS_ONLINE" should work.
> > >
> > > So the below patch may be ok:
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index df128cab900f..9f7fb0450d69 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -5539,6 +5539,12 @@ static int mem_cgroup_css_online(struct
> > > cgroup_subsys_state *css)
> > >                 return -ENOMEM;
> > >         }
> > >
> > >
> > > +       /*
> > > +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
> > > shirnker_maps
> > > +        * and shrinker_deferred before CSS_ONLINE.
> > > +        */
> > > +       smp_mb();
> > > +
> > >         /* Online state pins memcg ID, memcg ID pins CSS */
> > >         refcount_set(&memcg->id.ref, 1);
> > >         css_get(css);
> >
> > smp barriers synchronize data access from different cpus. They should go in a pair.
> > In case of you add the smp barrier into mem_cgroup_css_online(), we should also
> > add one more smp barrier in another place, which we want to synchonize with this.
> > Also, every place should contain a comment referring to its pair: "Pairs with...".
> 
> Thanks, I think you are correct. Looked into it further, it seems the
> race pattern looks like:
> 
> CPU A                                                                  CPU B
> store shrinker_maps pointer                      load CSS_ONLINE
> store CSS_ONLINE                                   load shrinker_maps pointer
> 
> By checking the memory-barriers document, it seems we need write
> barrier/read barrier pair as below:
> 
> CPU A                                                                  CPU B
> store shrinker_maps pointer                       load CSS_ONLINE
> <write barrier>                                             <read barrier>
> store CSS_ONLINE                                    load shrinker_maps pointer
> 
> 
> So, the patch should look like:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index df128cab900f..489c0a84f82b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5539,6 +5539,13 @@ static int mem_cgroup_css_online(struct
> cgroup_subsys_state *css)
>                 return -ENOMEM;
>         }
> 
> +       /*
> +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
> shirnker_maps
> +        * and shrinker_deferred before CSS_ONLINE. It pairs with the
> read barrier
> +        * in shrink_slab_memcg().
> +        */
> +       smp_wmb();

Is there a reason why the shrinker allocations aren't done in
.css_alloc()? That would take care of all necessary ordering:

      #0
      css = ->css_alloc()
      list_add_tail_rcu(css, parent->children)
        rcu_assign_pointer()
      ->css_online(css)
      css->flags |= CSS_ONLINE

      #1
      memcg = mem_cgroup_iter()
        list_entry_rcu()
	  rcu_dereference()
      shrink_slab(.., memcg)

RCU ensures that once the cgroup shows up in the reclaim cgroup it's
also fully allocated.
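
A rough sketch of that ordering, reusing memcg_alloc_shrinker_maps() and
the memcg_alloc_shrinker_deferred() helper assumed from patch 5/9, with
error unwinding omitted:

static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
	struct mem_cgroup *memcg = mem_cgroup_alloc();

	...
	/*
	 * Allocate shrinker_maps/shrinker_deferred before the css is
	 * published on the parent's ->children list, so mem_cgroup_iter()
	 * can never return a memcg without them.
	 */
	if (memcg_alloc_shrinker_maps(memcg) ||
	    memcg_alloc_shrinker_deferred(memcg))
		goto fail;
	...
}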

>         /* Online state pins memcg ID, memcg ID pins CSS */
>         refcount_set(&memcg->id.ref, 1);
>         css_get(css);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9d2a6485e982..fc9bda576d98 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -603,13 +603,15 @@ static unsigned long shrink_slab_memcg(gfp_t
> gfp_mask, int nid,
>         if (!mem_cgroup_online(memcg))
>                 return 0;

...then we should be able to delete this online check here entirely:

A not-yet online cgroup is guaranteed to have a shrinker map, just
with no bits set. The shrinker code handles that just fine.

An offlined cgroup will eventually have an empty bitmap as the called
shrinkers return SHRINK_EMPTY. This could also be shortcut by clearing
the bit in memcg_drain_list_lru_node() the same way we set it in the
parent when we move all objects upward, but seems correct as-is.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-10 15:13           ` Johannes Weiner
@ 2020-12-10 15:17             ` Kirill Tkhai
  2020-12-15 16:44               ` Johannes Weiner
  0 siblings, 1 reply; 41+ messages in thread
From: Kirill Tkhai @ 2020-12-10 15:17 UTC (permalink / raw)
  To: Johannes Weiner, Yang Shi
  Cc: Roman Gushchin, Shakeel Butt, Dave Chinner, Michal Hocko,
	Andrew Morton, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On 10.12.2020 18:13, Johannes Weiner wrote:
> On Wed, Dec 09, 2020 at 09:32:37AM -0800, Yang Shi wrote:
>> On Wed, Dec 9, 2020 at 7:42 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>
>>> On 08.12.2020 20:13, Yang Shi wrote:
>>>> On Thu, Dec 3, 2020 at 3:40 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>
>>>>> On 02.12.2020 21:27, Yang Shi wrote:
>>>>>> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
>>>>>> will be used in the following cases:
>>>>>>     1. Non memcg aware shrinkers
>>>>>>     2. !CONFIG_MEMCG
>>>>>>     3. memcg is disabled by boot parameter
>>>>>>
>>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
>>>>>> ---
>>>>>>  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
>>>>>>  1 file changed, 82 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index cba0bc8d4661..d569fdcaba79 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
>>>>>>  static DEFINE_IDR(shrinker_idr);
>>>>>>  static int shrinker_nr_max;
>>>>>>
>>>>>> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
>>>>>> +{
>>>>>> +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
>>>>>> +             !mem_cgroup_disabled();
>>>>>> +}
>>>>>> +
>>>>>>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>>>>>>  {
>>>>>>       int id, ret = -ENOMEM;
>>>>>> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>>>>>>  #endif
>>>>>>       return false;
>>>>>>  }
>>>>>> +
>>>>>> +static inline long count_nr_deferred(struct shrinker *shrinker,
>>>>>> +                                  struct shrink_control *sc)
>>>>>> +{
>>>>>> +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
>>>>>> +     struct memcg_shrinker_deferred *deferred;
>>>>>> +     struct mem_cgroup *memcg = sc->memcg;
>>>>>> +     int nid = sc->nid;
>>>>>> +     int id = shrinker->id;
>>>>>> +     long nr;
>>>>>> +
>>>>>> +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>>>>>> +             nid = 0;
>>>>>> +
>>>>>> +     if (per_memcg_deferred) {
>>>>>> +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
>>>>>> +                                                  true);
>>>>>
>>>>> My comment is about both 5/9 and 6/9 patches.
>>>>
>>>> Sorry for the late reply, I don't know why Gmail filtered this out to spam.
>>>>
>>>>>
>>>>> shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
>>>>> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that you will see
>>>>> memcg->nodeinfo[nid]->shrinker_deferred != NULL in count_nr_deferred(). This may occur
>>>>> because of processor reordering on !x86 (there is no a common lock or memory barriers).
>>>>>
>>>>> Regarding to shrinker_map this is not a problem due to map check in shrink_slab_memcg().
>>>>> The map can't be NULL there.
>>>>>
>>>>> Regarding to shrinker_deferred you should prove either this is not a problem too,
>>>>> or to add proper synchronization (maybe, based on barriers) or to add some similar check
>>>>> (maybe, in shrink_slab_memcg() too).
>>>>
>>>> It seems shrink_slab_memcg() might see shrinker_deferred as NULL
>>>> either due to the same reason. I don't think there is a guarantee it
>>>> won't happen.
>>>>
>>>> We just need guarantee CSS_ONLINE is seen after shrinker_maps and
>>>> shrinker_deferred are allocated, so I'm supposed barriers before
>>>> "css->flags |= CSS_ONLINE" should work.
>>>>
>>>> So the below patch may be ok:
>>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index df128cab900f..9f7fb0450d69 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -5539,6 +5539,12 @@ static int mem_cgroup_css_online(struct
>>>> cgroup_subsys_state *css)
>>>>                 return -ENOMEM;
>>>>         }
>>>>
>>>>
>>>> +       /*
>>>> +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
>>>> shirnker_maps
>>>> +        * and shrinker_deferred before CSS_ONLINE.
>>>> +        */
>>>> +       smp_mb();
>>>> +
>>>>         /* Online state pins memcg ID, memcg ID pins CSS */
>>>>         refcount_set(&memcg->id.ref, 1);
>>>>         css_get(css);
>>>
>>> smp barriers synchronize data access from different cpus. They should go in a pair.
>>> In case of you add the smp barrier into mem_cgroup_css_online(), we should also
>>> add one more smp barrier in another place, which we want to synchonize with this.
>>> Also, every place should contain a comment referring to its pair: "Pairs with...".
>>
>> Thanks, I think you are correct. Looked into it further, it seems the
>> race pattern looks like:
>>
>> CPU A                                                                  CPU B
>> store shrinker_maps pointer                      load CSS_ONLINE
>> store CSS_ONLINE                                   load shrinker_maps pointer
>>
>> By checking the memory-barriers document, it seems we need write
>> barrier/read barrier pair as below:
>>
>> CPU A                                                                  CPU B
>> store shrinker_maps pointer                       load CSS_ONLINE
>> <write barrier>                                             <read barrier>
>> store CSS_ONLINE                                    load shrinker_maps pointer
>>
>>
>> So, the patch should look like:
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index df128cab900f..489c0a84f82b 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5539,6 +5539,13 @@ static int mem_cgroup_css_online(struct
>> cgroup_subsys_state *css)
>>                 return -ENOMEM;
>>         }
>>
>> +       /*
>> +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
>> shirnker_maps
>> +        * and shrinker_deferred before CSS_ONLINE. It pairs with the
>> read barrier
>> +        * in shrink_slab_memcg().
>> +        */
>> +       smp_wmb();
> 
> Is there a reason why the shrinker allocations aren't done in
> .css_alloc()? That would take care of all necessary ordering:

The reason is that the allocations have to be made in a place where
mem_cgroup_iter() can't miss them, since memcg_expand_shrinker_maps()
must not miss already-allocated shrinker maps.

> 
>       #0
>       css = ->css_alloc()
>       list_add_tail_rcu(css, parent->children)
>         rcu_assign_pointer()
>       ->css_online(css)
>       css->flags |= CSS_ONLINE
> 
>       #1
>       memcg = mem_cgroup_iter()
>         list_entry_rcu()
> 	  rcu_dereference()
>       shrink_slab(.., memcg)
> 
> RCU ensures that once the cgroup shows up in the reclaim cgroup it's
> also fully allocated.
> 
>>         /* Online state pins memcg ID, memcg ID pins CSS */
>>         refcount_set(&memcg->id.ref, 1);
>>         css_get(css);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9d2a6485e982..fc9bda576d98 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -603,13 +603,15 @@ static unsigned long shrink_slab_memcg(gfp_t
>> gfp_mask, int nid,
>>         if (!mem_cgroup_online(memcg))
>>                 return 0;
> 
> ...then we should be able to delete this online check here entirely:
> 
> A not-yet online cgroup is guaranteed to have a shrinker map, just
> with no bits set. The shrinker code handles that just fine.
> 
> An offlined cgroup will eventually have an empty bitmap as the called
> shrinkers return SHRINK_EMPTY. This could also be shortcut by clearing
> the bit in memcg_drain_list_lru_node() the same way we set it in the
> parent when we move all objects upward, but seems correct as-is.
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-02 18:27 ` [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
  2020-12-03  3:06   ` Roman Gushchin
@ 2020-12-10 15:33   ` Johannes Weiner
  2020-12-10 19:12     ` Yang Shi
  2020-12-10 21:59     ` Yang Shi
  1 sibling, 2 replies; 41+ messages in thread
From: Johannes Weiner @ 2020-12-10 15:33 UTC (permalink / raw)
  To: Yang Shi
  Cc: guro, ktkhai, shakeelb, david, mhocko, akpm, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> @@ -504,6 +577,34 @@ int memcg_expand_shrinker_maps(int new_id)
>  	return ret;
>  }
>  
> +int memcg_expand_shrinker_deferred(int new_id)
> +{
> +	int size, old_size, ret = 0;
> +	struct mem_cgroup *memcg;
> +
> +	size = (new_id + 1) * sizeof(atomic_long_t);
> +	old_size = memcg_shrinker_deferred_size;
> +	if (size <= old_size)
> +		return 0;
> +
> +	mutex_lock(&memcg_shrinker_mutex);

The locking is somewhat confusing. I was wondering why we first read
memcg_shrinker_deferred_size "locklessly", then change it while
holding the &memcg_shrinker_mutex.

memcg_shrinker_deferred_size only changes under shrinker_rwsem(write),
correct? This should be documented in a comment, IMO.

memcg_shrinker_mutex looks superfluous then. The memcg allocation path
is the read-side of memcg_shrinker_deferred_size, and so simply needs
to take shrinker_rwsem(read) to lock out shrinker (de)registration.

Also, isn't memcg_shrinker_deferred_size just shrinker_nr_max? And
memcg_expand_shrinker_deferred() is only called when size >= old_size
in the first place (because id >= shrinker_nr_max)?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-10 15:33   ` Johannes Weiner
@ 2020-12-10 19:12     ` Yang Shi
  2020-12-11 17:52       ` Yang Shi
  2020-12-10 21:59     ` Yang Shi
  1 sibling, 1 reply; 41+ messages in thread
From: Yang Shi @ 2020-12-10 19:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 10, 2020 at 7:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > @@ -504,6 +577,34 @@ int memcg_expand_shrinker_maps(int new_id)
> >       return ret;
> >  }
> >
> > +int memcg_expand_shrinker_deferred(int new_id)
> > +{
> > +     int size, old_size, ret = 0;
> > +     struct mem_cgroup *memcg;
> > +
> > +     size = (new_id + 1) * sizeof(atomic_long_t);
> > +     old_size = memcg_shrinker_deferred_size;
> > +     if (size <= old_size)
> > +             return 0;
> > +
> > +     mutex_lock(&memcg_shrinker_mutex);
>
> The locking is somewhat confusing. I was wondering why we first read
> memcg_shrinker_deferred_size "locklessly", then change it while
> holding the &memcg_shrinker_mutex.
>
> memcg_shrinker_deferred_size only changes under shrinker_rwsem(write),
> correct? This should be documented in a comment, IMO.

Yes, it is correct.

>
> memcg_shrinker_mutex looks superfluous then. The memcg allocation path
> is the read-side of memcg_shrinker_deferred_size, and so simply needs
> to take shrinker_rwsem(read) to lock out shrinker (de)registration.

I see your point. Yes, it seems shrinker_{maps|deferred} allocation
could be synchronized with shrinker registration by shrinker_rwsem.

memcg_shrinker_mutex is just renamed from memcg_shrinker_map_mutex,
which was introduced by the shrinker_maps patchset. I'm not quite sure why
this mutex was introduced in the first place; I guess the main purpose
was to *not* exacerbate the contention on shrinker_rwsem?

If that contention is not a concern, we could remove that dedicated mutex.

>
> Also, isn't memcg_shrinker_deferred_size just shrinker_nr_max? And

No, it is variable. It is nr * sizeof(atomic_long_t), where nr is the
current last shrinker ID. If a new shrinker is registered, nr may
grow.

> memcg_expand_shrinker_deferred() is only called when size >= old_size
> in the first place (because id >= shrinker_nr_max)?

Yes.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-10 15:33   ` Johannes Weiner
  2020-12-10 19:12     ` Yang Shi
@ 2020-12-10 21:59     ` Yang Shi
  1 sibling, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-10 21:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 10, 2020 at 7:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > @@ -504,6 +577,34 @@ int memcg_expand_shrinker_maps(int new_id)
> >       return ret;
> >  }
> >
> > +int memcg_expand_shrinker_deferred(int new_id)
> > +{
> > +     int size, old_size, ret = 0;
> > +     struct mem_cgroup *memcg;
> > +
> > +     size = (new_id + 1) * sizeof(atomic_long_t);
> > +     old_size = memcg_shrinker_deferred_size;
> > +     if (size <= old_size)
> > +             return 0;
> > +
> > +     mutex_lock(&memcg_shrinker_mutex);
>
> The locking is somewhat confusing. I was wondering why we first read
> memcg_shrinker_deferred_size "locklessly", then change it while
> holding the &memcg_shrinker_mutex.

Concurrent shrinker registrations may race. But they should
get different IDs, so it does not seem to matter.

I agree it is a little bit confusing and not that straightforward; it
does deserve some comments in the code.

>
> memcg_shrinker_deferred_size only changes under shrinker_rwsem(write),
> correct? This should be documented in a comment, IMO.
>
> memcg_shrinker_mutex looks superfluous then. The memcg allocation path
> is the read-side of memcg_shrinker_deferred_size, and so simply needs
> to take shrinker_rwsem(read) to lock out shrinker (de)registration.
>
> Also, isn't memcg_shrinker_deferred_size just shrinker_nr_max? And
> memcg_expand_shrinker_deferred() is only called when size >= old_size
> in the first place (because id >= shrinker_nr_max)?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
  2020-12-10 19:12     ` Yang Shi
@ 2020-12-11 17:52       ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-11 17:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 10, 2020 at 11:12 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 7:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Dec 02, 2020 at 10:27:21AM -0800, Yang Shi wrote:
> > > @@ -504,6 +577,34 @@ int memcg_expand_shrinker_maps(int new_id)
> > >       return ret;
> > >  }
> > >
> > > +int memcg_expand_shrinker_deferred(int new_id)
> > > +{
> > > +     int size, old_size, ret = 0;
> > > +     struct mem_cgroup *memcg;
> > > +
> > > +     size = (new_id + 1) * sizeof(atomic_long_t);
> > > +     old_size = memcg_shrinker_deferred_size;
> > > +     if (size <= old_size)
> > > +             return 0;
> > > +
> > > +     mutex_lock(&memcg_shrinker_mutex);
> >
> > The locking is somewhat confusing. I was wondering why we first read
> > memcg_shrinker_deferred_size "locklessly", then change it while
> > holding the &memcg_shrinker_mutex.
> >
> > memcg_shrinker_deferred_size only changes under shrinker_rwsem(write),
> > correct? This should be documented in a comment, IMO.
>
> Yes, it is correct.
>
> >
> > memcg_shrinker_mutex looks superfluous then. The memcg allocation path
> > is the read-side of memcg_shrinker_deferred_size, and so simply needs
> > to take shrinker_rwsem(read) to lock out shrinker (de)registration.
>
> I see your point. Yes, it seems the shrinker_{maps|deferred} allocation
> could be synchronized with shrinker registration by shrinker_rwsem.
>
> memcg_shrinker_mutex is just memcg_shrinker_map_mutex renamed; the
> latter was introduced by the shrinker_maps patchset. I'm not quite sure
> why that mutex was introduced in the first place; I guess the main
> purpose was to *not* exacerbate the contention on shrinker_rwsem?
>
> If that contention is not a concern, we could remove that dedicated mutex.

It seems using shrinker_rwsem instead of the dedicated mutex should not
exacerbate the contention since we just add one read-side critical
section. Will do it in v2.
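
Roughly like the below on the memcg allocation side (an untested sketch
of the v2 direction, keeping the v1 helper name; error unwinding of the
already-allocated nodes is omitted for brevity):

static int memcg_alloc_shrinker_deferred(struct mem_cgroup *memcg)
{
	struct memcg_shrinker_deferred *deferred;
	int nid, size, ret = 0;

	/*
	 * Taking shrinker_rwsem for read locks out shrinker
	 * (de)registration, so memcg_shrinker_deferred_size can't change
	 * underneath us.  This replaces the dedicated memcg_shrinker_mutex.
	 */
	down_read(&shrinker_rwsem);
	size = memcg_shrinker_deferred_size;
	for_each_node(nid) {
		deferred = kvzalloc_node(sizeof(*deferred) + size, GFP_KERNEL, nid);
		if (!deferred) {
			ret = -ENOMEM;
			break;
		}
		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, deferred);
	}
	up_read(&shrinker_rwsem);

	return ret;
}

In practice that probably means moving this helper next to shrinker_rwsem
in vmscan.c, since the rwsem is static there.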


>
> >
> > Also, isn't memcg_shrinker_deferred_size just shrinker_nr_max? And
>
> No, it is variable. It is nr * sizeof(atomic_long_t), where nr is
> derived from the current last shrinker ID. If a new shrinker is
> registered, nr may grow.
>
> > memcg_expand_shrinker_deferred() is only called when size >= old_size
> > in the first place (because id >= shrinker_nr_max)?
>
> Yes.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint
  2020-12-03  3:13   ` Xiaqing (A)
@ 2020-12-11 19:20     ` Yang Shi
  0 siblings, 0 replies; 41+ messages in thread
From: Yang Shi @ 2020-12-11 19:20 UTC (permalink / raw)
  To: Xiaqing (A)
  Cc: Roman Gushchin, Kirill Tkhai, Shakeel Butt, Dave Chinner,
	Johannes Weiner, Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Liu Yi

On Wed, Dec 2, 2020 at 7:13 PM Xiaqing (A) <saberlily.xia@hisilicon.com> wrote:
>
>
>
> On 2020/12/3 2:27, Yang Shi wrote:
> > The tracepoint's nid should show which node the shrink happens on. The start tracepoint
> > uses the nid from shrinkctl, but the nid might be set to 0 before the end tracepoint if
> > the shrinker is not NUMA aware, so the tracing log may show the shrink starting on one
> > node but ending on another, which is confusing.  The following patch will remove the
> > direct use of nid in do_shrink_slab(), so this change also helps clean up the code.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >   mm/vmscan.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 7d6186a07daf..457ce04eebf2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -533,7 +533,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >       new_nr = atomic_long_add_return(next_deferred,
> >                                       &shrinker->nr_deferred[nid]);
> >
> > -     trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
> > +     trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
>
> Hi, Yang
>
> When I read this patch, I wondered why you changed this until I read patch 6. Could you
> merge this patch into patch 6?

Sorry for the late reply. It could be, but I was inclined to think
this is a bug fix that we might need to backport to stable, so I left
it as a standalone patch.

>
> Thanks!
>
> >       return freed;
> >   }
> >
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker
  2020-12-10 15:17             ` Kirill Tkhai
@ 2020-12-15 16:44               ` Johannes Weiner
  0 siblings, 0 replies; 41+ messages in thread
From: Johannes Weiner @ 2020-12-15 16:44 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Yang Shi, Roman Gushchin, Shakeel Butt, Dave Chinner,
	Michal Hocko, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Thu, Dec 10, 2020 at 06:17:54PM +0300, Kirill Tkhai wrote:
> On 10.12.2020 18:13, Johannes Weiner wrote:
> > On Wed, Dec 09, 2020 at 09:32:37AM -0800, Yang Shi wrote:
> >> On Wed, Dec 9, 2020 at 7:42 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>>
> >>> On 08.12.2020 20:13, Yang Shi wrote:
> >>>> On Thu, Dec 3, 2020 at 3:40 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>>>>
> >>>>> On 02.12.2020 21:27, Yang Shi wrote:
> >>>>>> Use the per-memcg nr_deferred for memcg aware shrinkers.  The shrinker's nr_deferred
> >>>>>> will be used in the following cases:
> >>>>>>     1. Non memcg aware shrinkers
> >>>>>>     2. !CONFIG_MEMCG
> >>>>>>     3. memcg is disabled by boot parameter
> >>>>>>
> >>>>>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> >>>>>> ---
> >>>>>>  mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++----
> >>>>>>  1 file changed, 82 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>>> index cba0bc8d4661..d569fdcaba79 100644
> >>>>>> --- a/mm/vmscan.c
> >>>>>> +++ b/mm/vmscan.c
> >>>>>> @@ -203,6 +203,12 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >>>>>>  static DEFINE_IDR(shrinker_idr);
> >>>>>>  static int shrinker_nr_max;
> >>>>>>
> >>>>>> +static inline bool is_deferred_memcg_aware(struct shrinker *shrinker)
> >>>>>> +{
> >>>>>> +     return (shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> >>>>>> +             !mem_cgroup_disabled();
> >>>>>> +}
> >>>>>> +
> >>>>>>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >>>>>>  {
> >>>>>>       int id, ret = -ENOMEM;
> >>>>>> @@ -271,7 +277,58 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >>>>>>  #endif
> >>>>>>       return false;
> >>>>>>  }
> >>>>>> +
> >>>>>> +static inline long count_nr_deferred(struct shrinker *shrinker,
> >>>>>> +                                  struct shrink_control *sc)
> >>>>>> +{
> >>>>>> +     bool per_memcg_deferred = is_deferred_memcg_aware(shrinker) && sc->memcg;
> >>>>>> +     struct memcg_shrinker_deferred *deferred;
> >>>>>> +     struct mem_cgroup *memcg = sc->memcg;
> >>>>>> +     int nid = sc->nid;
> >>>>>> +     int id = shrinker->id;
> >>>>>> +     long nr;
> >>>>>> +
> >>>>>> +     if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
> >>>>>> +             nid = 0;
> >>>>>> +
> >>>>>> +     if (per_memcg_deferred) {
> >>>>>> +             deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
> >>>>>> +                                                  true);
> >>>>>
> >>>>> My comment is about both 5/9 and 6/9 patches.
> >>>>
> >>>> Sorry for the late reply, I don't know why Gmail filtered this into spam.
> >>>>
> >>>>>
> >>>>> shrink_slab_memcg() races with mem_cgroup_css_online(). Visibility of the CSS_ONLINE
> >>>>> flag in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that you will see
> >>>>> memcg->nodeinfo[nid]->shrinker_deferred != NULL in count_nr_deferred(). This may occur
> >>>>> because of processor reordering on !x86 (there is no common lock or memory barrier).
> >>>>>
> >>>>> Regarding shrinker_map, this is not a problem due to the map check in
> >>>>> shrink_slab_memcg(): the map can't be NULL there.
> >>>>>
> >>>>> Regarding shrinker_deferred, you should either prove this is not a problem too,
> >>>>> or add proper synchronization (maybe based on barriers), or add a similar check
> >>>>> (maybe in shrink_slab_memcg() too).
> >>>>
> >>>> It seems shrink_slab_memcg() might see shrinker_deferred as NULL
> >>>> as well, for the same reason. I don't think there is a guarantee it
> >>>> won't happen.
> >>>>
> >>>> We just need to guarantee CSS_ONLINE is seen after shrinker_maps and
> >>>> shrinker_deferred are allocated, so I suppose a barrier before
> >>>> "css->flags |= CSS_ONLINE" should work.
> >>>>
> >>>> So the below patch may be ok:
> >>>>
> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>> index df128cab900f..9f7fb0450d69 100644
> >>>> --- a/mm/memcontrol.c
> >>>> +++ b/mm/memcontrol.c
> >>>> @@ -5539,6 +5539,12 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> >>>>                 return -ENOMEM;
> >>>>         }
> >>>>
> >>>>
> >>>> +       /*
> >>>> +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
> >>>> +        * shrinker_maps and shrinker_deferred before CSS_ONLINE.
> >>>> +        */
> >>>> +       smp_mb();
> >>>> +
> >>>>         /* Online state pins memcg ID, memcg ID pins CSS */
> >>>>         refcount_set(&memcg->id.ref, 1);
> >>>>         css_get(css);
> >>>
> >>> smp barriers synchronize data access from different CPUs, and they should go in pairs.
> >>> If you add an smp barrier in mem_cgroup_css_online(), we should also add one more
> >>> smp barrier in the other place we want to synchronize with it.
> >>> Also, every such place should contain a comment referring to its pair: "Pairs with ...".
> >>
> >> Thanks, I think you are correct. Looking into it further, it seems the
> >> race pattern is:
> >>
> >> CPU A                                 CPU B
> >> store shrinker_maps pointer           load CSS_ONLINE
> >> store CSS_ONLINE                      load shrinker_maps pointer
> >>
> >> By checking the memory-barriers document, it seems we need a write
> >> barrier/read barrier pair as below:
> >>
> >> CPU A                                 CPU B
> >> store shrinker_maps pointer           load CSS_ONLINE
> >> <write barrier>                       <read barrier>
> >> store CSS_ONLINE                      load shrinker_maps pointer
> >>
> >>
> >> So, the patch should look like:
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index df128cab900f..489c0a84f82b 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -5539,6 +5539,13 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> >>                 return -ENOMEM;
> >>         }
> >>
> >> +       /*
> >> +        * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees
> >> +        * shrinker_maps and shrinker_deferred before CSS_ONLINE.
> >> +        * It pairs with the read barrier in shrink_slab_memcg().
> >> +        */
> >> +       smp_wmb();
> > 
> > Is there a reason why the shrinker allocations aren't done in
> > .css_alloc()? That would take care of all necessary ordering:
> 
> The reason is that the allocation has to be made at a point where
> mem_cgroup_iter() can't miss the memcg, since memcg_expand_shrinker_maps()
> must not miss any already-allocated shrinker maps.

I see, because we could have this:

.css_alloc()
  memcg_alloc_shrinker_maps()
    down_read(&shrinker_sem)
    map = alloc(shrinker_nr_max * sizeof(long));
    rcu_assign_pointer(memcg->...->shrinker_map = map);
    up_read(&shrinker_sem);
                                                            register_shrinker()
                                                              down_write(&shrinker_sem)
                                                              shrinker_nr_max = id + 1;
                                                              memcg_expand_shrinker_maps()
                                                                for_each_mem_cgroup()
                                                                  realloc
                                                              up_write(&shrinker_sem)
  list_add_tail_rcu(&css->sibling, &parent->children);

  /* boom: missed new shrinker, map too small */

Thanks for the clarification.
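
For completeness, the read-side half of the barrier pairing proposed
earlier in the thread would look roughly like the below in
shrink_slab_memcg() (an assumption based on the discussion above, not
code that was actually posted):

	if (!mem_cgroup_online(memcg))
		return 0;

	/*
	 * Pairs with the smp_wmb() before "css->flags |= CSS_ONLINE" in
	 * mem_cgroup_css_online(): if CSS_ONLINE is visible here, the
	 * shrinker_map and shrinker_deferred pointers are visible too.
	 */
	smp_rmb();

	map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
					true);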

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2020-12-15 16:52 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
2020-12-02 18:27 [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Yang Shi
2020-12-02 18:27 ` [PATCH 1/9] mm: vmscan: simplify nr_deferred update code Yang Shi
2020-12-03  2:56   ` Roman Gushchin
2020-12-02 18:27 ` [PATCH 2/9] mm: vmscan: use nid from shrink_control for tracepoint Yang Shi
2020-12-03  3:13   ` Xiaqing (A)
2020-12-11 19:20     ` Yang Shi
2020-12-02 18:27 ` [PATCH 3/9] mm: memcontrol: rename memcg_shrinker_map_mutex to memcg_shrinker_mutex Yang Shi
2020-12-02 18:27 ` [PATCH 4/9] mm: vmscan: use a new flag to indicate shrinker is registered Yang Shi
2020-12-03  3:01   ` Roman Gushchin
2020-12-03  4:59     ` Yang Shi
2020-12-03 20:08       ` Roman Gushchin
2020-12-03 22:25         ` Yang Shi
2020-12-04 18:52           ` Johannes Weiner
2020-12-04 21:24             ` Yang Shi
2020-12-02 18:27 ` [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred Yang Shi
2020-12-03  3:06   ` Roman Gushchin
2020-12-03  4:54     ` Yang Shi
2020-12-03 18:03       ` Yang Shi
2020-12-03 20:07         ` Roman Gushchin
2020-12-03 22:49           ` Yang Shi
2020-12-03 23:30             ` Roman Gushchin
2020-12-04  0:22               ` Yang Shi
2020-12-10 15:33   ` Johannes Weiner
2020-12-10 19:12     ` Yang Shi
2020-12-11 17:52       ` Yang Shi
2020-12-10 21:59     ` Yang Shi
2020-12-02 18:27 ` [PATCH 6/9] mm: vmscan: use per memcg nr_deferred of shrinker Yang Shi
2020-12-03  3:08   ` Roman Gushchin
2020-12-03  5:01     ` Yang Shi
2020-12-03 11:40   ` Kirill Tkhai
2020-12-08 17:13     ` Yang Shi
2020-12-09 15:41       ` Kirill Tkhai
2020-12-09 17:32         ` Yang Shi
2020-12-10 15:13           ` Johannes Weiner
2020-12-10 15:17             ` Kirill Tkhai
2020-12-15 16:44               ` Johannes Weiner
2020-12-02 18:27 ` [PATCH 7/9] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers Yang Shi
2020-12-02 18:27 ` [PATCH 8/9] mm: memcontrol: reparent nr_deferred when memcg offline Yang Shi
2020-12-02 18:27 ` [PATCH 9/9] mm: vmscan: shrink deferred objects proportional to priority Yang Shi
2020-12-03  2:52 ` [RFC PATCH 0/9] Make shrinker's nr_deferred memcg aware Roman Gushchin
2020-12-03 17:52   ` Yang Shi
