linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/7] make slab shrink lockless
@ 2023-02-23 13:27 Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
                   ` (7 more replies)
  0 siblings, 8 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often observe the following system CPU hotspots:

  44.16%  [kernel]  [k] down_read_trylock
  14.12%  [kernel]  [k] up_read
  13.43%  [kernel]  [k] shrink_slab
   5.25%  [kernel]  [k] count_shadow_nodes
   3.42%  [kernel]  [k] idr_find

We then used bpftrace to capture the call trace:

@[
    down_read_trylock+5
    shrink_slab+292
    shrink_node+640
    do_try_to_free_pages+211
    try_to_free_mem_cgroup_pages+266
    try_charge_memcg+386
    charge_memcg+51
    __mem_cgroup_charge+44
    __handle_mm_fault+1416
    handle_mm_fault+260
    do_user_addr_fault+459
    exc_page_fault+104
    asm_exc_page_fault+38
    clear_user_rep_good+18
    read_zero+100
    vfs_read+176
    ksys_read+93
    do_syscall_64+62
    entry_SYSCALL_64_after_hwframe+114
]: 1868979

It is easy to see that these hotspots are caused by frequent failures to take the
read lock of shrinker_rwsem while reclaiming slab memory.

Currently, shrinker_rwsem is a global lock, and the following cases can produce the
above system CPU hotspots:

a. The write lock of shrinker_rwsem is held for too long. For example, when there
   are many memcgs in the system, some paths hold the lock while traversing all of
   them (e.g. expand_shrinker_info()).
b. The read lock of shrinker_rwsem is held for too long and a writer arrives in the
   meantime. The writer is then forced to wait and blocks all subsequent readers.
   For example:
   - a task is scheduled out while holding the read lock of shrinker_rwsem in
     do_shrink_slab()
   - some shrinkers block for too long, like the case mentioned in the
     patchset[1].

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/

All of the down_read_trylock() hotspots caused by the above cases can be solved by
replacing the shrinker_rwsem trylocks with SRCU, as sketched below.
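
The reader-side change at the core of this series looks roughly as follows (a
condensed sketch of the shrink_slab() loop before and after, not the full code;
shrinker_srcu is the srcu_struct introduced in patch 2):

```
	/* Before: readers take shrinker_rwsem and bail out on contention. */
	if (!down_read_trylock(&shrinker_rwsem))
		goto out;
	list_for_each_entry(shrinker, &shrinker_list, list)
		freed += do_shrink_slab(&sc, shrinker, priority);
	up_read(&shrinker_rwsem);

	/* After: readers only enter an SRCU read-side critical section. */
	srcu_idx = srcu_read_lock(&shrinker_srcu);
	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu))
		freed += do_shrink_slab(&sc, shrinker, priority);
	srcu_read_unlock(&shrinker_srcu, srcu_idx);
```

On the writer side, unregister_shrinker() switches to list_del_rcu() and waits for
an SRCU grace period (synchronize_srcu()) before the shrinker can be freed.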

2. Survey
=========

Before writing the implementation, I found several similar submissions in the
community:

a. Davidlohr Bueso submitted a patch in 2015.
   Subject: [PATCH -next v2] mm: srcu-ify shrinkers
   Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
   Result: It was finally merged into the linux-next branch, but failed on arm
           allnoconfig (without CONFIG_SRCU)

b. Tetsuo Handa submitted a patchset in 2017.
   Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
   Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
   Result: The current simple approach (break out when rwsem_is_contended()) was
           chosen in the end. Christoph Hellwig suggested using SRCU, but SRCU was
           not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
   Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
   Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
   Result: At that time, SRCU was not unconditionally enabled, and there were
           some objections to enabling it. Later, Kirill's focus moved to other
           things and the patchset was no longer updated.

d. Sultan Alsawaf submitted a patch in 2021.
   Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
   Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
   Result: Rejected because SRCU was not unconditionally enabled.

We can see that almost all of these historical submissions were abandoned because
SRCU was not unconditionally enabled. But SRCU was made unconditionally enabled by
Paul E. McKenney in 2023 [2], so it's now time to replace the shrinker_rwsem
trylocks with SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/

3. Reproduction and testing
===========================

We can reproduce the down_read_trylock() hotspot with the following script:

```
#!/bin/bash
DIR="/root/shrinker/memcg/mnt"

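# create a 200M-limited parent memcg, plus one child memcg and mount dir per index 0..$1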
do_create()
{
        mkdir /sys/fs/cgroup/memory/test
        echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
        for i in `seq 0 $1`;
        do
                mkdir /sys/fs/cgroup/memory/test/$i;
                echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
                mkdir -p $DIR/$i;
        done
}

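# mount a separate tmpfs (each superblock registers its own shrinker) at $DIR/$i for i in $1..$2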
do_mount()
{
        for i in `seq $1 $2`;
        do
                mount -t tmpfs $i $DIR/$i;
        done
}

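# move into each child memcg and write 1M to its tmpfs, charging memory against the 200M limit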
do_touch()
{
        for i in `seq $1 $2`;
        do
                echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
                dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
        done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Save and execute the above script, and we get the following perf hotspots:

  46.60%  [kernel]  [k] down_read_trylock
  18.70%  [kernel]  [k] up_read
  15.44%  [kernel]  [k] shrink_slab
   4.37%  [kernel]  [k] _find_next_bit
   2.75%  [kernel]  [k] xa_load
   2.07%  [kernel]  [k] idr_find
   1.73%  [kernel]  [k] do_shrink_slab
   1.42%  [kernel]  [k] shrink_lruvec
   0.74%  [kernel]  [k] shrink_node
   0.60%  [kernel]  [k] list_lru_count_one

After applying this patchset, the hotspots become:

  19.53%  [kernel]  [k] _find_next_bit
  14.63%  [kernel]  [k] do_shrink_slab
  14.58%  [kernel]  [k] shrink_slab
  11.83%  [kernel]  [k] shrink_lruvec
   9.33%  [kernel]  [k] __blk_flush_plug
   6.67%  [kernel]  [k] mem_cgroup_iter
   3.73%  [kernel]  [k] list_lru_count_one
   2.43%  [kernel]  [k] shrink_node
   1.96%  [kernel]  [k] super_cache_count
   1.78%  [kernel]  [k] __rcu_read_unlock
   1.38%  [kernel]  [k] __srcu_read_lock
   1.30%  [kernel]  [k] xas_descend

We can see that slab reclaim is no longer blocked by the shrinker_rwsem trylock,
making the slab shrink lockless.

This series is based on next-20230217.

Comments and suggestions are welcome.

Thanks,
Qi.

Changelog in v1 -> v2:
 - add a map_nr_max field to shrinker_info (suggested by Kirill)
 - use shrinker_mutex in reparent_shrinker_deferred() (pointed out by Kirill)

Qi Zheng (7):
  mm: vmscan: add a map_nr_max field to shrinker_info
  mm: vmscan: make global slab shrink lockless
  mm: vmscan: make memcg slab shrink lockless
  mm: shrinkers: make count and scan in shrinker debugfs lockless
  mm: vmscan: hold write lock to reparent shrinker nr_deferred
  mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  mm: shrinkers: convert shrinker_rwsem to mutex

 drivers/md/dm-cache-metadata.c |   2 +-
 drivers/md/dm-thin-metadata.c  |   2 +-
 fs/super.c                     |   2 +-
 include/linux/memcontrol.h     |   1 +
 mm/shrinker_debug.c            |  38 ++++-----
 mm/vmscan.c                    | 142 +++++++++++++++++----------------
 6 files changed, 92 insertions(+), 95 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
@ 2023-02-23 13:27 ` Qi Zheng
  2023-02-25  8:18   ` Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless Qi Zheng
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

To prepare for the subsequent lockless memcg slab shrink,
add a map_nr_max field to struct shrinker_info to record
its own real shrinker_nr_max, so that iteration over
info->map can be bounded by that particular map's size
rather than by the global shrinker_nr_max.

No functional changes.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/memcontrol.h |  1 +
 mm/vmscan.c                | 29 ++++++++++++++++++-----------
 2 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..aa69ea98e2d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,7 @@ struct shrinker_info {
 	struct rcu_head rcu;
 	atomic_long_t *nr_deferred;
 	unsigned long *map;
+	int map_nr_max;
 };
 
 struct lruvec_stats_percpu {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..9f895ca6216c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 					 lockdep_is_held(&shrinker_rwsem));
 }
 
+static inline bool need_expand(int new_nr_max, int old_nr_max)
+{
+	return round_up(new_nr_max, BITS_PER_LONG) >
+	       round_up(old_nr_max, BITS_PER_LONG);
+}
+
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 				    int map_size, int defer_size,
-				    int old_map_size, int old_defer_size)
+				    int old_map_size, int old_defer_size,
+				    int new_nr_max)
 {
 	struct shrinker_info *new, *old;
 	struct mem_cgroup_per_node *pn;
@@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 		if (!old)
 			return 0;
 
+		if (!need_expand(new_nr_max, old->map_nr_max))
+			return 0;
+
 		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
 		if (!new)
 			return -ENOMEM;
 
 		new->nr_deferred = (atomic_long_t *)(new + 1);
 		new->map = (void *)new->nr_deferred + defer_size;
+		new->map_nr_max = new_nr_max;
 
 		/* map: set all old bits, clear all new bits */
 		memset(new->map, (int)0xff, old_map_size);
@@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 		}
 		info->nr_deferred = (atomic_long_t *)(info + 1);
 		info->map = (void *)info->nr_deferred + defer_size;
+		info->map_nr_max = shrinker_nr_max;
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
 	}
 	up_write(&shrinker_rwsem);
@@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 	return ret;
 }
 
-static inline bool need_expand(int nr_max)
-{
-	return round_up(nr_max, BITS_PER_LONG) >
-	       round_up(shrinker_nr_max, BITS_PER_LONG);
-}
-
 static int expand_shrinker_info(int new_id)
 {
 	int ret = 0;
@@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
 	int old_map_size, old_defer_size = 0;
 	struct mem_cgroup *memcg;
 
-	if (!need_expand(new_nr_max))
+	if (!need_expand(new_nr_max, shrinker_nr_max))
 		goto out;
 
 	if (!root_mem_cgroup)
@@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
-					       old_map_size, old_defer_size);
+					       old_map_size, old_defer_size,
+					       new_nr_max);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
 			goto out;
@@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
-		for (i = 0; i < shrinker_nr_max; i++) {
+		for (i = 0; i < child_info->map_nr_max; i++) {
 			nr = atomic_long_read(&child_info->nr_deferred[i]);
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
@@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (unlikely(!info))
 		goto unlock;
 
-	for_each_set_bit(i, info->map, shrinker_nr_max) {
+	for_each_set_bit(i, info->map, info->map_nr_max) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
@ 2023-02-23 13:27 ` Qi Zheng
  2023-02-23 15:26   ` Rafael Aquini
  2023-02-23 18:24   ` Sultan Alsawaf
  2023-02-23 13:27 ` [PATCH v2 3/7] mm: vmscan: make memcg " Qi Zheng
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

The shrinker_rwsem is a global lock in the shrinker
subsystem, and it can easily cause blocking in the
following cases:

a. The write lock of shrinker_rwsem is held for too long.
   For example, when there are many memcgs in the system,
   some paths hold the lock while traversing all of them
   (e.g. expand_shrinker_info()).
b. The read lock of shrinker_rwsem is held for too long
   and a writer arrives in the meantime. The writer is
   then forced to wait and blocks all subsequent readers.
   For example:
   - a task is scheduled out while holding the read lock
     of shrinker_rwsem in do_shrink_slab()
   - some shrinkers block for too long, like the case
     mentioned in the patchset[1].

Therefore, many times in history ([2],[3],[4],[5]), people
have wanted to replace the shrinker_rwsem reader side with
SRCU, but they all gave up because SRCU was not
unconditionally enabled.

But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
SRCU is unconditionally enabled. So it's time to use SRCU
to protect readers that previously held shrinker_rwsem.

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
[3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
[4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
[5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f895ca6216c..02987a6f95d1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 
 LIST_HEAD(shrinker_list);
 DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);
 
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
@@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 void register_shrinker_prepared(struct shrinker *shrinker)
 {
 	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
 	up_write(&shrinker_rwsem);
@@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
 		return;
 
 	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
+	list_del_rcu(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
 	up_write(&shrinker_rwsem);
 
+	synchronize_srcu(&shrinker_srcu);
+
 	debugfs_remove_recursive(debugfs_entry);
 
 	kfree(shrinker->nr_deferred);
@@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
 {
 	down_write(&shrinker_rwsem);
 	up_write(&shrinker_rwsem);
+	synchronize_srcu(&shrinker_srcu);
 }
 EXPORT_SYMBOL(synchronize_shrinkers);
 
@@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
+	int srcu_idx;
 
 	/*
 	 * The root memcg might be allocated even though memcg is disabled
@@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		goto out;
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
 
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
+				 srcu_read_lock_held(&shrinker_srcu)) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
-		/*
-		 * Bail out if someone want to register a new shrinker to
-		 * prevent the registration from being stalled for long periods
-		 * by parallel ongoing shrinking.
-		 */
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
 	}
 
-	up_read(&shrinker_rwsem);
-out:
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	cond_resched();
 	return freed;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 3/7] mm: vmscan: make memcg slab shrink lockless
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless Qi Zheng
@ 2023-02-23 13:27 ` Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 4/7] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Like the global slab shrink, since commit 1cd0bd06093c
("rcu: Remove CONFIG_SRCU") made SRCU unconditionally
enabled, it's time to use SRCU to protect readers that
previously held shrinker_rwsem.
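
A condensed sketch of the resulting memcg reader path (simplified
from the shrink_slab_memcg() changes below):

```
	srcu_idx = srcu_read_lock(&shrinker_srcu);
	info = shrinker_info_srcu(memcg, nid);	/* srcu_dereference() of shrinker_info */
	if (unlikely(!info))
		goto unlock;

	for_each_set_bit(i, info->map, info->map_nr_max) {
		/* look up the shrinker by id and call do_shrink_slab() as before */
	}
unlock:
	srcu_read_unlock(&shrinker_srcu, srcu_idx);
```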

We can test with the following script:

```
DIR="/root/shrinker/memcg/mnt"

do_create()
{
        mkdir /sys/fs/cgroup/memory/test
        echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
        for i in `seq 0 $1`;
        do
                mkdir /sys/fs/cgroup/memory/test/$i;
                echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
                mkdir -p $DIR/$i;
        done
}

do_mount()
{
        for i in `seq $1 $2`;
        do
                mount -t tmpfs $i $DIR/$i;
        done
}

do_touch()
{
        for i in `seq $1 $2`;
        do
                echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
                dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
        done
}

do_create 2000
do_mount 0 2000
do_touch 0 1000
```

Before applying:

  46.60%  [kernel]  [k] down_read_trylock
  18.70%  [kernel]  [k] up_read
  15.44%  [kernel]  [k] shrink_slab
   4.37%  [kernel]  [k] _find_next_bit
   2.75%  [kernel]  [k] xa_load
   2.07%  [kernel]  [k] idr_find
   1.73%  [kernel]  [k] do_shrink_slab
   1.42%  [kernel]  [k] shrink_lruvec
   0.74%  [kernel]  [k] shrink_node
   0.60%  [kernel]  [k] list_lru_count_one

After applying:

  19.53%  [kernel]  [k] _find_next_bit
  14.63%  [kernel]  [k] do_shrink_slab
  14.58%  [kernel]  [k] shrink_slab
  11.83%  [kernel]  [k] shrink_lruvec
   9.33%  [kernel]  [k] __blk_flush_plug
   6.67%  [kernel]  [k] mem_cgroup_iter
   3.73%  [kernel]  [k] list_lru_count_one
   2.43%  [kernel]  [k] shrink_node
   1.96%  [kernel]  [k] super_cache_count
   1.78%  [kernel]  [k] __rcu_read_unlock
   1.38%  [kernel]  [k] __srcu_read_lock
   1.30%  [kernel]  [k] xas_descend

We can see that the readers are no longer blocked.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 02987a6f95d1..25a4a660e45f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/srcu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
 static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 						     int nid)
 {
-	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
-					 lockdep_is_held(&shrinker_rwsem));
+	return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
+				      &shrinker_srcu,
+				      lockdep_is_held(&shrinker_rwsem));
+}
+
+static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
+						     int nid)
+{
+	return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
+				&shrinker_srcu);
+}
+
+static void free_shrinker_info_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct shrinker_info, rcu));
 }
 
 static inline bool need_expand(int new_nr_max, int old_nr_max)
@@ -268,7 +282,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 		       defer_size - old_defer_size);
 
 		rcu_assign_pointer(pn->shrinker_info, new);
-		kvfree_rcu(old, rcu);
+		call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
 	}
 
 	return 0;
@@ -357,13 +371,14 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
 		struct shrinker_info *info;
+		int srcu_idx;
 
-		rcu_read_lock();
-		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+		srcu_idx = srcu_read_lock(&shrinker_srcu);
+		info = shrinker_info_srcu(memcg, nid);
 		/* Pairs with smp mb in shrink_slab() */
 		smp_mb__before_atomic();
 		set_bit(shrinker_id, info->map);
-		rcu_read_unlock();
+		srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	}
 }
 
@@ -377,7 +392,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -411,7 +425,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 
-	info = shrinker_info_protected(memcg, nid);
+	info = shrinker_info_srcu(memcg, nid);
 	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
 }
 
@@ -420,7 +434,7 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 
-	info = shrinker_info_protected(memcg, nid);
+	info = shrinker_info_srcu(memcg, nid);
 	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
 }
 
@@ -898,15 +912,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
+	int srcu_idx;
 	int i;
 
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
+	info = shrinker_info_srcu(memcg, nid);
 	if (unlikely(!info))
 		goto unlock;
 
@@ -956,14 +969,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 				set_shrinker_bit(memcg, nid, i);
 		}
 		freed += ret;
-
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	return freed;
 }
 #else /* CONFIG_MEMCG */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 4/7] mm: shrinkers: make count and scan in shrinker debugfs lockless
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
                   ` (2 preceding siblings ...)
  2023-02-23 13:27 ` [PATCH v2 3/7] mm: vmscan: make memcg " Qi Zheng
@ 2023-02-23 13:27 ` Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 5/7] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Like the global and memcg slab shrink, also use SRCU to
make the count and scan operations in shrinker debugfs
lockless.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/shrinker_debug.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 39c3491e28a3..6aa7a7ec69da 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -9,6 +9,7 @@
 /* defined in vmscan.c */
 extern struct rw_semaphore shrinker_rwsem;
 extern struct list_head shrinker_list;
+extern struct srcu_struct shrinker_srcu;
 
 static DEFINE_IDA(shrinker_debugfs_ida);
 static struct dentry *shrinker_debugfs_root;
@@ -49,18 +50,13 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg;
 	unsigned long total;
 	bool memcg_aware;
-	int ret, nid;
+	int ret = 0, nid, srcu_idx;
 
 	count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
 	if (!count_per_node)
 		return -ENOMEM;
 
-	ret = down_read_killable(&shrinker_rwsem);
-	if (ret) {
-		kfree(count_per_node);
-		return ret;
-	}
-	rcu_read_lock();
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
 
 	memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
 
@@ -91,8 +87,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
 		}
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 
-	rcu_read_unlock();
-	up_read(&shrinker_rwsem);
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 
 	kfree(count_per_node);
 	return ret;
@@ -115,9 +110,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 		.gfp_mask = GFP_KERNEL,
 	};
 	struct mem_cgroup *memcg = NULL;
-	int nid;
+	int nid, srcu_idx;
 	char kbuf[72];
-	ssize_t ret;
 
 	read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
 	if (copy_from_user(kbuf, buf, read_len))
@@ -146,11 +140,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 		return -EINVAL;
 	}
 
-	ret = down_read_killable(&shrinker_rwsem);
-	if (ret) {
-		mem_cgroup_put(memcg);
-		return ret;
-	}
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
 
 	sc.nid = nid;
 	sc.memcg = memcg;
@@ -159,7 +149,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 
 	shrinker->scan_objects(shrinker, &sc);
 
-	up_read(&shrinker_rwsem);
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	mem_cgroup_put(memcg);
 
 	return size;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 5/7] mm: vmscan: hold write lock to reparent shrinker nr_deferred
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
                   ` (3 preceding siblings ...)
  2023-02-23 13:27 ` [PATCH v2 4/7] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
@ 2023-02-23 13:27 ` Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 6/7] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

For now, reparent_shrinker_deferred() is the only remaining
holder of the read lock of shrinker_rwsem. It already holds
the global cgroup_mutex, so it will not be called in parallel.

Therefore, in order to convert shrinker_rwsem to shrinker_mutex
later, change it to hold the write lock of shrinker_rwsem
while reparenting.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25a4a660e45f..89602e97583a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -450,7 +450,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/* Prevent from concurrent shrinker_info expand */
-	down_read(&shrinker_rwsem);
+	down_write(&shrinker_rwsem);
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
@@ -459,7 +459,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
 	}
-	up_read(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 6/7] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
                   ` (4 preceding siblings ...)
  2023-02-23 13:27 ` [PATCH v2 5/7] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
@ 2023-02-23 13:27 ` Qi Zheng
  2023-02-23 13:27 ` [PATCH v2 7/7] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
  2023-02-23 18:19 ` [PATCH v2 0/7] make slab shrink lockless Paul E. McKenney
  7 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Now there are no readers of shrinker_rwsem, so
synchronize_shrinkers() does not need to take the write
lock of shrinker_rwsem to wait for all running shrinkers
to complete; synchronize_srcu() is enough.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89602e97583a..d1a95d60d127 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -794,15 +794,11 @@ EXPORT_SYMBOL(unregister_shrinker);
 /**
  * synchronize_shrinkers - Wait for all running shrinkers to complete.
  *
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory.
  */
 void synchronize_shrinkers(void)
 {
-	down_write(&shrinker_rwsem);
-	up_write(&shrinker_rwsem);
 	synchronize_srcu(&shrinker_srcu);
 }
 EXPORT_SYMBOL(synchronize_shrinkers);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 7/7] mm: shrinkers: convert shrinker_rwsem to mutex
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
                   ` (5 preceding siblings ...)
  2023-02-23 13:27 ` [PATCH v2 6/7] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
@ 2023-02-23 13:27 ` Qi Zheng
  2023-02-23 18:19 ` [PATCH v2 0/7] make slab shrink lockless Paul E. McKenney
  7 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-23 13:27 UTC (permalink / raw)
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Now there are no readers of shrinker_rwsem, so we can
simply replace it with a mutex.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/md/dm-cache-metadata.c |  2 +-
 drivers/md/dm-thin-metadata.c  |  2 +-
 fs/super.c                     |  2 +-
 mm/shrinker_debug.c            | 14 +++++++-------
 mm/vmscan.c                    | 34 +++++++++++++++++-----------------
 5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs cmd root_lock).
-	 * - must take shrinker_rwsem without holding cmd->root_lock
+	 * - must take shrinker_mutex without holding cmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs pmd root_lock).
-	 * - must take shrinker_rwsem without holding pmd->root_lock
+	 * - must take shrinker_mutex without holding pmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 84332d5cb817..91a4037b1d95 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
  * If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
 static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 6aa7a7ec69da..b0f6aff372df 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
 #include <linux/memcontrol.h>
 
 /* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
 extern struct list_head shrinker_list;
 extern struct srcu_struct shrinker_srcu;
 
@@ -167,7 +167,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
 	char buf[128];
 	int id;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	/* debugfs isn't initialized yet, add debugfs entries later. */
 	if (!shrinker_debugfs_root)
@@ -210,7 +210,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
 	if (!new)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 
 	old = shrinker->name;
 	shrinker->name = new;
@@ -228,7 +228,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
 			shrinker->debugfs_entry = entry;
 	}
 
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	kfree_const(old);
 
@@ -240,7 +240,7 @@ struct dentry *shrinker_debugfs_remove(struct shrinker *shrinker)
 {
 	struct dentry *entry = shrinker->debugfs_entry;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	kfree_const(shrinker->name);
 	shrinker->name = NULL;
@@ -265,14 +265,14 @@ static int __init shrinker_debugfs_init(void)
 	shrinker_debugfs_root = dentry;
 
 	/* Create debugfs entries for shrinkers registered at boot */
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_for_each_entry(shrinker, &shrinker_list, list)
 		if (!shrinker->debugfs_entry) {
 			ret = shrinker_debugfs_add(shrinker);
 			if (ret)
 				break;
 		}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	return ret;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d1a95d60d127..27ef9946ae8a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
 #include <linux/cpuset.h>
 #include <linux/compaction.h>
 #include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -202,7 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 }
 
 LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);
 DEFINE_SRCU(shrinker_srcu);
 
 #ifdef CONFIG_MEMCG
@@ -224,7 +224,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 {
 	return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
 				      &shrinker_srcu,
-				      lockdep_is_held(&shrinker_rwsem));
+				      lockdep_is_held(&shrinker_mutex));
 }
 
 static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
@@ -308,7 +308,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 	int nid, size, ret = 0;
 	int map_size, defer_size = 0;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	map_size = shrinker_map_size(shrinker_nr_max);
 	defer_size = shrinker_defer_size(shrinker_nr_max);
 	size = map_size + defer_size;
@@ -324,7 +324,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 		info->map_nr_max = shrinker_nr_max;
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
 	}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	return ret;
 }
@@ -343,7 +343,7 @@ static int expand_shrinker_info(int new_id)
 	if (!root_mem_cgroup)
 		goto out;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	map_size = shrinker_map_size(new_nr_max);
 	defer_size = shrinker_defer_size(new_nr_max);
@@ -391,7 +391,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	if (mem_cgroup_disabled())
 		return -ENOSYS;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -405,7 +405,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	shrinker->id = id;
 	ret = 0;
 unlock:
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 	return ret;
 }
 
@@ -415,7 +415,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
 
 	BUG_ON(id < 0);
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	idr_remove(&shrinker_idr, id);
 }
@@ -450,7 +450,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/* Prevent from concurrent shrinker_info expand */
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
@@ -459,7 +459,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
 	}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
@@ -708,9 +708,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 	shrinker->name = NULL;
 #endif
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		down_write(&shrinker_rwsem);
+		mutex_lock(&shrinker_mutex);
 		unregister_memcg_shrinker(shrinker);
-		up_write(&shrinker_rwsem);
+		mutex_unlock(&shrinker_mutex);
 		return;
 	}
 
@@ -720,11 +720,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 
 void register_shrinker_prepared(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 }
 
 static int __register_shrinker(struct shrinker *shrinker)
@@ -774,13 +774,13 @@ void unregister_shrinker(struct shrinker *shrinker)
 	if (!(shrinker->flags & SHRINKER_REGISTERED))
 		return;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_del_rcu(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	synchronize_srcu(&shrinker_srcu);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 13:27 ` [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless Qi Zheng
@ 2023-02-23 15:26   ` Rafael Aquini
  2023-02-23 15:37     ` Rafael Aquini
  2023-02-23 18:24   ` Sultan Alsawaf
  1 sibling, 1 reply; 33+ messages in thread
From: Rafael Aquini @ 2023-02-23 15:26 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, sultan, dave, penguin-kernel,
	paulmck, linux-mm, linux-kernel

On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global lock in shrinkers subsystem,
> it is easy to cause blocking in the following cases:
> 
> a. the write lock of shrinker_rwsem was held for too long.
>    For example, there are many memcgs in the system, which
>    causes some paths to hold locks and traverse it for too
>    long. (e.g. expand_shrinker_info())
> b. the read lock of shrinker_rwsem was held for too long,
>    and a writer came at this time. Then this writer will be
>    forced to wait and block all subsequent readers.
>    For example:
>    - be scheduled when the read lock of shrinker_rwsem is
>      held in do_shrink_slab()
>    - some shrinker are blocked for too long. Like the case
>      mentioned in the patchset[1].
> 
> Therefore, many times in history ([2],[3],[4],[5]), some
> people wanted to replace shrinker_rwsem reader with SRCU,
> but they all gave up because SRCU was not unconditionally
> enabled.
> 
> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> the SRCU is unconditionally enabled. So it's time to use
> SRCU to protect readers who previously held shrinker_rwsem.
> 
> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/vmscan.c | 27 +++++++++++----------------
>  1 file changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9f895ca6216c..02987a6f95d1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>  
>  LIST_HEAD(shrinker_list);
>  DECLARE_RWSEM(shrinker_rwsem);
> +DEFINE_SRCU(shrinker_srcu);
>  
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>  void register_shrinker_prepared(struct shrinker *shrinker)
>  {
>  	down_write(&shrinker_rwsem);

I think you could revert the rwsem back to a simple mutex, now.

> -	list_add_tail(&shrinker->list, &shrinker_list);
> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>  	shrinker->flags |= SHRINKER_REGISTERED;
>  	shrinker_debugfs_add(shrinker);
>  	up_write(&shrinker_rwsem);
> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>  		return;
>  
>  	down_write(&shrinker_rwsem);
> -	list_del(&shrinker->list);
> +	list_del_rcu(&shrinker->list);
>  	shrinker->flags &= ~SHRINKER_REGISTERED;
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		unregister_memcg_shrinker(shrinker);
>  	debugfs_entry = shrinker_debugfs_remove(shrinker);
>  	up_write(&shrinker_rwsem);
>  
> +	synchronize_srcu(&shrinker_srcu);
> +
>  	debugfs_remove_recursive(debugfs_entry);
>  
>  	kfree(shrinker->nr_deferred);
> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>  {
>  	down_write(&shrinker_rwsem);
>  	up_write(&shrinker_rwsem);
> +	synchronize_srcu(&shrinker_srcu);
>  }
>  EXPORT_SYMBOL(synchronize_shrinkers);
>  
> @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  {
>  	unsigned long ret, freed = 0;
>  	struct shrinker *shrinker;
> +	int srcu_idx;
>  
>  	/*
>  	 * The root memcg might be allocated even though memcg is disabled
> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>  
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> +				 srcu_read_lock_held(&shrinker_srcu)) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> -		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> -		 */
> -		if (rwsem_is_contended(&shrinker_rwsem)) {
> -			freed = freed ? : 1;
> -			break;
> -		}
>  	}
>  
> -	up_read(&shrinker_rwsem);
> -out:
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  	cond_resched();
>  	return freed;
>  }
> -- 
> 2.20.1
> 
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 15:26   ` Rafael Aquini
@ 2023-02-23 15:37     ` Rafael Aquini
  2023-02-24  4:09       ` Qi Zheng
  0 siblings, 1 reply; 33+ messages in thread
From: Rafael Aquini @ 2023-02-23 15:37 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, sultan, dave, penguin-kernel,
	paulmck, linux-mm, linux-kernel

On Thu, Feb 23, 2023 at 10:26:45AM -0500, Rafael Aquini wrote:
> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
> > The shrinker_rwsem is a global lock in shrinkers subsystem,
> > it is easy to cause blocking in the following cases:
> > 
> > a. the write lock of shrinker_rwsem was held for too long.
> >    For example, there are many memcgs in the system, which
> >    causes some paths to hold locks and traverse it for too
> >    long. (e.g. expand_shrinker_info())
> > b. the read lock of shrinker_rwsem was held for too long,
> >    and a writer came at this time. Then this writer will be
> >    forced to wait and block all subsequent readers.
> >    For example:
> >    - be scheduled when the read lock of shrinker_rwsem is
> >      held in do_shrink_slab()
> >    - some shrinker are blocked for too long. Like the case
> >      mentioned in the patchset[1].
> > 
> > Therefore, many times in history ([2],[3],[4],[5]), some
> > people wanted to replace shrinker_rwsem reader with SRCU,
> > but they all gave up because SRCU was not unconditionally
> > enabled.
> > 
> > But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> > the SRCU is unconditionally enabled. So it's time to use
> > SRCU to protect readers who previously held shrinker_rwsem.
> > 
> > [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> > [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> > [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> > [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> > [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> > 
> > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > ---
> >  mm/vmscan.c | 27 +++++++++++----------------
> >  1 file changed, 11 insertions(+), 16 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9f895ca6216c..02987a6f95d1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
> >  
> >  LIST_HEAD(shrinker_list);
> >  DECLARE_RWSEM(shrinker_rwsem);
> > +DEFINE_SRCU(shrinker_srcu);
> >  
> >  #ifdef CONFIG_MEMCG
> >  static int shrinker_nr_max;
> > @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
> >  void register_shrinker_prepared(struct shrinker *shrinker)
> >  {
> >  	down_write(&shrinker_rwsem);
> 
> I think you could revert the rwsem back to a simple mutex, now.
>

NVM, that's exactly what patch 7 does. :)

 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/7] make slab shrink lockless
  2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
                   ` (6 preceding siblings ...)
  2023-02-23 13:27 ` [PATCH v2 7/7] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
@ 2023-02-23 18:19 ` Paul E. McKenney
  2023-02-24  4:08   ` Qi Zheng
  7 siblings, 1 reply; 33+ messages in thread
From: Paul E. McKenney @ 2023-02-23 18:19 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, sultan, dave, penguin-kernel,
	linux-mm, linux-kernel

On Thu, Feb 23, 2023 at 09:27:18PM +0800, Qi Zheng wrote:
> Hi all,
> 
> This patch series aims to make slab shrink lockless.
> 
> 1. Background
> =============
> 
> On our servers, we often find the following system cpu hotspots:
> 
>   44.16%  [kernel]  [k] down_read_trylock
>   14.12%  [kernel]  [k] up_read
>   13.43%  [kernel]  [k] shrink_slab
>    5.25%  [kernel]  [k] count_shadow_nodes
>    3.42%  [kernel]  [k] idr_find
> 
> Then we used bpftrace to capture its calltrace as follows:
> 
> @[
>     down_read_trylock+5
>     shrink_slab+292
>     shrink_node+640
>     do_try_to_free_pages+211
>     try_to_free_mem_cgroup_pages+266
>     try_charge_memcg+386
>     charge_memcg+51
>     __mem_cgroup_charge+44
>     __handle_mm_fault+1416
>     handle_mm_fault+260
>     do_user_addr_fault+459
>     exc_page_fault+104
>     asm_exc_page_fault+38
>     clear_user_rep_good+18
>     read_zero+100
>     vfs_read+176
>     ksys_read+93
>     do_syscall_64+62
>     entry_SYSCALL_64_after_hwframe+114
> ]: 1868979
> 
> It is easy to see that this is caused by the frequent failure to obtain the
> read lock of shrinker_rwsem when reclaiming slab memory.
> 
> Currently, the shrinker_rwsem is a global lock. And the following cases may
> cause the above system cpu hotspots:
> 
> a. the write lock of shrinker_rwsem was held for too long. For example, there
>    are many memcgs in the system, which causes some paths to hold locks and
>    traverse it for too long. (e.g. expand_shrinker_info())
> b. the read lock of shrinker_rwsem was held for too long, and a writer came at
>    this time. Then this writer will be forced to wait and block all subsequent
>    readers.
>    For example:
>    - be scheduled when the read lock of shrinker_rwsem is held in
>      do_shrink_slab()
>    - some shrinker are blocked for too long. Like the case mentioned in the
>      patchset[1].
> 
> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> 
> And all the down_read_trylock() hotspots caused by the above cases can be
> solved by replacing the shrinker_rwsem trylocks with SRCU.

Glad to see that making SRCU unconditional was helpful!  And I do very
much like the idea of the shrinker running better!

The main thing that enabled unconditional SRCU was the code added in
v5.19 to dynamically allocate SRCU's srcu_node combining tree.  This is
important for a number of Linux distributions that have NR_CPUS up in the
thousands, for which this combining tree is quite large.  In v5.19 and
later, srcu_struct structures without frequent call_srcu() invocations
never allocate that combining tree.  Even srcu_struct structures that
have enough call_srcu() activity to cause the lock contention that in
turn forces the combining tree to be allocated, that combining tree
is sized for the actual number of CPUs present, which is usually way
smaller than NR_CPUS.

So if you are going to backport this back past v5.19, you might also
need those SRCU changes.  Or not, depending on how much memory your
systems are equipped with.  ;-)

							Thanx, Paul

> 2. Survey
> =========
> 
> Before doing the code implementation, I found that there were many similar
> submissions in the community:
> 
> a. Davidlohr Bueso submitted a patch in 2015.
>    Subject: [PATCH -next v2] mm: srcu-ify shrinkers
>    Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>    Result: It was finally merged into the linux-next branch, but failed on arm
>            allnoconfig (without CONFIG_SRCU)
> 
> b. Tetsuo Handa submitted a patchset in 2017.
>    Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
>    Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>    Result: Finally chose to use the current simple way (break when
>            rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
>            but SRCU was not unconditionally enabled at the time.
> 
> c. Kirill Tkhai submitted a patchset in 2018.
>    Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
>    Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>    Result: At that time, SRCU was not unconditionally enabled, and there were
>            some objections to enabling SRCU. Later, because Kirill's focus was
>            moved to other things, this patchset was not continued to be updated.
> 
> d. Sultan Alsawaf submitted a patch in 2021.
>    Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
>    Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>    Result: Rejected because SRCU was not unconditionally enabled.
> 
> We can find that almost all these historical commits were abandoned because SRCU
> was not unconditionally enabled. But now SRCU has been unconditionally enable
> by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
> with SRCU.
> 
> [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
> 
> 3. Reproduction and testing
> ===========================
> 
> We can reproduce the down_read_trylock() hotspot through the following script:
> 
> ```
> #!/bin/bash
> DIR="/root/shrinker/memcg/mnt"
> 
> do_create()
> {
>         mkdir /sys/fs/cgroup/memory/test
>         echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>         for i in `seq 0 $1`;
>         do
>                 mkdir /sys/fs/cgroup/memory/test/$i;
>                 echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>                 mkdir -p $DIR/$i;
>         done
> }
> 
> do_mount()
> {
>         for i in `seq $1 $2`;
>         do
>                 mount -t tmpfs $i $DIR/$i;
>         done
> }
> 
> do_touch()
> {
>         for i in `seq $1 $2`;
>         do
>                 echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>                 dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>         done
> }
> 
> do_create 2000
> do_mount 0 2000
> do_touch 0 1000
> ```
> 
> Save the above script and execute it, we can get the following perf hotspots:
> 
>   46.60%  [kernel]  [k] down_read_trylock
>   18.70%  [kernel]  [k] up_read
>   15.44%  [kernel]  [k] shrink_slab
>    4.37%  [kernel]  [k] _find_next_bit
>    2.75%  [kernel]  [k] xa_load
>    2.07%  [kernel]  [k] idr_find
>    1.73%  [kernel]  [k] do_shrink_slab
>    1.42%  [kernel]  [k] shrink_lruvec
>    0.74%  [kernel]  [k] shrink_node
>    0.60%  [kernel]  [k] list_lru_count_one
> 
> After applying this patchset, the hotspot becomes as follows:
> 
>   19.53%  [kernel]  [k] _find_next_bit
>   14.63%  [kernel]  [k] do_shrink_slab
>   14.58%  [kernel]  [k] shrink_slab
>   11.83%  [kernel]  [k] shrink_lruvec
>    9.33%  [kernel]  [k] __blk_flush_plug
>    6.67%  [kernel]  [k] mem_cgroup_iter
>    3.73%  [kernel]  [k] list_lru_count_one
>    2.43%  [kernel]  [k] shrink_node
>    1.96%  [kernel]  [k] super_cache_count
>    1.78%  [kernel]  [k] __rcu_read_unlock
>    1.38%  [kernel]  [k] __srcu_read_lock
>    1.30%  [kernel]  [k] xas_descend
> 
> We can see that slab reclaim is no longer blocked by the shrinker_rwsem trylock,
> which makes slab reclaim lockless.
> 
> This series is based on next-20230217.
> 
> Comments and suggestions are welcome.
> 
> Thanks,
> Qi.
> 
> Changelog in v1 -> v2:
>  - add a map_nr_max field to shrinker_info (suggested by Kirill)
>  - use shrinker_mutex in reparent_shrinker_deferred() (pointed by Kirill)
> 
> Qi Zheng (7):
>   mm: vmscan: add a map_nr_max field to shrinker_info
>   mm: vmscan: make global slab shrink lockless
>   mm: vmscan: make memcg slab shrink lockless
>   mm: shrinkers: make count and scan in shrinker debugfs lockless
>   mm: vmscan: hold write lock to reparent shrinker nr_deferred
>   mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
>   mm: shrinkers: convert shrinker_rwsem to mutex
> 
>  drivers/md/dm-cache-metadata.c |   2 +-
>  drivers/md/dm-thin-metadata.c  |   2 +-
>  fs/super.c                     |   2 +-
>  include/linux/memcontrol.h     |   1 +
>  mm/shrinker_debug.c            |  38 ++++-----
>  mm/vmscan.c                    | 142 +++++++++++++++++----------------
>  6 files changed, 92 insertions(+), 95 deletions(-)
> 
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 13:27 ` [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless Qi Zheng
  2023-02-23 15:26   ` Rafael Aquini
@ 2023-02-23 18:24   ` Sultan Alsawaf
  2023-02-23 18:39     ` Paul E. McKenney
  2023-02-24  4:00     ` Qi Zheng
  1 sibling, 2 replies; 33+ messages in thread
From: Sultan Alsawaf @ 2023-02-23 18:24 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, dave, penguin-kernel, paulmck,
	linux-mm, linux-kernel

On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global lock in shrinkers subsystem,
> it is easy to cause blocking in the following cases:
> 
> a. the write lock of shrinker_rwsem was held for too long.
>    For example, there are many memcgs in the system, which
>    causes some paths to hold locks and traverse it for too
>    long. (e.g. expand_shrinker_info())
> b. the read lock of shrinker_rwsem was held for too long,
>    and a writer came at this time. Then this writer will be
>    forced to wait and block all subsequent readers.
>    For example:
>    - be scheduled when the read lock of shrinker_rwsem is
>      held in do_shrink_slab()
>    - some shrinker are blocked for too long. Like the case
>      mentioned in the patchset[1].
> 
> Therefore, many times in history ([2],[3],[4],[5]), some
> people wanted to replace shrinker_rwsem reader with SRCU,
> but they all gave up because SRCU was not unconditionally
> enabled.
> 
> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> the SRCU is unconditionally enabled. So it's time to use
> SRCU to protect readers who previously held shrinker_rwsem.
> 
> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/vmscan.c | 27 +++++++++++----------------
>  1 file changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9f895ca6216c..02987a6f95d1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>  
>  LIST_HEAD(shrinker_list);
>  DECLARE_RWSEM(shrinker_rwsem);
> +DEFINE_SRCU(shrinker_srcu);
>  
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>  void register_shrinker_prepared(struct shrinker *shrinker)
>  {
>  	down_write(&shrinker_rwsem);
> -	list_add_tail(&shrinker->list, &shrinker_list);
> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>  	shrinker->flags |= SHRINKER_REGISTERED;
>  	shrinker_debugfs_add(shrinker);
>  	up_write(&shrinker_rwsem);
> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>  		return;
>  
>  	down_write(&shrinker_rwsem);
> -	list_del(&shrinker->list);
> +	list_del_rcu(&shrinker->list);
>  	shrinker->flags &= ~SHRINKER_REGISTERED;
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		unregister_memcg_shrinker(shrinker);
>  	debugfs_entry = shrinker_debugfs_remove(shrinker);
>  	up_write(&shrinker_rwsem);
>  
> +	synchronize_srcu(&shrinker_srcu);
> +
>  	debugfs_remove_recursive(debugfs_entry);
>  
>  	kfree(shrinker->nr_deferred);
> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>  {
>  	down_write(&shrinker_rwsem);
>  	up_write(&shrinker_rwsem);
> +	synchronize_srcu(&shrinker_srcu);
>  }
>  EXPORT_SYMBOL(synchronize_shrinkers);
>  
> @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  {
>  	unsigned long ret, freed = 0;
>  	struct shrinker *shrinker;
> +	int srcu_idx;
>  
>  	/*
>  	 * The root memcg might be allocated even though memcg is disabled
> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>  
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> +				 srcu_read_lock_held(&shrinker_srcu)) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> -		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> -		 */
> -		if (rwsem_is_contended(&shrinker_rwsem)) {
> -			freed = freed ? : 1;
> -			break;
> -		}
>  	}
>  
> -	up_read(&shrinker_rwsem);
> -out:
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  	cond_resched();
>  	return freed;
>  }
> -- 
> 2.20.1
> 
> 

Hi Qi,

A different problem I realized after my old attempt to use SRCU was that the
unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
call. Both register_shrinker() *and* unregister_shrinker() are called frequently
these days, and SRCU is too unfair to the unregister path IMO.

Although I never got around to submitting it, I made a non-SRCU solution [1]
that uses fine-grained locking instead, which is fair to both the register path
and unregister path. (The patch I've linked is a version of this adapted to an
older 4.14 kernel FYI, but it can be reworked for the current kernel.)

What do you think about the fine-grained locking approach?

Thanks,
Sultan

[1] https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
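
To make the general idea concrete, below is a minimal sketch of per-shrinker
pinning. It is not the patch linked above; the refcount/done fields, the helper
names, and the simplified unregister path (memcg and debugfs teardown omitted)
are invented purely for illustration:

```c
/*
 * Sketch only: assumes struct shrinker gains a refcount_t "refcount" and a
 * struct completion "done", initialized in register_shrinker_prepared() with
 * refcount_set(&shrinker->refcount, 1) and init_completion(&shrinker->done).
 */
static bool shrinker_try_pin(struct shrinker *shrinker)
{
	/* Fails once unregister_shrinker() has dropped the initial reference. */
	return refcount_inc_not_zero(&shrinker->refcount);
}

static void shrinker_unpin(struct shrinker *shrinker)
{
	if (refcount_dec_and_test(&shrinker->refcount))
		complete(&shrinker->done);
}

void unregister_shrinker(struct shrinker *shrinker)
{
	down_write(&shrinker_rwsem);
	list_del_rcu(&shrinker->list);
	shrinker->flags &= ~SHRINKER_REGISTERED;
	up_write(&shrinker_rwsem);

	/*
	 * Wait only for in-flight users of *this* shrinker, instead of a
	 * global grace period covering every shrinker in the system.
	 */
	if (!refcount_dec_and_test(&shrinker->refcount))
		wait_for_completion(&shrinker->done);

	kfree(shrinker->nr_deferred);
	shrinker->nr_deferred = NULL;
}
```

shrink_slab() would then walk the list under RCU and wrap each do_shrink_slab()
call in shrinker_try_pin()/shrinker_unpin(); keeping the list cursor valid while
the RCU read lock is dropped around the sleeping callback is the part the real
patch has to get right.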

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 18:24   ` Sultan Alsawaf
@ 2023-02-23 18:39     ` Paul E. McKenney
  2023-02-23 19:18       ` Sultan Alsawaf
  2023-02-24  4:00     ` Qi Zheng
  1 sibling, 1 reply; 33+ messages in thread
From: Paul E. McKenney @ 2023-02-23 18:39 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, dave, penguin-kernel, linux-mm,
	linux-kernel

On Thu, Feb 23, 2023 at 10:24:47AM -0800, Sultan Alsawaf wrote:
> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
> > The shrinker_rwsem is a global lock in shrinkers subsystem,
> > it is easy to cause blocking in the following cases:
> > 
> > a. the write lock of shrinker_rwsem was held for too long.
> >    For example, there are many memcgs in the system, which
> >    causes some paths to hold locks and traverse it for too
> >    long. (e.g. expand_shrinker_info())
> > b. the read lock of shrinker_rwsem was held for too long,
> >    and a writer came at this time. Then this writer will be
> >    forced to wait and block all subsequent readers.
> >    For example:
> >    - be scheduled when the read lock of shrinker_rwsem is
> >      held in do_shrink_slab()
> >    - some shrinker are blocked for too long. Like the case
> >      mentioned in the patchset[1].
> > 
> > Therefore, many times in history ([2],[3],[4],[5]), some
> > people wanted to replace shrinker_rwsem reader with SRCU,
> > but they all gave up because SRCU was not unconditionally
> > enabled.
> > 
> > But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> > the SRCU is unconditionally enabled. So it's time to use
> > SRCU to protect readers who previously held shrinker_rwsem.
> > 
> > [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> > [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> > [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> > [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> > [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> > 
> > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > ---
> >  mm/vmscan.c | 27 +++++++++++----------------
> >  1 file changed, 11 insertions(+), 16 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9f895ca6216c..02987a6f95d1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
> >  
> >  LIST_HEAD(shrinker_list);
> >  DECLARE_RWSEM(shrinker_rwsem);
> > +DEFINE_SRCU(shrinker_srcu);
> >  
> >  #ifdef CONFIG_MEMCG
> >  static int shrinker_nr_max;
> > @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
> >  void register_shrinker_prepared(struct shrinker *shrinker)
> >  {
> >  	down_write(&shrinker_rwsem);
> > -	list_add_tail(&shrinker->list, &shrinker_list);
> > +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
> >  	shrinker->flags |= SHRINKER_REGISTERED;
> >  	shrinker_debugfs_add(shrinker);
> >  	up_write(&shrinker_rwsem);
> > @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
> >  		return;
> >  
> >  	down_write(&shrinker_rwsem);
> > -	list_del(&shrinker->list);
> > +	list_del_rcu(&shrinker->list);
> >  	shrinker->flags &= ~SHRINKER_REGISTERED;
> >  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> >  		unregister_memcg_shrinker(shrinker);
> >  	debugfs_entry = shrinker_debugfs_remove(shrinker);
> >  	up_write(&shrinker_rwsem);
> >  
> > +	synchronize_srcu(&shrinker_srcu);
> > +
> >  	debugfs_remove_recursive(debugfs_entry);
> >  
> >  	kfree(shrinker->nr_deferred);
> > @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
> >  {
> >  	down_write(&shrinker_rwsem);
> >  	up_write(&shrinker_rwsem);
> > +	synchronize_srcu(&shrinker_srcu);
> >  }
> >  EXPORT_SYMBOL(synchronize_shrinkers);
> >  
> > @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >  {
> >  	unsigned long ret, freed = 0;
> >  	struct shrinker *shrinker;
> > +	int srcu_idx;
> >  
> >  	/*
> >  	 * The root memcg might be allocated even though memcg is disabled
> > @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> >  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> >  
> > -	if (!down_read_trylock(&shrinker_rwsem))
> > -		goto out;
> > +	srcu_idx = srcu_read_lock(&shrinker_srcu);
> >  
> > -	list_for_each_entry(shrinker, &shrinker_list, list) {
> > +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> > +				 srcu_read_lock_held(&shrinker_srcu)) {
> >  		struct shrink_control sc = {
> >  			.gfp_mask = gfp_mask,
> >  			.nid = nid,
> > @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >  		if (ret == SHRINK_EMPTY)
> >  			ret = 0;
> >  		freed += ret;
> > -		/*
> > -		 * Bail out if someone want to register a new shrinker to
> > -		 * prevent the registration from being stalled for long periods
> > -		 * by parallel ongoing shrinking.
> > -		 */
> > -		if (rwsem_is_contended(&shrinker_rwsem)) {
> > -			freed = freed ? : 1;
> > -			break;
> > -		}
> >  	}
> >  
> > -	up_read(&shrinker_rwsem);
> > -out:
> > +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> >  	cond_resched();
> >  	return freed;
> >  }
> > -- 
> > 2.20.1
> > 
> > 
> 
> Hi Qi,
> 
> A different problem I realized after my old attempt to use SRCU was that the
> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
> these days, and SRCU is too unfair to the unregister path IMO.
> 
> Although I never got around to submitting it, I made a non-SRCU solution [1]
> that uses fine-grained locking instead, which is fair to both the register path
> and unregister path. (The patch I've linked is a version of this adapted to an
> older 4.14 kernel FYI, but it can be reworked for the current kernel.)
> 
> What do you think about the fine-grained locking approach?

Another approach is to use synchronize_srcu_expedited(), which avoids
the sleeps that are otherwise used to encourage sharing of grace periods
among concurrent requests.  It might be possible to use call_srcu(),
but I don't claim to know the shrinker code well enough to say for sure.
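
To make that concrete, on top of this patch the expedited variant would be a
one-line change in unregister_shrinker(); a sketch of just that hunk (untested
here, the rest of the function stays as posted above):

```c
	/* ... list_del_rcu(), flag clearing, etc. as in the patch above ... */
	up_write(&shrinker_rwsem);

	/*
	 * Expedited grace period: readers are still only delimited by
	 * srcu_read_lock()/srcu_read_unlock(), but the updater no longer
	 * sleeps waiting to share a grace period with other updaters.
	 */
	synchronize_srcu_expedited(&shrinker_srcu);

	debugfs_remove_recursive(debugfs_entry);
```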

							Thanx, Paul

> Thanks,
> Sultan
> 
> [1] https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 18:39     ` Paul E. McKenney
@ 2023-02-23 19:18       ` Sultan Alsawaf
  0 siblings, 0 replies; 33+ messages in thread
From: Sultan Alsawaf @ 2023-02-23 19:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, dave, penguin-kernel, linux-mm,
	linux-kernel

On Thu, Feb 23, 2023 at 10:39:17AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 23, 2023 at 10:24:47AM -0800, Sultan Alsawaf wrote:
> > On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
> > > The shrinker_rwsem is a global lock in shrinkers subsystem,
> > > it is easy to cause blocking in the following cases:
> > > 
> > > a. the write lock of shrinker_rwsem was held for too long.
> > >    For example, there are many memcgs in the system, which
> > >    causes some paths to hold locks and traverse it for too
> > >    long. (e.g. expand_shrinker_info())
> > > b. the read lock of shrinker_rwsem was held for too long,
> > >    and a writer came at this time. Then this writer will be
> > >    forced to wait and block all subsequent readers.
> > >    For example:
> > >    - be scheduled when the read lock of shrinker_rwsem is
> > >      held in do_shrink_slab()
> > >    - some shrinker are blocked for too long. Like the case
> > >      mentioned in the patchset[1].
> > > 
> > > Therefore, many times in history ([2],[3],[4],[5]), some
> > > people wanted to replace shrinker_rwsem reader with SRCU,
> > > but they all gave up because SRCU was not unconditionally
> > > enabled.
> > > 
> > > But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> > > the SRCU is unconditionally enabled. So it's time to use
> > > SRCU to protect readers who previously held shrinker_rwsem.
> > > 
> > > [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> > > [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> > > [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> > > [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> > > [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> > > 
> > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > ---
> > >  mm/vmscan.c | 27 +++++++++++----------------
> > >  1 file changed, 11 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 9f895ca6216c..02987a6f95d1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
> > >  
> > >  LIST_HEAD(shrinker_list);
> > >  DECLARE_RWSEM(shrinker_rwsem);
> > > +DEFINE_SRCU(shrinker_srcu);
> > >  
> > >  #ifdef CONFIG_MEMCG
> > >  static int shrinker_nr_max;
> > > @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
> > >  void register_shrinker_prepared(struct shrinker *shrinker)
> > >  {
> > >  	down_write(&shrinker_rwsem);
> > > -	list_add_tail(&shrinker->list, &shrinker_list);
> > > +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > >  	shrinker->flags |= SHRINKER_REGISTERED;
> > >  	shrinker_debugfs_add(shrinker);
> > >  	up_write(&shrinker_rwsem);
> > > @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
> > >  		return;
> > >  
> > >  	down_write(&shrinker_rwsem);
> > > -	list_del(&shrinker->list);
> > > +	list_del_rcu(&shrinker->list);
> > >  	shrinker->flags &= ~SHRINKER_REGISTERED;
> > >  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > >  		unregister_memcg_shrinker(shrinker);
> > >  	debugfs_entry = shrinker_debugfs_remove(shrinker);
> > >  	up_write(&shrinker_rwsem);
> > >  
> > > +	synchronize_srcu(&shrinker_srcu);
> > > +
> > >  	debugfs_remove_recursive(debugfs_entry);
> > >  
> > >  	kfree(shrinker->nr_deferred);
> > > @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
> > >  {
> > >  	down_write(&shrinker_rwsem);
> > >  	up_write(&shrinker_rwsem);
> > > +	synchronize_srcu(&shrinker_srcu);
> > >  }
> > >  EXPORT_SYMBOL(synchronize_shrinkers);
> > >  
> > > @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >  {
> > >  	unsigned long ret, freed = 0;
> > >  	struct shrinker *shrinker;
> > > +	int srcu_idx;
> > >  
> > >  	/*
> > >  	 * The root memcg might be allocated even though memcg is disabled
> > > @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > >  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > >  
> > > -	if (!down_read_trylock(&shrinker_rwsem))
> > > -		goto out;
> > > +	srcu_idx = srcu_read_lock(&shrinker_srcu);
> > >  
> > > -	list_for_each_entry(shrinker, &shrinker_list, list) {
> > > +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> > > +				 srcu_read_lock_held(&shrinker_srcu)) {
> > >  		struct shrink_control sc = {
> > >  			.gfp_mask = gfp_mask,
> > >  			.nid = nid,
> > > @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >  		if (ret == SHRINK_EMPTY)
> > >  			ret = 0;
> > >  		freed += ret;
> > > -		/*
> > > -		 * Bail out if someone want to register a new shrinker to
> > > -		 * prevent the registration from being stalled for long periods
> > > -		 * by parallel ongoing shrinking.
> > > -		 */
> > > -		if (rwsem_is_contended(&shrinker_rwsem)) {
> > > -			freed = freed ? : 1;
> > > -			break;
> > > -		}
> > >  	}
> > >  
> > > -	up_read(&shrinker_rwsem);
> > > -out:
> > > +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> > >  	cond_resched();
> > >  	return freed;
> > >  }
> > > -- 
> > > 2.20.1
> > > 
> > > 
> > 
> > Hi Qi,
> > 
> > A different problem I realized after my old attempt to use SRCU was that the
> > unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
> > call. Both register_shrinker() *and* unregister_shrinker() are called frequently
> > these days, and SRCU is too unfair to the unregister path IMO.
> > 
> > Although I never got around to submitting it, I made a non-SRCU solution [1]
> > that uses fine-grained locking instead, which is fair to both the register path
> > and unregister path. (The patch I've linked is a version of this adapted to an
> > older 4.14 kernel FYI, but it can be reworked for the current kernel.)
> > 
> > What do you think about the fine-grained locking approach?
> 
> Another approach is to use synchronize_srcu_expedited(), which avoids
> the sleeps that are otherwise used to encourage sharing of grace periods
> among concurrent requests.  It might be possible to use call_srcu(),
> but I don't claim to know the shrinker code well enough to say for sure.

Hi Paul,

I don't believe call_srcu() can be used since shrinker users need to be
guaranteed that their shrinkers aren't in use after unregister_shrinker().
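
The usual caller pattern is what makes that guarantee load-bearing. As a
made-up illustration (struct foo_device and the helpers below are invented,
not taken from any real driver):

```c
struct foo_device {
	struct shrinker shrinker;	/* scans foo->cache below */
	struct list_head cache;
};

static void foo_destroy(struct foo_device *foo)
{
	/*
	 * Must be synchronous: once this returns, no CPU may still be
	 * running foo's ->count_objects()/->scan_objects(). A call_srcu()
	 * based unregister would only promise that some callback runs
	 * later, which is not enough to let us free foo here.
	 */
	unregister_shrinker(&foo->shrinker);

	foo_free_cached_objects(&foo->cache);	/* invented helper */
	kfree(foo);
}
```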

Using synchronize_srcu_expedited() sounds like it'd definitely help, though
unregistering a single shrinker would ultimately still require waiting for all
currently running shrinkers to finish before the grace period can elapse. There
can be many shrinkers, and they aren't necessarily fast.

Thanks,
Sultan

> 
> 							Thanx, Paul
> 
> > Thanks,
> > Sultan
> > 
> > [1] https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 18:24   ` Sultan Alsawaf
  2023-02-23 18:39     ` Paul E. McKenney
@ 2023-02-24  4:00     ` Qi Zheng
  2023-02-24  4:16       ` Qi Zheng
                         ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-24  4:00 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, dave, penguin-kernel, paulmck,
	linux-mm, linux-kernel



On 2023/2/24 02:24, Sultan Alsawaf wrote:
> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>> it is easy to cause blocking in the following cases:
>>
>> a. the write lock of shrinker_rwsem was held for too long.
>>     For example, there are many memcgs in the system, which
>>     causes some paths to hold locks and traverse it for too
>>     long. (e.g. expand_shrinker_info())
>> b. the read lock of shrinker_rwsem was held for too long,
>>     and a writer came at this time. Then this writer will be
>>     forced to wait and block all subsequent readers.
>>     For example:
>>     - be scheduled when the read lock of shrinker_rwsem is
>>       held in do_shrink_slab()
>>     - some shrinker are blocked for too long. Like the case
>>       mentioned in the patchset[1].
>>
>> Therefore, many times in history ([2],[3],[4],[5]), some
>> people wanted to replace shrinker_rwsem reader with SRCU,
>> but they all gave up because SRCU was not unconditionally
>> enabled.
>>
>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>> the SRCU is unconditionally enabled. So it's time to use
>> SRCU to protect readers who previously held shrinker_rwsem.
>>
>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/vmscan.c | 27 +++++++++++----------------
>>   1 file changed, 11 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9f895ca6216c..02987a6f95d1 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>   
>>   LIST_HEAD(shrinker_list);
>>   DECLARE_RWSEM(shrinker_rwsem);
>> +DEFINE_SRCU(shrinker_srcu);
>>   
>>   #ifdef CONFIG_MEMCG
>>   static int shrinker_nr_max;
>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>   void register_shrinker_prepared(struct shrinker *shrinker)
>>   {
>>   	down_write(&shrinker_rwsem);
>> -	list_add_tail(&shrinker->list, &shrinker_list);
>> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>   	shrinker->flags |= SHRINKER_REGISTERED;
>>   	shrinker_debugfs_add(shrinker);
>>   	up_write(&shrinker_rwsem);
>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>   		return;
>>   
>>   	down_write(&shrinker_rwsem);
>> -	list_del(&shrinker->list);
>> +	list_del_rcu(&shrinker->list);
>>   	shrinker->flags &= ~SHRINKER_REGISTERED;
>>   	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>   		unregister_memcg_shrinker(shrinker);
>>   	debugfs_entry = shrinker_debugfs_remove(shrinker);
>>   	up_write(&shrinker_rwsem);
>>   
>> +	synchronize_srcu(&shrinker_srcu);
>> +
>>   	debugfs_remove_recursive(debugfs_entry);
>>   
>>   	kfree(shrinker->nr_deferred);
>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>   {
>>   	down_write(&shrinker_rwsem);
>>   	up_write(&shrinker_rwsem);
>> +	synchronize_srcu(&shrinker_srcu);
>>   }
>>   EXPORT_SYMBOL(synchronize_shrinkers);
>>   
>> @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   {
>>   	unsigned long ret, freed = 0;
>>   	struct shrinker *shrinker;
>> +	int srcu_idx;
>>   
>>   	/*
>>   	 * The root memcg might be allocated even though memcg is disabled
>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>   
>> -	if (!down_read_trylock(&shrinker_rwsem))
>> -		goto out;
>> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>>   
>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>> +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>> +				 srcu_read_lock_held(&shrinker_srcu)) {
>>   		struct shrink_control sc = {
>>   			.gfp_mask = gfp_mask,
>>   			.nid = nid,
>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   		if (ret == SHRINK_EMPTY)
>>   			ret = 0;
>>   		freed += ret;
>> -		/*
>> -		 * Bail out if someone want to register a new shrinker to
>> -		 * prevent the registration from being stalled for long periods
>> -		 * by parallel ongoing shrinking.
>> -		 */
>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>> -			freed = freed ? : 1;
>> -			break;
>> -		}
>>   	}
>>   
>> -	up_read(&shrinker_rwsem);
>> -out:
>> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>   	cond_resched();
>>   	return freed;
>>   }
>> -- 
>> 2.20.1
>>
>>
> 
> Hi Qi,
> 
> A different problem I realized after my old attempt to use SRCU was that the
> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
> these days, and SRCU is too unfair to the unregister path IMO.

Hi Sultan,

IIUC, for unregister_shrinker(), the wait time is hardly longer with
SRCU than with shrinker_rwsem before.

And I just did a simple test. After using the script in the cover letter to
create the shrink_slab hotspot, I ran umount 1k times concurrently, and then
used bpftrace to measure the latency of unregister_shrinker() as follows:

bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; }
             kretprobe:unregister_shrinker /@start[tid]/ {
                     @ns[comm] = hist(nsecs - @start[tid]);
                     delete(@start[tid]);
             }'

@ns[umount]:
[16K, 32K)             3 |                                                    |
[32K, 64K)            66 |@@@@@@@@@@                                          |
[64K, 128K)           32 |@@@@@                                               |
[128K, 256K)          22 |@@@                                                 |
[256K, 512K)          48 |@@@@@@@                                             |
[512K, 1M)            19 |@@@                                                 |
[1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@                               |
[2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
[8M, 16M)             55 |@@@@@@@@@                                           |

I see that the worst-case latency of unregister_shrinker() is between
8ms and 16ms, which feels tolerable?

Thanks,
Qi

> 
> Although I never got around to submitting it, I made a non-SRCU solution [1]
> that uses fine-grained locking instead, which is fair to both the register path
> and unregister path. (The patch I've linked is a version of this adapted to an
> older 4.14 kernel FYI, but it can be reworked for the current kernel.)
> 
> What do you think about the fine-grained locking approach?
> 
> Thanks,
> Sultan
> 
> [1] https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/7] make slab shrink lockless
  2023-02-23 18:19 ` [PATCH v2 0/7] make slab shrink lockless Paul E. McKenney
@ 2023-02-24  4:08   ` Qi Zheng
  0 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-24  4:08 UTC (permalink / raw)
  To: paulmck
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, sultan, dave, penguin-kernel,
	linux-mm, linux-kernel



On 2023/2/24 02:19, Paul E. McKenney wrote:
> On Thu, Feb 23, 2023 at 09:27:18PM +0800, Qi Zheng wrote:
>> Hi all,
>>
>> This patch series aims to make slab shrink lockless.
>>
>> 1. Background
>> =============
>>
>> On our servers, we often find the following system cpu hotspots:
>>
>>    44.16%  [kernel]  [k] down_read_trylock
>>    14.12%  [kernel]  [k] up_read
>>    13.43%  [kernel]  [k] shrink_slab
>>     5.25%  [kernel]  [k] count_shadow_nodes
>>     3.42%  [kernel]  [k] idr_find
>>
>> Then we used bpftrace to capture its calltrace as follows:
>>
>> @[
>>      down_read_trylock+5
>>      shrink_slab+292
>>      shrink_node+640
>>      do_try_to_free_pages+211
>>      try_to_free_mem_cgroup_pages+266
>>      try_charge_memcg+386
>>      charge_memcg+51
>>      __mem_cgroup_charge+44
>>      __handle_mm_fault+1416
>>      handle_mm_fault+260
>>      do_user_addr_fault+459
>>      exc_page_fault+104
>>      asm_exc_page_fault+38
>>      clear_user_rep_good+18
>>      read_zero+100
>>      vfs_read+176
>>      ksys_read+93
>>      do_syscall_64+62
>>      entry_SYSCALL_64_after_hwframe+114
>> ]: 1868979
>>
>> It is easy to see that this is caused by the frequent failure to obtain the
>> read lock of shrinker_rwsem when reclaiming slab memory.
>>
>> Currently, the shrinker_rwsem is a global lock. And the following cases may
>> cause the above system cpu hotspots:
>>
>> a. the write lock of shrinker_rwsem was held for too long. For example, there
>>     are many memcgs in the system, which causes some paths to hold locks and
>>     traverse it for too long. (e.g. expand_shrinker_info())
>> b. the read lock of shrinker_rwsem was held for too long, and a writer came at
>>     this time. Then this writer will be forced to wait and block all subsequent
>>     readers.
>>     For example:
>>     - be scheduled when the read lock of shrinker_rwsem is held in
>>       do_shrink_slab()
>>     - some shrinker are blocked for too long. Like the case mentioned in the
>>       patchset[1].
>>
>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>
>> And all the down_read_trylock() hotspots caused by the above cases can be
>> solved by replacing the shrinker_rwsem trylocks with SRCU.

Hi Paul,

> 
> Glad to see that making SRCU unconditional was helpful!  And I do very
> much like the idea of the shrinker running better!

+1 :)

> 
> The main thing that enabled unconditional SRCU was the code added in
> v5.19 to dynamically allocate SRCU's srcu_node combining tree.  This is
> important for a number of Linux distributions that have NR_CPUS up in the
> thousands, for which this combining tree is quite large.  In v5.19 and
> later, srcu_struct structures without frequent call_srcu() invocations
> never allocate that combining tree.  Even srcu_struct structures that
> have enough call_srcu() activity to cause the lock contention that in
> turn forces the combining tree to be allocated, that combining tree
> is sized for the actual number of CPUs present, which is usually way
> smaller than NR_CPUS.

Thank you very much for such a detailed background introduction. :)

> 
> So if you are going to backport this back past v5.19, you might also
> need those SRCU changes.  Or not, depending on how much memory your
> systems are equipped with.  ;-)

Got it.

Thanks,
Qi

> 
> 							Thanx, Paul
> 
>> 2. Survey
>> =========
>>
>> Before doing the code implementation, I found that there were many similar
>> submissions in the community:
>>
>> a. Davidlohr Bueso submitted a patch in 2015.
>>     Subject: [PATCH -next v2] mm: srcu-ify shrinkers
>>     Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>     Result: It was finally merged into the linux-next branch, but failed on arm
>>             allnoconfig (without CONFIG_SRCU)
>>
>> b. Tetsuo Handa submitted a patchset in 2017.
>>     Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
>>     Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>     Result: Finally chose to use the current simple way (break when
>>             rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
>>             but SRCU was not unconditionally enabled at the time.
>>
>> c. Kirill Tkhai submitted a patchset in 2018.
>>     Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
>>     Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>     Result: At that time, SRCU was not unconditionally enabled, and there were
>>             some objections to enabling SRCU. Later, because Kirill's focus was
>>             moved to other things, this patchset was not continued to be updated.
>>
>> d. Sultan Alsawaf submitted a patch in 2021.
>>     Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
>>     Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>     Result: Rejected because SRCU was not unconditionally enabled.
>>
>> We can find that almost all these historical commits were abandoned because SRCU
>> was not unconditionally enabled. But now SRCU has been unconditionally enable
>> by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
>> with SRCU.
>>
>> [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
>>
>> 3. Reproduction and testing
>> ===========================
>>
>> We can reproduce the down_read_trylock() hotspot through the following script:
>>
>> ```
>> #!/bin/bash
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>>          mkdir /sys/fs/cgroup/memory/test
>>          echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>>          for i in `seq 0 $1`;
>>          do
>>                  mkdir /sys/fs/cgroup/memory/test/$i;
>>                  echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>                  mkdir -p $DIR/$i;
>>          done
>> }
>>
>> do_mount()
>> {
>>          for i in `seq $1 $2`;
>>          do
>>                  mount -t tmpfs $i $DIR/$i;
>>          done
>> }
>>
>> do_touch()
>> {
>>          for i in `seq $1 $2`;
>>          do
>>                  echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>                  dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>>          done
>> }
>>
>> do_create 2000
>> do_mount 0 2000
>> do_touch 0 1000
>> ```
>>
>> Save the above script and execute it, we can get the following perf hotspots:
>>
>>    46.60%  [kernel]  [k] down_read_trylock
>>    18.70%  [kernel]  [k] up_read
>>    15.44%  [kernel]  [k] shrink_slab
>>     4.37%  [kernel]  [k] _find_next_bit
>>     2.75%  [kernel]  [k] xa_load
>>     2.07%  [kernel]  [k] idr_find
>>     1.73%  [kernel]  [k] do_shrink_slab
>>     1.42%  [kernel]  [k] shrink_lruvec
>>     0.74%  [kernel]  [k] shrink_node
>>     0.60%  [kernel]  [k] list_lru_count_one
>>
>> After applying this patchset, the hotspot becomes as follows:
>>
>>    19.53%  [kernel]  [k] _find_next_bit
>>    14.63%  [kernel]  [k] do_shrink_slab
>>    14.58%  [kernel]  [k] shrink_slab
>>    11.83%  [kernel]  [k] shrink_lruvec
>>     9.33%  [kernel]  [k] __blk_flush_plug
>>     6.67%  [kernel]  [k] mem_cgroup_iter
>>     3.73%  [kernel]  [k] list_lru_count_one
>>     2.43%  [kernel]  [k] shrink_node
>>     1.96%  [kernel]  [k] super_cache_count
>>     1.78%  [kernel]  [k] __rcu_read_unlock
>>     1.38%  [kernel]  [k] __srcu_read_lock
>>     1.30%  [kernel]  [k] xas_descend
>>
>> We can see that the slab reclaim is no longer blocked by shinker_rwsem trylock,
>> which realizes the lockless slab reclaim.
>>
>> This series is based on next-20230217.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
>> Changelog in v1 -> v2:
>>   - add a map_nr_max field to shrinker_info (suggested by Kirill)
>>   - use shrinker_mutex in reparent_shrinker_deferred() (pointed by Kirill)
>>
>> Qi Zheng (7):
>>    mm: vmscan: add a map_nr_max field to shrinker_info
>>    mm: vmscan: make global slab shrink lockless
>>    mm: vmscan: make memcg slab shrink lockless
>>    mm: shrinkers: make count and scan in shrinker debugfs lockless
>>    mm: vmscan: hold write lock to reparent shrinker nr_deferred
>>    mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
>>    mm: shrinkers: convert shrinker_rwsem to mutex
>>
>>   drivers/md/dm-cache-metadata.c |   2 +-
>>   drivers/md/dm-thin-metadata.c  |   2 +-
>>   fs/super.c                     |   2 +-
>>   include/linux/memcontrol.h     |   1 +
>>   mm/shrinker_debug.c            |  38 ++++-----
>>   mm/vmscan.c                    | 142 +++++++++++++++++----------------
>>   6 files changed, 92 insertions(+), 95 deletions(-)
>>
>> -- 
>> 2.20.1
>>

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-23 15:37     ` Rafael Aquini
@ 2023-02-24  4:09       ` Qi Zheng
  0 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-24  4:09 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, sultan, dave, penguin-kernel,
	paulmck, linux-mm, linux-kernel



On 2023/2/23 23:37, Rafael Aquini wrote:
> On Thu, Feb 23, 2023 at 10:26:45AM -0500, Rafael Aquini wrote:
>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>> it is easy to cause blocking in the following cases:
>>>
>>> a. the write lock of shrinker_rwsem was held for too long.
>>>     For example, there are many memcgs in the system, which
>>>     causes some paths to hold locks and traverse it for too
>>>     long. (e.g. expand_shrinker_info())
>>> b. the read lock of shrinker_rwsem was held for too long,
>>>     and a writer came at this time. Then this writer will be
>>>     forced to wait and block all subsequent readers.
>>>     For example:
>>>     - be scheduled when the read lock of shrinker_rwsem is
>>>       held in do_shrink_slab()
>>>     - some shrinker are blocked for too long. Like the case
>>>       mentioned in the patchset[1].
>>>
>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>> but they all gave up because SRCU was not unconditionally
>>> enabled.
>>>
>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>> the SRCU is unconditionally enabled. So it's time to use
>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>
>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>   mm/vmscan.c | 27 +++++++++++----------------
>>>   1 file changed, 11 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9f895ca6216c..02987a6f95d1 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>   
>>>   LIST_HEAD(shrinker_list);
>>>   DECLARE_RWSEM(shrinker_rwsem);
>>> +DEFINE_SRCU(shrinker_srcu);
>>>   
>>>   #ifdef CONFIG_MEMCG
>>>   static int shrinker_nr_max;
>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>   void register_shrinker_prepared(struct shrinker *shrinker)
>>>   {
>>>   	down_write(&shrinker_rwsem);
>>
>> I think you could revert the rwsem back to a simple mutex, now.
>>
> 
> NVM, that's exactly what patch 7 does. :)

Yeah. :)

> 
>   
> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-24  4:00     ` Qi Zheng
@ 2023-02-24  4:16       ` Qi Zheng
  2023-02-24  8:20       ` Sultan Alsawaf
  2023-02-24 21:02       ` Kirill Tkhai
  2 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-24  4:16 UTC (permalink / raw)
  To: Sultan Alsawaf, paulmck
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, dave, penguin-kernel, linux-mm,
	linux-kernel



On 2023/2/24 12:00, Qi Zheng wrote:
> 
> 
> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>> it is easy to cause blocking in the following cases:
>>>
>>> a. the write lock of shrinker_rwsem was held for too long.
>>>     For example, there are many memcgs in the system, which
>>>     causes some paths to hold locks and traverse it for too
>>>     long. (e.g. expand_shrinker_info())
>>> b. the read lock of shrinker_rwsem was held for too long,
>>>     and a writer came at this time. Then this writer will be
>>>     forced to wait and block all subsequent readers.
>>>     For example:
>>>     - be scheduled when the read lock of shrinker_rwsem is
>>>       held in do_shrink_slab()
>>>     - some shrinker are blocked for too long. Like the case
>>>       mentioned in the patchset[1].
>>>
>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>> but they all gave up because SRCU was not unconditionally
>>> enabled.
>>>
>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>> the SRCU is unconditionally enabled. So it's time to use
>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>
>>> [1]. 
>>> https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>> [3]. 
>>> https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>> [4]. 
>>> https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>> [5]. 
>>> https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>   mm/vmscan.c | 27 +++++++++++----------------
>>>   1 file changed, 11 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9f895ca6216c..02987a6f95d1 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct 
>>> task_struct *task,
>>>   LIST_HEAD(shrinker_list);
>>>   DECLARE_RWSEM(shrinker_rwsem);
>>> +DEFINE_SRCU(shrinker_srcu);
>>>   #ifdef CONFIG_MEMCG
>>>   static int shrinker_nr_max;
>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker 
>>> *shrinker)
>>>   void register_shrinker_prepared(struct shrinker *shrinker)
>>>   {
>>>       down_write(&shrinker_rwsem);
>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>       shrinker->flags |= SHRINKER_REGISTERED;
>>>       shrinker_debugfs_add(shrinker);
>>>       up_write(&shrinker_rwsem);
>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker 
>>> *shrinker)
>>>           return;
>>>       down_write(&shrinker_rwsem);
>>> -    list_del(&shrinker->list);
>>> +    list_del_rcu(&shrinker->list);
>>>       shrinker->flags &= ~SHRINKER_REGISTERED;
>>>       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>           unregister_memcg_shrinker(shrinker);
>>>       debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>       up_write(&shrinker_rwsem);
>>> +    synchronize_srcu(&shrinker_srcu);
>>> +
>>>       debugfs_remove_recursive(debugfs_entry);
>>>       kfree(shrinker->nr_deferred);
>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>   {
>>>       down_write(&shrinker_rwsem);
>>>       up_write(&shrinker_rwsem);
>>> +    synchronize_srcu(&shrinker_srcu);
>>>   }
>>>   EXPORT_SYMBOL(synchronize_shrinkers);
>>> @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, 
>>> int nid,
>>>   {
>>>       unsigned long ret, freed = 0;
>>>       struct shrinker *shrinker;
>>> +    int srcu_idx;
>>>       /*
>>>        * The root memcg might be allocated even though memcg is disabled
>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t 
>>> gfp_mask, int nid,
>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>> -    if (!down_read_trylock(&shrinker_rwsem))
>>> -        goto out;
>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>> -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>           struct shrink_control sc = {
>>>               .gfp_mask = gfp_mask,
>>>               .nid = nid,
>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t 
>>> gfp_mask, int nid,
>>>           if (ret == SHRINK_EMPTY)
>>>               ret = 0;
>>>           freed += ret;
>>> -        /*
>>> -         * Bail out if someone want to register a new shrinker to
>>> -         * prevent the registration from being stalled for long periods
>>> -         * by parallel ongoing shrinking.
>>> -         */
>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>> -            freed = freed ? : 1;
>>> -            break;
>>> -        }
>>>       }
>>> -    up_read(&shrinker_rwsem);
>>> -out:
>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>       cond_resched();
>>>       return freed;
>>>   }
>>> -- 
>>> 2.20.1
>>>
>>>
>>
>> Hi Qi,
>>
>> A different problem I realized after my old attempt to use SRCU was 
>> that the
>> unregister_shrinker() path became quite slow due to the heavy 
>> synchronize_srcu()
>> call. Both register_shrinker() *and* unregister_shrinker() are called 
>> frequently
>> these days, and SRCU is too unfair to the unregister path IMO.
> 
> Hi Sultan,
> 
> IIUC, for unregister_shrinker(), the wait time is hardly longer with
> SRCU than with shrinker_rwsem before.
> 
> And I just did a simple test. After using the script in cover letter to
> increase the shrink_slab hotspot, I did umount 1k times at the same
> time, and then I used bpftrace to measure the time consumption of
> unregister_shrinker() as follows:
> 
> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; } 
> kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs - 
> @start[tid]); delete(@start[tid]); }'
> 
> @ns[umount]:
> [16K, 32K)             3 |                                                    |
> [32K, 64K)            66 |@@@@@@@@@@                                          |
> [64K, 128K)           32 |@@@@@                                               |
> [128K, 256K)          22 |@@@                                                 |
> [256K, 512K)          48 |@@@@@@@                                             |
> [512K, 1M)            19 |@@@                                                 |
> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@                               |
> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
> [8M, 16M)             55 |@@@@@@@@@                                           |
> 
> I see that the highest time-consuming of unregister_shrinker() is 
> between 8ms and 16ms, which feels tolerable?

And when I use the synchronize_srcu_expedited() suggested by Paul, the
measured latency drops even further:

@ns[umount]:
[16K, 32K)           105 |@@@@@@@@@@                                          |
[32K, 64K)           521 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)          119 |@@@@@@@@@@@                                         |
[128K, 256K)          32 |@@@                                                 |
[256K, 512K)          70 |@@@@@@                                              |
[512K, 1M)            49 |@@@@                                                |
[1M, 2M)              34 |@@@                                                 |
[2M, 4M)              18 |@                                                   |
[4M, 8M)               4 |                                                    |

> 
> Thanks,
> Qi
> 
>>
>> Although I never got around to submitting it, I made a non-SRCU 
>> solution [1]
>> that uses fine-grained locking instead, which is fair to both the 
>> register path
>> and unregister path. (The patch I've linked is a version of this 
>> adapted to an
>> older 4.14 kernel FYI, but it can be reworked for the current kernel.)
>>
>> What do you think about the fine-grained locking approach?
>>
>> Thanks,
>> Sultan
>>
>> [1] 
>> https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
>>
> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-24  4:00     ` Qi Zheng
  2023-02-24  4:16       ` Qi Zheng
@ 2023-02-24  8:20       ` Sultan Alsawaf
  2023-02-24 10:12         ` Qi Zheng
  2023-02-24 21:02       ` Kirill Tkhai
  2 siblings, 1 reply; 33+ messages in thread
From: Sultan Alsawaf @ 2023-02-24  8:20 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, dave, penguin-kernel, paulmck,
	linux-mm, linux-kernel

On Fri, Feb 24, 2023 at 12:00:21PM +0800, Qi Zheng wrote:
> 
> 
> On 2023/2/24 02:24, Sultan Alsawaf wrote:
> > On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
> > > The shrinker_rwsem is a global lock in shrinkers subsystem,
> > > it is easy to cause blocking in the following cases:
> > > 
> > > a. the write lock of shrinker_rwsem was held for too long.
> > >     For example, there are many memcgs in the system, which
> > >     causes some paths to hold locks and traverse it for too
> > >     long. (e.g. expand_shrinker_info())
> > > b. the read lock of shrinker_rwsem was held for too long,
> > >     and a writer came at this time. Then this writer will be
> > >     forced to wait and block all subsequent readers.
> > >     For example:
> > >     - be scheduled when the read lock of shrinker_rwsem is
> > >       held in do_shrink_slab()
> > >     - some shrinker are blocked for too long. Like the case
> > >       mentioned in the patchset[1].
> > > 
> > > Therefore, many times in history ([2],[3],[4],[5]), some
> > > people wanted to replace shrinker_rwsem reader with SRCU,
> > > but they all gave up because SRCU was not unconditionally
> > > enabled.
> > > 
> > > But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> > > the SRCU is unconditionally enabled. So it's time to use
> > > SRCU to protect readers who previously held shrinker_rwsem.
> > > 
> > > [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> > > [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> > > [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> > > [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> > > [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> > > 
> > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > ---
> > >   mm/vmscan.c | 27 +++++++++++----------------
> > >   1 file changed, 11 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 9f895ca6216c..02987a6f95d1 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
> > >   LIST_HEAD(shrinker_list);
> > >   DECLARE_RWSEM(shrinker_rwsem);
> > > +DEFINE_SRCU(shrinker_srcu);
> > >   #ifdef CONFIG_MEMCG
> > >   static int shrinker_nr_max;
> > > @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
> > >   void register_shrinker_prepared(struct shrinker *shrinker)
> > >   {
> > >   	down_write(&shrinker_rwsem);
> > > -	list_add_tail(&shrinker->list, &shrinker_list);
> > > +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > >   	shrinker->flags |= SHRINKER_REGISTERED;
> > >   	shrinker_debugfs_add(shrinker);
> > >   	up_write(&shrinker_rwsem);
> > > @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
> > >   		return;
> > >   	down_write(&shrinker_rwsem);
> > > -	list_del(&shrinker->list);
> > > +	list_del_rcu(&shrinker->list);
> > >   	shrinker->flags &= ~SHRINKER_REGISTERED;
> > >   	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > >   		unregister_memcg_shrinker(shrinker);
> > >   	debugfs_entry = shrinker_debugfs_remove(shrinker);
> > >   	up_write(&shrinker_rwsem);
> > > +	synchronize_srcu(&shrinker_srcu);
> > > +
> > >   	debugfs_remove_recursive(debugfs_entry);
> > >   	kfree(shrinker->nr_deferred);
> > > @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
> > >   {
> > >   	down_write(&shrinker_rwsem);
> > >   	up_write(&shrinker_rwsem);
> > > +	synchronize_srcu(&shrinker_srcu);
> > >   }
> > >   EXPORT_SYMBOL(synchronize_shrinkers);
> > > @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >   {
> > >   	unsigned long ret, freed = 0;
> > >   	struct shrinker *shrinker;
> > > +	int srcu_idx;
> > >   	/*
> > >   	 * The root memcg might be allocated even though memcg is disabled
> > > @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > >   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > > -	if (!down_read_trylock(&shrinker_rwsem))
> > > -		goto out;
> > > +	srcu_idx = srcu_read_lock(&shrinker_srcu);
> > > -	list_for_each_entry(shrinker, &shrinker_list, list) {
> > > +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> > > +				 srcu_read_lock_held(&shrinker_srcu)) {
> > >   		struct shrink_control sc = {
> > >   			.gfp_mask = gfp_mask,
> > >   			.nid = nid,
> > > @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >   		if (ret == SHRINK_EMPTY)
> > >   			ret = 0;
> > >   		freed += ret;
> > > -		/*
> > > -		 * Bail out if someone want to register a new shrinker to
> > > -		 * prevent the registration from being stalled for long periods
> > > -		 * by parallel ongoing shrinking.
> > > -		 */
> > > -		if (rwsem_is_contended(&shrinker_rwsem)) {
> > > -			freed = freed ? : 1;
> > > -			break;
> > > -		}
> > >   	}
> > > -	up_read(&shrinker_rwsem);
> > > -out:
> > > +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> > >   	cond_resched();
> > >   	return freed;
> > >   }
> > > -- 
> > > 2.20.1
> > > 
> > > 
> > 
> > Hi Qi,
> > 
> > A different problem I realized after my old attempt to use SRCU was that the
> > unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
> > call. Both register_shrinker() *and* unregister_shrinker() are called frequently
> > these days, and SRCU is too unfair to the unregister path IMO.
> 
> Hi Sultan,
> 
> IIUC, for unregister_shrinker(), the wait time is hardly longer with
> SRCU than with shrinker_rwsem before.

The wait time can be quite different because with shrinker_rwsem, the
rwsem_is_contended() bailout would cause unregister_shrinker() to wait for only
one random shrinker to finish at worst rather than waiting for *all* shrinkers
to finish.
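
In other words, the reader side before this patch was roughly (condensed
from the hunk removed in the patch above, not the exact code):

	if (!down_read_trylock(&shrinker_rwsem))
		goto out;
	list_for_each_entry(shrinker, &shrinker_list, list) {
		freed += do_shrink_slab(&sc, shrinker, priority);
		/* a pending down_write() makes us bail after this shrinker */
		if (rwsem_is_contended(&shrinker_rwsem)) {
			freed = freed ? : 1;
			break;
		}
	}
	up_read(&shrinker_rwsem);

so the writer in unregister_shrinker() only had to outlast the
do_shrink_slab() call that was running when it queued up, whereas
synchronize_srcu() has to wait for entire list walks to finish.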

> And I just did a simple test. After using the script in cover letter to
> increase the shrink_slab hotspot, I did umount 1k times at the same
> time, and then I used bpftrace to measure the time consumption of
> unregister_shrinker() as follows:
> 
> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; }
> kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs -
> @start[tid]); delete(@start[tid]); }'
> 
> @ns[umount]:
> [16K, 32K)             3 |      |
> [32K, 64K)            66 |@@@@@@@@@@      |
> [64K, 128K)           32 |@@@@@      |
> [128K, 256K)          22 |@@@      |
> [256K, 512K)          48 |@@@@@@@      |
> [512K, 1M)            19 |@@@      |
> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
> [8M, 16M)             55 |@@@@@@@@@
> 
> I see that the highest time-consuming of unregister_shrinker() is between
> 8ms and 16ms, which feels tolerable?

If you've got a fast x86 machine then I'd say that's a bit slow. :)

This depends a lot on which shrinkers are active on your system and how much
work each one does upon running. If a driver's shrinker doesn't have much to do
because there's nothing it can shrink further, then it'll run fast. Conversely,
if a driver is stressed in a way that constantly creates a lot of potential work
for its shrinker, then the shrinker will run longer.

Since shrinkers are allowed to sleep, the delays can really add up when waiting
for all of them to finish running. In the past, I recall observing delays of
100ms+ in unregister_shrinker() on slower arm64 hardware when I stress tested
the SRCU approach.
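
As a purely hypothetical example (the foo_* names are made up), a
shrinker whose ->scan_objects callback sleeps like this keeps the SRCU
read-side section that shrink_slab() holds around it open for the whole
scan, and synchronize_srcu() in unregister_shrinker() must wait for
every such in-flight scan to return:

static unsigned long foo_scan_objects(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	unsigned long freed;

	mutex_lock(&foo_lock);			/* may sleep on contention */
	freed = foo_evict_objects(sc->nr_to_scan);	/* may block on I/O */
	mutex_unlock(&foo_lock);

	return freed;
}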

If your GPU driver has a shrinker (such as i915), I suggest testing again under
heavy GPU load. The GPU shrinkers can be pretty heavy IIRC.

Thanks,
Sultan

> Thanks,
> Qi
> 
> > 
> > Although I never got around to submitting it, I made a non-SRCU solution [1]
> > that uses fine-grained locking instead, which is fair to both the register path
> > and unregister path. (The patch I've linked is a version of this adapted to an
> > older 4.14 kernel FYI, but it can be reworked for the current kernel.)
> > 
> > What do you think about the fine-grained locking approach?
> > 
> > Thanks,
> > Sultan
> > 
> > [1] https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
> > 
> 
> -- 
> Thanks,
> Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-24  8:20       ` Sultan Alsawaf
@ 2023-02-24 10:12         ` Qi Zheng
  0 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-24 10:12 UTC (permalink / raw)
  To: Sultan Alsawaf
  Cc: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, dave, penguin-kernel, paulmck,
	linux-mm, linux-kernel



On 2023/2/24 16:20, Sultan Alsawaf wrote:
> On Fri, Feb 24, 2023 at 12:00:21PM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>>> it is easy to cause blocking in the following cases:
>>>>
>>>> a. the write lock of shrinker_rwsem was held for too long.
>>>>      For example, there are many memcgs in the system, which
>>>>      causes some paths to hold locks and traverse it for too
>>>>      long. (e.g. expand_shrinker_info())
>>>> b. the read lock of shrinker_rwsem was held for too long,
>>>>      and a writer came at this time. Then this writer will be
>>>>      forced to wait and block all subsequent readers.
>>>>      For example:
>>>>      - be scheduled when the read lock of shrinker_rwsem is
>>>>        held in do_shrink_slab()
>>>>      - some shrinker are blocked for too long. Like the case
>>>>        mentioned in the patchset[1].
>>>>
>>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>>> but they all gave up because SRCU was not unconditionally
>>>> enabled.
>>>>
>>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>>> the SRCU is unconditionally enabled. So it's time to use
>>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>>
>>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>>    mm/vmscan.c | 27 +++++++++++----------------
>>>>    1 file changed, 11 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 9f895ca6216c..02987a6f95d1 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>    LIST_HEAD(shrinker_list);
>>>>    DECLARE_RWSEM(shrinker_rwsem);
>>>> +DEFINE_SRCU(shrinker_srcu);
>>>>    #ifdef CONFIG_MEMCG
>>>>    static int shrinker_nr_max;
>>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>>    void register_shrinker_prepared(struct shrinker *shrinker)
>>>>    {
>>>>    	down_write(&shrinker_rwsem);
>>>> -	list_add_tail(&shrinker->list, &shrinker_list);
>>>> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>    	shrinker->flags |= SHRINKER_REGISTERED;
>>>>    	shrinker_debugfs_add(shrinker);
>>>>    	up_write(&shrinker_rwsem);
>>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>    		return;
>>>>    	down_write(&shrinker_rwsem);
>>>> -	list_del(&shrinker->list);
>>>> +	list_del_rcu(&shrinker->list);
>>>>    	shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>    	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>    		unregister_memcg_shrinker(shrinker);
>>>>    	debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>    	up_write(&shrinker_rwsem);
>>>> +	synchronize_srcu(&shrinker_srcu);
>>>> +
>>>>    	debugfs_remove_recursive(debugfs_entry);
>>>>    	kfree(shrinker->nr_deferred);
>>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>>    {
>>>>    	down_write(&shrinker_rwsem);
>>>>    	up_write(&shrinker_rwsem);
>>>> +	synchronize_srcu(&shrinker_srcu);
>>>>    }
>>>>    EXPORT_SYMBOL(synchronize_shrinkers);
>>>> @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>    {
>>>>    	unsigned long ret, freed = 0;
>>>>    	struct shrinker *shrinker;
>>>> +	int srcu_idx;
>>>>    	/*
>>>>    	 * The root memcg might be allocated even though memcg is disabled
>>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>    	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>    		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>> -	if (!down_read_trylock(&shrinker_rwsem))
>>>> -		goto out;
>>>> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>>>> +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>> +				 srcu_read_lock_held(&shrinker_srcu)) {
>>>>    		struct shrink_control sc = {
>>>>    			.gfp_mask = gfp_mask,
>>>>    			.nid = nid,
>>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>    		if (ret == SHRINK_EMPTY)
>>>>    			ret = 0;
>>>>    		freed += ret;
>>>> -		/*
>>>> -		 * Bail out if someone want to register a new shrinker to
>>>> -		 * prevent the registration from being stalled for long periods
>>>> -		 * by parallel ongoing shrinking.
>>>> -		 */
>>>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> -			freed = freed ? : 1;
>>>> -			break;
>>>> -		}
>>>>    	}
>>>> -	up_read(&shrinker_rwsem);
>>>> -out:
>>>> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>    	cond_resched();
>>>>    	return freed;
>>>>    }
>>>> -- 
>>>> 2.20.1
>>>>
>>>>
>>>
>>> Hi Qi,
>>>
>>> A different problem I realized after my old attempt to use SRCU was that the
>>> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
>>> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
>>> these days, and SRCU is too unfair to the unregister path IMO.
>>
>> Hi Sultan,
>>
>> IIUC, for unregister_shrinker(), the wait time is hardly longer with
>> SRCU than with shrinker_rwsem before.
> 
> The wait time can be quite different because with shrinker_rwsem, the
> rwsem_is_contended() bailout would cause unregister_shrinker() to wait for only
> one random shrinker to finish at worst rather than waiting for *all* shrinkers
> to finish.

Yes, to be exact, unregister_shrinker() needs to wait for all the
shrinker invocations that entered the grace period before it. But the
benefit in exchange is that slab shrinking becomes completely lock-free,
and I think this is more worthwhile than letting unregister_shrinker()
wait a little longer.
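
To spell out the pattern I mean (a simplified sketch of the patch above,
not the exact code):

	/* reader side: shrink_slab() */
	srcu_idx = srcu_read_lock(&shrinker_srcu);
	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu))
		freed += do_shrink_slab(&sc, shrinker, priority); /* may sleep */
	srcu_read_unlock(&shrinker_srcu, srcu_idx);

	/* writer side: unregister_shrinker() */
	list_del_rcu(&shrinker->list);
	/* returns only after all readers that started earlier have finished */
	synchronize_srcu(&shrinker_srcu);
	kfree(shrinker->nr_deferred);

So one unregister_shrinker() may indeed wait for every shrink_slab()
walk already in flight, but none of those walks takes any lock anymore.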

> 
>> And I just did a simple test. After using the script in cover letter to
>> increase the shrink_slab hotspot, I did umount 1k times at the same
>> time, and then I used bpftrace to measure the time consumption of
>> unregister_shrinker() as follows:
>>
>> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; }
>> kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs -
>> @start[tid]); delete(@start[tid]); }'
>>
>> @ns[umount]:
>> [16K, 32K)             3 |      |
>> [32K, 64K)            66 |@@@@@@@@@@      |
>> [64K, 128K)           32 |@@@@@      |
>> [128K, 256K)          22 |@@@      |
>> [256K, 512K)          48 |@@@@@@@      |
>> [512K, 1M)            19 |@@@      |
>> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
>> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
>> [8M, 16M)             55 |@@@@@@@@@
>>
>> I see that the highest time-consuming of unregister_shrinker() is between
>> 8ms and 16ms, which feels tolerable?
> 
> If you've got a fast x86 machine then I'd say that's a bit slow. :)

Nope, I tested it on a qemu virtual machine.

And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
8260 CPU @ 2.40GHz) and the results are as follows:

1) use synchronize_srcu():

@ns[umount]:
[8K, 16K)             83 |@@@@@@@                                             |
[16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)            78 |@@@@@@@                                             |
[64K, 128K)            6 |                                                    |
[128K, 256K)           7 |                                                    |
[256K, 512K)          29 |@@                                                  |
[512K, 1M)            51 |@@@@                                                |
[1M, 2M)              90 |@@@@@@@@                                            |
[2M, 4M)              70 |@@@@@@                                              |
[4M, 8M)               8 |                                                    |

2) use synchronize_srcu_expedited():

@ns[umount]:
[8K, 16K)             31 |@@                                                  |
[16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)           158 |@@@@@@@@@@                                          |
[64K, 128K)            4 |                                                    |
[128K, 256K)           2 |                                                    |
[256K, 512K)           2 |                                                    |

Thanks,
Qi

> 
> This depends a lot on which shrinkers are active on your system and how much
> work each one does upon running. If a driver's shrinker doesn't have much to do
> because there's nothing it can shrink further, then it'll run fast. Conversely,
> if a driver is stressed in a way that constantly creates a lot of potential work
> for its shrinker, then the shrinker will run longer.
> 
> Since shrinkers are allowed to sleep, the delays can really add up when waiting
> for all of them to finish running. In the past, I recall observing delays of
> 100ms+ in unregister_shrinker() on slower arm64 hardware when I stress tested
> the SRCU approach.
> 
> If your GPU driver has a shrinker (such as i915), I suggest testing again under
> heavy GPU load. The GPU shrinkers can be pretty heavy IIRC.
> 
> Thanks,
> Sultan
> 
>> Thanks,
>> Qi
>>
>>>
>>> Although I never got around to submitting it, I made a non-SRCU solution [1]
>>> that uses fine-grained locking instead, which is fair to both the register path
>>> and unregister path. (The patch I've linked is a version of this adapted to an
>>> older 4.14 kernel FYI, but it can be reworked for the current kernel.)
>>>
>>> What do you think about the fine-grained locking approach?
>>>
>>> Thanks,
>>> Sultan
>>>
>>> [1] https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
>>>
>>
>> -- 
>> Thanks,
>> Qi

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-24  4:00     ` Qi Zheng
  2023-02-24  4:16       ` Qi Zheng
  2023-02-24  8:20       ` Sultan Alsawaf
@ 2023-02-24 21:02       ` Kirill Tkhai
  2023-02-24 21:14         ` Kirill Tkhai
  2 siblings, 1 reply; 33+ messages in thread
From: Kirill Tkhai @ 2023-02-24 21:02 UTC (permalink / raw)
  To: Qi Zheng, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel

On 24.02.2023 07:00, Qi Zheng wrote:
> 
> 
> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>> it is easy to cause blocking in the following cases:
>>>
>>> a. the write lock of shrinker_rwsem was held for too long.
>>>     For example, there are many memcgs in the system, which
>>>     causes some paths to hold locks and traverse it for too
>>>     long. (e.g. expand_shrinker_info())
>>> b. the read lock of shrinker_rwsem was held for too long,
>>>     and a writer came at this time. Then this writer will be
>>>     forced to wait and block all subsequent readers.
>>>     For example:
>>>     - be scheduled when the read lock of shrinker_rwsem is
>>>       held in do_shrink_slab()
>>>     - some shrinker are blocked for too long. Like the case
>>>       mentioned in the patchset[1].
>>>
>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>> but they all gave up because SRCU was not unconditionally
>>> enabled.
>>>
>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>> the SRCU is unconditionally enabled. So it's time to use
>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>
>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>   mm/vmscan.c | 27 +++++++++++----------------
>>>   1 file changed, 11 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9f895ca6216c..02987a6f95d1 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>     LIST_HEAD(shrinker_list);
>>>   DECLARE_RWSEM(shrinker_rwsem);
>>> +DEFINE_SRCU(shrinker_srcu);
>>>     #ifdef CONFIG_MEMCG
>>>   static int shrinker_nr_max;
>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>   void register_shrinker_prepared(struct shrinker *shrinker)
>>>   {
>>>       down_write(&shrinker_rwsem);
>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>       shrinker->flags |= SHRINKER_REGISTERED;
>>>       shrinker_debugfs_add(shrinker);
>>>       up_write(&shrinker_rwsem);
>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>           return;
>>>         down_write(&shrinker_rwsem);
>>> -    list_del(&shrinker->list);
>>> +    list_del_rcu(&shrinker->list);
>>>       shrinker->flags &= ~SHRINKER_REGISTERED;
>>>       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>           unregister_memcg_shrinker(shrinker);
>>>       debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>       up_write(&shrinker_rwsem);
>>>   +    synchronize_srcu(&shrinker_srcu);
>>> +
>>>       debugfs_remove_recursive(debugfs_entry);
>>>         kfree(shrinker->nr_deferred);
>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>   {
>>>       down_write(&shrinker_rwsem);
>>>       up_write(&shrinker_rwsem);
>>> +    synchronize_srcu(&shrinker_srcu);
>>>   }
>>>   EXPORT_SYMBOL(synchronize_shrinkers);
>>>   @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>   {
>>>       unsigned long ret, freed = 0;
>>>       struct shrinker *shrinker;
>>> +    int srcu_idx;
>>>         /*
>>>        * The root memcg might be allocated even though memcg is disabled
>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>   -    if (!down_read_trylock(&shrinker_rwsem))
>>> -        goto out;
>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>   -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>           struct shrink_control sc = {
>>>               .gfp_mask = gfp_mask,
>>>               .nid = nid,
>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>           if (ret == SHRINK_EMPTY)
>>>               ret = 0;
>>>           freed += ret;
>>> -        /*
>>> -         * Bail out if someone want to register a new shrinker to
>>> -         * prevent the registration from being stalled for long periods
>>> -         * by parallel ongoing shrinking.
>>> -         */
>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>> -            freed = freed ? : 1;
>>> -            break;
>>> -        }
>>>       }
>>>   -    up_read(&shrinker_rwsem);
>>> -out:
>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>       cond_resched();
>>>       return freed;
>>>   }
>>> -- 
>>> 2.20.1
>>>
>>>
>>
>> Hi Qi,
>>
>> A different problem I realized after my old attempt to use SRCU was that the
>> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
>> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
>> these days, and SRCU is too unfair to the unregister path IMO.
> 
> Hi Sultan,
> 
> IIUC, for unregister_shrinker(), the wait time is hardly longer with
> SRCU than with shrinker_rwsem before.
> 
> And I just did a simple test. After using the script in cover letter to
> increase the shrink_slab hotspot, I did umount 1k times at the same
> time, and then I used bpftrace to measure the time consumption of
> unregister_shrinker() as follows:
> 
> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; } kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> 
> @ns[umount]:
> [16K, 32K)             3 |      |
> [32K, 64K)            66 |@@@@@@@@@@      |
> [64K, 128K)           32 |@@@@@      |
> [128K, 256K)          22 |@@@      |
> [256K, 512K)          48 |@@@@@@@      |
> [512K, 1M)            19 |@@@      |
> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
> [8M, 16M)             55 |@@@@@@@@@
> 
> I see that the highest time-consuming of unregister_shrinker() is between 8ms and 16ms, which feels tolerable?

The fundamental difference is that before the patchset this for_each_set_bit() iteration could be interrupted between
two do_shrink_slab() calls, while after the patchset we can only leave for_each_set_bit() after visiting all set bits.

Using only synchronize_srcu_expedited() won't help here.

My opinion is we should restore a check similar to the rwsem_is_contended() check that we had before. Something like
the below on top of your patchset, merged into the appropriate patch:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27ef9946ae8a..50e7812468ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 LIST_HEAD(shrinker_list);
 DEFINE_MUTEX(shrinker_mutex);
 DEFINE_SRCU(shrinker_srcu);
+static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
 
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
@@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
 	mutex_unlock(&shrinker_mutex);
 
+	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 
 	debugfs_remove_recursive(debugfs_entry);
@@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
  */
 void synchronize_shrinkers(void)
 {
+	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 }
 EXPORT_SYMBOL(synchronize_shrinkers);
@@ -908,7 +911,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
-	int srcu_idx;
+	int srcu_idx, generation;
 	int i;
 
 	if (!mem_cgroup_online(memcg))
@@ -919,6 +922,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (unlikely(!info))
 		goto unlock;
 
+	generation = atomic_read(&shrinker_srcu_generation);
 	for_each_set_bit(i, info->map, info->map_nr_max) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
@@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 				set_shrinker_bit(memcg, nid, i);
 		}
 		freed += ret;
+
+		if (atomic_read(&shrinker_srcu_generation) != generation) {
+			freed = freed ? : 1;
+			break;
+		}
 	}
 unlock:
 	srcu_read_unlock(&shrinker_srcu, srcu_idx);
@@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
-	int srcu_idx;
+	int srcu_idx, generation;
 
 	/*
 	 * The root memcg might be allocated even though memcg is disabled
@@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
 	srcu_idx = srcu_read_lock(&shrinker_srcu);
+	generation = atomic_read(&shrinker_srcu_generation);
 
 	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
 				 srcu_read_lock_held(&shrinker_srcu)) {
@@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
+
+		if (atomic_read(&shrinker_srcu_generation) != generation) {
+			freed = freed ? : 1;
+			break;
+		}
 	}
 
 	srcu_read_unlock(&shrinker_srcu, srcu_idx);

Kirill

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-24 21:02       ` Kirill Tkhai
@ 2023-02-24 21:14         ` Kirill Tkhai
  2023-02-25  8:08           ` Qi Zheng
  0 siblings, 1 reply; 33+ messages in thread
From: Kirill Tkhai @ 2023-02-24 21:14 UTC (permalink / raw)
  To: Qi Zheng, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel

On 25.02.2023 00:02, Kirill Tkhai wrote:
> On 24.02.2023 07:00, Qi Zheng wrote:
>>
>>
>> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>>> it is easy to cause blocking in the following cases:
>>>>
>>>> a. the write lock of shrinker_rwsem was held for too long.
>>>>     For example, there are many memcgs in the system, which
>>>>     causes some paths to hold locks and traverse it for too
>>>>     long. (e.g. expand_shrinker_info())
>>>> b. the read lock of shrinker_rwsem was held for too long,
>>>>     and a writer came at this time. Then this writer will be
>>>>     forced to wait and block all subsequent readers.
>>>>     For example:
>>>>     - be scheduled when the read lock of shrinker_rwsem is
>>>>       held in do_shrink_slab()
>>>>     - some shrinker are blocked for too long. Like the case
>>>>       mentioned in the patchset[1].
>>>>
>>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>>> but they all gave up because SRCU was not unconditionally
>>>> enabled.
>>>>
>>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>>> the SRCU is unconditionally enabled. So it's time to use
>>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>>
>>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>>   mm/vmscan.c | 27 +++++++++++----------------
>>>>   1 file changed, 11 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 9f895ca6216c..02987a6f95d1 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>     LIST_HEAD(shrinker_list);
>>>>   DECLARE_RWSEM(shrinker_rwsem);
>>>> +DEFINE_SRCU(shrinker_srcu);
>>>>     #ifdef CONFIG_MEMCG
>>>>   static int shrinker_nr_max;
>>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>>   void register_shrinker_prepared(struct shrinker *shrinker)
>>>>   {
>>>>       down_write(&shrinker_rwsem);
>>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>       shrinker->flags |= SHRINKER_REGISTERED;
>>>>       shrinker_debugfs_add(shrinker);
>>>>       up_write(&shrinker_rwsem);
>>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>           return;
>>>>         down_write(&shrinker_rwsem);
>>>> -    list_del(&shrinker->list);
>>>> +    list_del_rcu(&shrinker->list);
>>>>       shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>           unregister_memcg_shrinker(shrinker);
>>>>       debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>       up_write(&shrinker_rwsem);
>>>>   +    synchronize_srcu(&shrinker_srcu);
>>>> +
>>>>       debugfs_remove_recursive(debugfs_entry);
>>>>         kfree(shrinker->nr_deferred);
>>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>>   {
>>>>       down_write(&shrinker_rwsem);
>>>>       up_write(&shrinker_rwsem);
>>>> +    synchronize_srcu(&shrinker_srcu);
>>>>   }
>>>>   EXPORT_SYMBOL(synchronize_shrinkers);
>>>>   @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>   {
>>>>       unsigned long ret, freed = 0;
>>>>       struct shrinker *shrinker;
>>>> +    int srcu_idx;
>>>>         /*
>>>>        * The root memcg might be allocated even though memcg is disabled
>>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>   -    if (!down_read_trylock(&shrinker_rwsem))
>>>> -        goto out;
>>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>   -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>>           struct shrink_control sc = {
>>>>               .gfp_mask = gfp_mask,
>>>>               .nid = nid,
>>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>           if (ret == SHRINK_EMPTY)
>>>>               ret = 0;
>>>>           freed += ret;
>>>> -        /*
>>>> -         * Bail out if someone want to register a new shrinker to
>>>> -         * prevent the registration from being stalled for long periods
>>>> -         * by parallel ongoing shrinking.
>>>> -         */
>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> -            freed = freed ? : 1;
>>>> -            break;
>>>> -        }
>>>>       }
>>>>   -    up_read(&shrinker_rwsem);
>>>> -out:
>>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>       cond_resched();
>>>>       return freed;
>>>>   }
>>>> -- 
>>>> 2.20.1
>>>>
>>>>
>>>
>>> Hi Qi,
>>>
>>> A different problem I realized after my old attempt to use SRCU was that the
>>> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
>>> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
>>> these days, and SRCU is too unfair to the unregister path IMO.
>>
>> Hi Sultan,
>>
>> IIUC, for unregister_shrinker(), the wait time is hardly longer with
>> SRCU than with shrinker_rwsem before.
>>
>> And I just did a simple test. After using the script in cover letter to
>> increase the shrink_slab hotspot, I did umount 1k times at the same
>> time, and then I used bpftrace to measure the time consumption of
>> unregister_shrinker() as follows:
>>
>> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; } kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>>
>> @ns[umount]:
>> [16K, 32K)             3 |      |
>> [32K, 64K)            66 |@@@@@@@@@@      |
>> [64K, 128K)           32 |@@@@@      |
>> [128K, 256K)          22 |@@@      |
>> [256K, 512K)          48 |@@@@@@@      |
>> [512K, 1M)            19 |@@@      |
>> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
>> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
>> [8M, 16M)             55 |@@@@@@@@@
>>
>> I see that the highest time-consuming of unregister_shrinker() is between 8ms and 16ms, which feels tolerable?
> 
> The fundamental difference is that before the patchset this for_each_set_bit() iteration could be broken in the middle
> of two do_shrink_slab() calls, while after the patchset we can leave for_each_set_bit() only after visiting all set bits.
> 
> Using only synchronize_srcu_expedited() won't help here.
> 
> My opinion is we should restore a check similar to the rwsem_is_contendent() check that we had before. Something like
> the below on top of your patchset merged into appropriate patch:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 27ef9946ae8a..50e7812468ec 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>  LIST_HEAD(shrinker_list);
>  DEFINE_MUTEX(shrinker_mutex);
>  DEFINE_SRCU(shrinker_srcu);
> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>  
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>  	debugfs_entry = shrinker_debugfs_remove(shrinker);
>  	mutex_unlock(&shrinker_mutex);
>  
> +	atomic_inc(&shrinker_srcu_generation);
>  	synchronize_srcu(&shrinker_srcu);
>  
>  	debugfs_remove_recursive(debugfs_entry);
> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>   */
>  void synchronize_shrinkers(void)
>  {
> +	atomic_inc(&shrinker_srcu_generation);
>  	synchronize_srcu(&shrinker_srcu);
>  }
>  EXPORT_SYMBOL(synchronize_shrinkers);
> @@ -908,7 +911,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  {
>  	struct shrinker_info *info;
>  	unsigned long ret, freed = 0;
> -	int srcu_idx;
> +	int srcu_idx, generation;
>  	int i;
>  
>  	if (!mem_cgroup_online(memcg))
> @@ -919,6 +922,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (unlikely(!info))
>  		goto unlock;
>  
> +	generation = atomic_read(&shrinker_srcu_generation);
>  	for_each_set_bit(i, info->map, info->map_nr_max) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  				set_shrinker_bit(memcg, nid, i);
>  		}
>  		freed += ret;
> +
> +		if (atomic_read(&shrinker_srcu_generation) != generation) {
> +			freed = freed ? : 1;
> +			break;
> +		}
>  	}
>  unlock:
>  	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  {
>  	unsigned long ret, freed = 0;
>  	struct shrinker *shrinker;
> -	int srcu_idx;
> +	int srcu_idx, generation;
>  
>  	/*
>  	 * The root memcg might be allocated even though memcg is disabled
> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
>  	srcu_idx = srcu_read_lock(&shrinker_srcu);
> +	generation = atomic_read(&shrinker_srcu_generation);
>  
>  	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>  				 srcu_read_lock_held(&shrinker_srcu)) {
> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
> +		if (atomic_read(&shrinker_srcu_generation) != generation) {
> +			freed = freed ? : 1;
> +			break;
> +		}
>  	}
>  
>  	srcu_read_unlock(&shrinker_srcu, srcu_idx);

Going even further, for memcg shrinkers we may unlock SRCU and continue the iteration from the same shrinker id:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27ef9946ae8a..0b197bba1257 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 LIST_HEAD(shrinker_list);
 DEFINE_MUTEX(shrinker_mutex);
 DEFINE_SRCU(shrinker_srcu);
+static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
 
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
@@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
 	mutex_unlock(&shrinker_mutex);
 
+	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 
 	debugfs_remove_recursive(debugfs_entry);
@@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
  */
 void synchronize_shrinkers(void)
 {
+	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 }
 EXPORT_SYMBOL(synchronize_shrinkers);
@@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
-	int srcu_idx;
-	int i;
+	int srcu_idx, generation;
+	int i = 0;
 
 	if (!mem_cgroup_online(memcg))
 		return 0;
-
+again:
 	srcu_idx = srcu_read_lock(&shrinker_srcu);
 	info = shrinker_info_srcu(memcg, nid);
 	if (unlikely(!info))
 		goto unlock;
 
-	for_each_set_bit(i, info->map, info->map_nr_max) {
+	generation = atomic_read(&shrinker_srcu_generation);
+	for_each_set_bit_from(i, info->map, info->map_nr_max) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 				set_shrinker_bit(memcg, nid, i);
 		}
 		freed += ret;
+
+		if (atomic_read(&shrinker_srcu_generation) != generation) {
+			srcu_read_unlock(&shrinker_srcu, srcu_idx);
+			goto again;
+		}
 	}
 unlock:
 	srcu_read_unlock(&shrinker_srcu, srcu_idx);
@@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
-	int srcu_idx;
+	int srcu_idx, generation;
 
 	/*
 	 * The root memcg might be allocated even though memcg is disabled
@@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
 	srcu_idx = srcu_read_lock(&shrinker_srcu);
+	generation = atomic_read(&shrinker_srcu_generation);
 
 	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
 				 srcu_read_lock_held(&shrinker_srcu)) {
@@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
+
+		if (atomic_read(&shrinker_srcu_generation) != generation) {
+			freed = freed ? : 1;
+			break;
+		}
 	}
 
 	srcu_read_unlock(&shrinker_srcu, srcu_idx);



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-24 21:14         ` Kirill Tkhai
@ 2023-02-25  8:08           ` Qi Zheng
  2023-02-25 15:30             ` Kirill Tkhai
  0 siblings, 1 reply; 33+ messages in thread
From: Qi Zheng @ 2023-02-25  8:08 UTC (permalink / raw)
  To: Kirill Tkhai, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel



On 2023/2/25 05:14, Kirill Tkhai wrote:
> On 25.02.2023 00:02, Kirill Tkhai wrote:
>> On 24.02.2023 07:00, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>>>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>>>> it is easy to cause blocking in the following cases:
>>>>>
>>>>> a. the write lock of shrinker_rwsem was held for too long.
>>>>>      For example, there are many memcgs in the system, which
>>>>>      causes some paths to hold locks and traverse it for too
>>>>>      long. (e.g. expand_shrinker_info())
>>>>> b. the read lock of shrinker_rwsem was held for too long,
>>>>>      and a writer came at this time. Then this writer will be
>>>>>      forced to wait and block all subsequent readers.
>>>>>      For example:
>>>>>      - be scheduled when the read lock of shrinker_rwsem is
>>>>>        held in do_shrink_slab()
>>>>>      - some shrinker are blocked for too long. Like the case
>>>>>        mentioned in the patchset[1].
>>>>>
>>>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>>>> but they all gave up because SRCU was not unconditionally
>>>>> enabled.
>>>>>
>>>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>>>> the SRCU is unconditionally enabled. So it's time to use
>>>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>>>
>>>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>>>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>>    mm/vmscan.c | 27 +++++++++++----------------
>>>>>    1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 9f895ca6216c..02987a6f95d1 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>      LIST_HEAD(shrinker_list);
>>>>>    DECLARE_RWSEM(shrinker_rwsem);
>>>>> +DEFINE_SRCU(shrinker_srcu);
>>>>>      #ifdef CONFIG_MEMCG
>>>>>    static int shrinker_nr_max;
>>>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>>>    void register_shrinker_prepared(struct shrinker *shrinker)
>>>>>    {
>>>>>        down_write(&shrinker_rwsem);
>>>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>>        shrinker->flags |= SHRINKER_REGISTERED;
>>>>>        shrinker_debugfs_add(shrinker);
>>>>>        up_write(&shrinker_rwsem);
>>>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>            return;
>>>>>          down_write(&shrinker_rwsem);
>>>>> -    list_del(&shrinker->list);
>>>>> +    list_del_rcu(&shrinker->list);
>>>>>        shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>>        if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>>            unregister_memcg_shrinker(shrinker);
>>>>>        debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>        up_write(&shrinker_rwsem);
>>>>>    +    synchronize_srcu(&shrinker_srcu);
>>>>> +
>>>>>        debugfs_remove_recursive(debugfs_entry);
>>>>>          kfree(shrinker->nr_deferred);
>>>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>>>    {
>>>>>        down_write(&shrinker_rwsem);
>>>>>        up_write(&shrinker_rwsem);
>>>>> +    synchronize_srcu(&shrinker_srcu);
>>>>>    }
>>>>>    EXPORT_SYMBOL(synchronize_shrinkers);
>>>>>    @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>    {
>>>>>        unsigned long ret, freed = 0;
>>>>>        struct shrinker *shrinker;
>>>>> +    int srcu_idx;
>>>>>          /*
>>>>>         * The root memcg might be allocated even though memcg is disabled
>>>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>        if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>    -    if (!down_read_trylock(&shrinker_rwsem))
>>>>> -        goto out;
>>>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>    -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>>>            struct shrink_control sc = {
>>>>>                .gfp_mask = gfp_mask,
>>>>>                .nid = nid,
>>>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>            if (ret == SHRINK_EMPTY)
>>>>>                ret = 0;
>>>>>            freed += ret;
>>>>> -        /*
>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>> -         * prevent the registration from being stalled for long periods
>>>>> -         * by parallel ongoing shrinking.
>>>>> -         */
>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> -            freed = freed ? : 1;
>>>>> -            break;
>>>>> -        }
>>>>>        }
>>>>>    -    up_read(&shrinker_rwsem);
>>>>> -out:
>>>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>        cond_resched();
>>>>>        return freed;
>>>>>    }
>>>>> -- 
>>>>> 2.20.1
>>>>>
>>>>>
>>>>
>>>> Hi Qi,
>>>>
>>>> A different problem I realized after my old attempt to use SRCU was that the
>>>> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
>>>> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
>>>> these days, and SRCU is too unfair to the unregister path IMO.
>>>
>>> Hi Sultan,
>>>
>>> IIUC, for unregister_shrinker(), the wait time is hardly longer with
>>> SRCU than with shrinker_rwsem before.
>>>
>>> And I just did a simple test. After using the script in cover letter to
>>> increase the shrink_slab hotspot, I did umount 1k times at the same
>>> time, and then I used bpftrace to measure the time consumption of
>>> unregister_shrinker() as follows:
>>>
>>> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; } kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>>>
>>> @ns[umount]:
>>> [16K, 32K)             3 |      |
>>> [32K, 64K)            66 |@@@@@@@@@@      |
>>> [64K, 128K)           32 |@@@@@      |
>>> [128K, 256K)          22 |@@@      |
>>> [256K, 512K)          48 |@@@@@@@      |
>>> [512K, 1M)            19 |@@@      |
>>> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
>>> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
>>> [8M, 16M)             55 |@@@@@@@@@
>>>
>>> I see that the highest time-consuming of unregister_shrinker() is between 8ms and 16ms, which feels tolerable?

Hi Kirill,

>>
>> The fundamental difference is that before the patchset this for_each_set_bit() iteration could be broken in the middle
>> of two do_shrink_slab() calls, while after the patchset we can leave for_each_set_bit() only after visiting all set bits.

After looking at the git log[1], I saw that we originally introduced
the rwsem_is_contended() check here to avoid blocking register_shrinker(),
not unregister_shrinker().

So I am curious, do we really care about the speed of
unregister_shrinker()?

And after using SRCU, register_shrinker() will not be blocked by slab
shrinking at all.

[1]. https://github.com/torvalds/linux/commit/e496612

>>
>> Using only synchronize_srcu_expedited() won't help here.
>>
>> My opinion is we should restore a check similar to the rwsem_is_contendent() check that we had before. Something like

If we really care about the speed of unregister_shrinker() as much as
that of register_shrinker(), I think this is a good idea. It at least
guarantees that unregister_shrinker() does not get any slower. :)

>> the below on top of your patchset merged into appropriate patch:
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 27ef9946ae8a..50e7812468ec 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>   LIST_HEAD(shrinker_list);
>>   DEFINE_MUTEX(shrinker_mutex);
>>   DEFINE_SRCU(shrinker_srcu);
>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>   
>>   #ifdef CONFIG_MEMCG
>>   static int shrinker_nr_max;
>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>   	debugfs_entry = shrinker_debugfs_remove(shrinker);
>>   	mutex_unlock(&shrinker_mutex);
>>   
>> +	atomic_inc(&shrinker_srcu_generation);
>>   	synchronize_srcu(&shrinker_srcu);
>>   
>>   	debugfs_remove_recursive(debugfs_entry);
>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>    */
>>   void synchronize_shrinkers(void)
>>   {
>> +	atomic_inc(&shrinker_srcu_generation);
>>   	synchronize_srcu(&shrinker_srcu);
>>   }
>>   EXPORT_SYMBOL(synchronize_shrinkers);
>> @@ -908,7 +911,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>   {
>>   	struct shrinker_info *info;
>>   	unsigned long ret, freed = 0;
>> -	int srcu_idx;
>> +	int srcu_idx, generation;
>>   	int i;
>>   
>>   	if (!mem_cgroup_online(memcg))
>> @@ -919,6 +922,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>   	if (unlikely(!info))
>>   		goto unlock;
>>   
>> +	generation = atomic_read(&shrinker_srcu_generation);
>>   	for_each_set_bit(i, info->map, info->map_nr_max) {
>>   		struct shrink_control sc = {
>>   			.gfp_mask = gfp_mask,
>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>   				set_shrinker_bit(memcg, nid, i);
>>   		}
>>   		freed += ret;
>> +
>> +		if (atomic_read(&shrinker_srcu_generation) != generation) {
>> +			freed = freed ? : 1;
>> +			break;
>> +		}
>>   	}
>>   unlock:
>>   	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   {
>>   	unsigned long ret, freed = 0;
>>   	struct shrinker *shrinker;
>> -	int srcu_idx;
>> +	int srcu_idx, generation;
>>   
>>   	/*
>>   	 * The root memcg might be allocated even though memcg is disabled
>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>   
>>   	srcu_idx = srcu_read_lock(&shrinker_srcu);
>> +	generation = atomic_read(&shrinker_srcu_generation);
>>   
>>   	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>   				 srcu_read_lock_held(&shrinker_srcu)) {
>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   		if (ret == SHRINK_EMPTY)
>>   			ret = 0;
>>   		freed += ret;
>> +
>> +		if (atomic_read(&shrinker_srcu_generation) != generation) {
>> +			freed = freed ? : 1;
>> +			break;
>> +		}
>>   	}
>>   
>>   	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> 
> Even more, for memcg shrinkers we may unlock SRCU and continue iterations from the same shrinker id:

Maybe we can also do this for global slab shrink? Like below:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffddbd204259..9d8c53075298 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1012,7 +1012,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                                  int priority)
  {
         unsigned long ret, freed = 0;
-       struct shrinker *shrinker;
+       struct shrinker *shrinker = NULL;
         int srcu_idx, generation;

         /*
@@ -1025,11 +1025,15 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
         if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
                 return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

+again:
         srcu_idx = srcu_read_lock(&shrinker_srcu);

         generation = atomic_read(&shrinker_srcu_generation);
-       list_for_each_entry_srcu(shrinker, &shrinker_list, list,
-                                srcu_read_lock_held(&shrinker_srcu)) {
+       if (!shrinker)
+               shrinker = list_entry_rcu(shrinker_list.next, struct shrinker, list);
+       else
+               shrinker = list_entry_rcu(shrinker->list.next, struct shrinker, list);
+       list_for_each_entry_from_rcu(shrinker, &shrinker_list, list) {
                 struct shrink_control sc = {
                         .gfp_mask = gfp_mask,
                         .nid = nid,
@@ -1042,8 +1046,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                 freed += ret;

                 if (atomic_read(&shrinker_srcu_generation) != generation) {
-                       freed = freed ? : 1;
-                       break;
+                       srcu_read_unlock(&shrinker_srcu, srcu_idx);
+                       cond_resched();
+                       goto again;
                 }
         }

> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 27ef9946ae8a..0b197bba1257 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>   LIST_HEAD(shrinker_list);
>   DEFINE_MUTEX(shrinker_mutex);
>   DEFINE_SRCU(shrinker_srcu);
> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>   
>   #ifdef CONFIG_MEMCG
>   static int shrinker_nr_max;
> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>   	debugfs_entry = shrinker_debugfs_remove(shrinker);
>   	mutex_unlock(&shrinker_mutex);
>   
> +	atomic_inc(&shrinker_srcu_generation);
>   	synchronize_srcu(&shrinker_srcu);
>   
>   	debugfs_remove_recursive(debugfs_entry);
> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>    */
>   void synchronize_shrinkers(void)
>   {
> +	atomic_inc(&shrinker_srcu_generation);
>   	synchronize_srcu(&shrinker_srcu);
>   }
>   EXPORT_SYMBOL(synchronize_shrinkers);
> @@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>   {
>   	struct shrinker_info *info;
>   	unsigned long ret, freed = 0;
> -	int srcu_idx;
> -	int i;
> +	int srcu_idx, generation;
> +	int i = 0;
>   
>   	if (!mem_cgroup_online(memcg))
>   		return 0;
> -
> +again:
>   	srcu_idx = srcu_read_lock(&shrinker_srcu);
>   	info = shrinker_info_srcu(memcg, nid);
>   	if (unlikely(!info))
>   		goto unlock;
>   
> -	for_each_set_bit(i, info->map, info->map_nr_max) {
> +	generation = atomic_read(&shrinker_srcu_generation);
> +	for_each_set_bit_from(i, info->map, info->map_nr_max) {
>   		struct shrink_control sc = {
>   			.gfp_mask = gfp_mask,
>   			.nid = nid,
> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>   				set_shrinker_bit(memcg, nid, i);
>   		}
>   		freed += ret;
> +
> +		if (atomic_read(&shrinker_srcu_generation) != generation) {
> +			srcu_read_unlock(&shrinker_srcu, srcu_idx);

Maybe we can add the following code here, so as to avoid re-running the
current shrinker id and to avoid triggering a softlockup:

			i++;
			cond_resched();
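
i.e. (untested, just to show where I mean):

		if (atomic_read(&shrinker_srcu_generation) != generation) {
			srcu_read_unlock(&shrinker_srcu, srcu_idx);
			i++;		/* don't re-run the id we just visited */
			cond_resched();
			goto again;
		}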

Thanks,
Qi

> +			goto again;
> +		}
>   	}
>   unlock:
>   	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   {
>   	unsigned long ret, freed = 0;
>   	struct shrinker *shrinker;
> -	int srcu_idx;
> +	int srcu_idx, generation;
>   
>   	/*
>   	 * The root memcg might be allocated even though memcg is disabled
> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>   
>   	srcu_idx = srcu_read_lock(&shrinker_srcu);
> +	generation = atomic_read(&shrinker_srcu_generation);
>   
>   	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>   				 srcu_read_lock_held(&shrinker_srcu)) {
> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   		if (ret == SHRINK_EMPTY)
>   			ret = 0;
>   		freed += ret;
> +
> +		if (atomic_read(&shrinker_srcu_generation) != generation) {
> +			freed = freed ? : 1;
> +			break;
> +		}
>   	}
>   
>   	srcu_read_unlock(&shrinker_srcu, srcu_idx);
> 
> 

-- 
Thanks,
Qi

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-02-23 13:27 ` [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
@ 2023-02-25  8:18   ` Qi Zheng
  2023-02-25 15:14     ` Kirill Tkhai
  0 siblings, 1 reply; 33+ messages in thread
From: Qi Zheng @ 2023-02-25  8:18 UTC (permalink / raw)
  To: tkhai
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi



On 2023/2/23 21:27, Qi Zheng wrote:
> To prepare for the subsequent lockless memcg slab shrink,
> add a map_nr_max field to struct shrinker_info to records
> its own real shrinker_nr_max.
> 
> No functional changes.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

I missed a Suggested-by here. Hi Kirill, can I add it?

Suggested-by: Kirill Tkhai <tkhai@ya.ru>

> ---
>   include/linux/memcontrol.h |  1 +
>   mm/vmscan.c                | 29 ++++++++++++++++++-----------
>   2 files changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b6eda2ab205d..aa69ea98e2d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -97,6 +97,7 @@ struct shrinker_info {
>   	struct rcu_head rcu;
>   	atomic_long_t *nr_deferred;
>   	unsigned long *map;
> +	int map_nr_max;
>   };
>   
>   struct lruvec_stats_percpu {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..9f895ca6216c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>   					 lockdep_is_held(&shrinker_rwsem));
>   }
>   
> +static inline bool need_expand(int new_nr_max, int old_nr_max)
> +{
> +	return round_up(new_nr_max, BITS_PER_LONG) >
> +	       round_up(old_nr_max, BITS_PER_LONG);
> +}
> +
>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>   				    int map_size, int defer_size,
> -				    int old_map_size, int old_defer_size)
> +				    int old_map_size, int old_defer_size,
> +				    int new_nr_max)
>   {
>   	struct shrinker_info *new, *old;
>   	struct mem_cgroup_per_node *pn;
> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>   		if (!old)
>   			return 0;
>   
> +		if (!need_expand(new_nr_max, old->map_nr_max))
> +			return 0;
> +
>   		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>   		if (!new)
>   			return -ENOMEM;
>   
>   		new->nr_deferred = (atomic_long_t *)(new + 1);
>   		new->map = (void *)new->nr_deferred + defer_size;
> +		new->map_nr_max = new_nr_max;
>   
>   		/* map: set all old bits, clear all new bits */
>   		memset(new->map, (int)0xff, old_map_size);
> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>   		}
>   		info->nr_deferred = (atomic_long_t *)(info + 1);
>   		info->map = (void *)info->nr_deferred + defer_size;
> +		info->map_nr_max = shrinker_nr_max;
>   		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>   	}
>   	up_write(&shrinker_rwsem);
> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>   	return ret;
>   }
>   
> -static inline bool need_expand(int nr_max)
> -{
> -	return round_up(nr_max, BITS_PER_LONG) >
> -	       round_up(shrinker_nr_max, BITS_PER_LONG);
> -}
> -
>   static int expand_shrinker_info(int new_id)
>   {
>   	int ret = 0;
> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
>   	int old_map_size, old_defer_size = 0;
>   	struct mem_cgroup *memcg;
>   
> -	if (!need_expand(new_nr_max))
> +	if (!need_expand(new_nr_max, shrinker_nr_max))
>   		goto out;
>   
>   	if (!root_mem_cgroup)
> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
>   	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>   	do {
>   		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> -					       old_map_size, old_defer_size);
> +					       old_map_size, old_defer_size,
> +					       new_nr_max);
>   		if (ret) {
>   			mem_cgroup_iter_break(NULL, memcg);
>   			goto out;
> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>   	for_each_node(nid) {
>   		child_info = shrinker_info_protected(memcg, nid);
>   		parent_info = shrinker_info_protected(parent, nid);
> -		for (i = 0; i < shrinker_nr_max; i++) {
> +		for (i = 0; i < child_info->map_nr_max; i++) {
>   			nr = atomic_long_read(&child_info->nr_deferred[i]);
>   			atomic_long_add(nr, &parent_info->nr_deferred[i]);
>   		}
> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>   	if (unlikely(!info))
>   		goto unlock;
>   
> -	for_each_set_bit(i, info->map, shrinker_nr_max) {
> +	for_each_set_bit(i, info->map, info->map_nr_max) {
>   		struct shrink_control sc = {
>   			.gfp_mask = gfp_mask,
>   			.nid = nid,

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-02-25  8:18   ` Qi Zheng
@ 2023-02-25 15:14     ` Kirill Tkhai
  2023-02-25 15:52       ` Qi Zheng
  2023-02-26 13:54       ` Qi Zheng
  0 siblings, 2 replies; 33+ messages in thread
From: Kirill Tkhai @ 2023-02-25 15:14 UTC (permalink / raw)
  To: Qi Zheng
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi

Hi Qi,

On 25.02.2023 11:18, Qi Zheng wrote:
> 
> 
> On 2023/2/23 21:27, Qi Zheng wrote:
>> To prepare for the subsequent lockless memcg slab shrink,
>> add a map_nr_max field to struct shrinker_info to record
>> its own real shrinker_nr_max.
>>
>> No functional changes.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> I missed a Suggested-by here. Hi Kirill, can I add it?
> 
> Suggested-by: Kirill Tkhai <tkhai@ya.ru>

Yes, feel free to add this tag.

There is a comment below.

>> ---
>>   include/linux/memcontrol.h |  1 +
>>   mm/vmscan.c                | 29 ++++++++++++++++++-----------
>>   2 files changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index b6eda2ab205d..aa69ea98e2d8 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>       struct rcu_head rcu;
>>       atomic_long_t *nr_deferred;
>>       unsigned long *map;
>> +    int map_nr_max;
>>   };
>>     struct lruvec_stats_percpu {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9c1c5e8b24b8..9f895ca6216c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>>                        lockdep_is_held(&shrinker_rwsem));
>>   }
>>   +static inline bool need_expand(int new_nr_max, int old_nr_max)
>> +{
>> +    return round_up(new_nr_max, BITS_PER_LONG) >
>> +           round_up(old_nr_max, BITS_PER_LONG);
>> +}
>> +
>>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>                       int map_size, int defer_size,
>> -                    int old_map_size, int old_defer_size)
>> +                    int old_map_size, int old_defer_size,
>> +                    int new_nr_max)
>>   {
>>       struct shrinker_info *new, *old;
>>       struct mem_cgroup_per_node *pn;
>> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>           if (!old)
>>               return 0;
>>   +        if (!need_expand(new_nr_max, old->map_nr_max))
>> +            return 0;
>> +
>>           new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>           if (!new)
>>               return -ENOMEM;
>>             new->nr_deferred = (atomic_long_t *)(new + 1);
>>           new->map = (void *)new->nr_deferred + defer_size;
>> +        new->map_nr_max = new_nr_max;
>>             /* map: set all old bits, clear all new bits */
>>           memset(new->map, (int)0xff, old_map_size);
>> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>           }
>>           info->nr_deferred = (atomic_long_t *)(info + 1);
>>           info->map = (void *)info->nr_deferred + defer_size;
>> +        info->map_nr_max = shrinker_nr_max;
>>           rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>       }
>>       up_write(&shrinker_rwsem);
>> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>       return ret;
>>   }
>>   -static inline bool need_expand(int nr_max)
>> -{
>> -    return round_up(nr_max, BITS_PER_LONG) >
>> -           round_up(shrinker_nr_max, BITS_PER_LONG);
>> -}
>> -
>>   static int expand_shrinker_info(int new_id)
>>   {
>>       int ret = 0;
>> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
>>       int old_map_size, old_defer_size = 0;
>>       struct mem_cgroup *memcg;
>>   -    if (!need_expand(new_nr_max))
>> +    if (!need_expand(new_nr_max, shrinker_nr_max))
>>           goto out;
>>         if (!root_mem_cgroup)
>> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
>>       memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>       do {
>>           ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>> -                           old_map_size, old_defer_size);
>> +                           old_map_size, old_defer_size,
>> +                           new_nr_max);
>>           if (ret) {
>>               mem_cgroup_iter_break(NULL, memcg);
>>               goto out;
>> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>       for_each_node(nid) {
>>           child_info = shrinker_info_protected(memcg, nid);
>>           parent_info = shrinker_info_protected(parent, nid);
>> -        for (i = 0; i < shrinker_nr_max; i++) {
>> +        for (i = 0; i < child_info->map_nr_max; i++) {
>>               nr = atomic_long_read(&child_info->nr_deferred[i]);
>>               atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>           }
>> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>       if (unlikely(!info))
>>           goto unlock;
>>   -    for_each_set_bit(i, info->map, shrinker_nr_max) {
>> +    for_each_set_bit(i, info->map, info->map_nr_max) {
>>           struct shrink_control sc = {
>>               .gfp_mask = gfp_mask,
>>               .nid = nid,

The patch as a whole won't work as expected. It won't ever call shrinkers with ids from [round_down(shrinker_nr_max, sizeof(unsigned long)) + 1, shrinker_nr_max - 1].

Just replay the sequence in which we add new shrinkers:

1)We add shrinker #0:
   shrinker_nr_max = 0;

   prealloc_memcg_shrinker()
      id = 0;
      expand_shrinker_info(0)
        new_nr_max = 1;
        expand_one_shrinker_info(new_nr_max = 1)
          new->map_nr_max = 1;
        shrinker_nr_max = 1;

2)We add shrinker #1:
   prealloc_memcg_shrinker()
     id = 1;
     expand_shrinker_info(1)
       new_nr_max = 2;
       need_expand(2, 1) => false => ignore expand
       shrinker_nr_max = 2;

3)Then we call shrinker:
  shrink_slab_memcg()
    for_each_set_bit(i, info->map, 1/* info->map_nr_max */ ) {
    } => ignore shrinker #1
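
For reference, the rounding comparison that goes wrong in step 2) can be checked with a tiny
userspace sketch (BITS_PER_LONG and round_up() are approximated here, so this is only an
illustration of the arithmetic, not kernel code):

#include <stdio.h>

#define BITS_PER_LONG 64
/* userspace stand-in for the kernel's round_up() */
#define round_up(x, y) ((((x) + (y) - 1) / (y)) * (y))

/* the same comparison as need_expand() in the patch */
static int need_expand(int new_nr_max, int old_nr_max)
{
	return round_up(new_nr_max, BITS_PER_LONG) >
	       round_up(old_nr_max, BITS_PER_LONG);
}

int main(void)
{
	/* step 2): new_nr_max = 2, old->map_nr_max = 1 */
	printf("need_expand(2, 1) = %d\n", need_expand(2, 1));
	/*
	 * Prints 0: both values round up to 64, so the expand is skipped
	 * while info->map_nr_max stays at 1 and bit 1 is never scanned.
	 */
	return 0;
}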

I'd fix this patch with something like the below:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f895ca6216c..bb617a3871f1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -224,12 +224,6 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 					 lockdep_is_held(&shrinker_rwsem));
 }
 
-static inline bool need_expand(int new_nr_max, int old_nr_max)
-{
-	return round_up(new_nr_max, BITS_PER_LONG) >
-	       round_up(old_nr_max, BITS_PER_LONG);
-}
-
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 				    int map_size, int defer_size,
 				    int old_map_size, int old_defer_size,
@@ -247,9 +241,6 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 		if (!old)
 			return 0;
 
-		if (!need_expand(new_nr_max, old->map_nr_max))
-			return 0;
-
 		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
 		if (!new)
 			return -ENOMEM;
@@ -317,14 +308,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 static int expand_shrinker_info(int new_id)
 {
 	int ret = 0;
-	int new_nr_max = new_id + 1;
+	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
 	int map_size, defer_size = 0;
 	int old_map_size, old_defer_size = 0;
 	struct mem_cgroup *memcg;
 
-	if (!need_expand(new_nr_max, shrinker_nr_max))
-		goto out;
-
 	if (!root_mem_cgroup)
 		goto out;
 
@@ -359,9 +347,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 
 		rcu_read_lock();
 		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
-		/* Pairs with smp mb in shrink_slab() */
-		smp_mb__before_atomic();
-		set_bit(shrinker_id, info->map);
+		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
+			/* Pairs with smp mb in shrink_slab() */
+			smp_mb__before_atomic();
+			set_bit(shrinker_id, info->map);
+		}
 		rcu_read_unlock();
 	}
 }

(I also added a new check into set_shrinker_bit() for safety).
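
With new_nr_max rounded up to BITS_PER_LONG as in the diff above, replaying the same two
registrations shows that the very first expansion already covers id 1. A minimal userspace
sketch of just that arithmetic (again approximating round_up(), ignoring the allocation
paths, and assuming prealloc_memcg_shrinker() only calls expand_shrinker_info() when
id >= shrinker_nr_max):

#include <stdio.h>

#define BITS_PER_LONG 64
#define round_up(x, y) ((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	int shrinker_nr_max = 0, map_nr_max = 0;
	int id;

	for (id = 0; id <= 1; id++) {
		if (id >= shrinker_nr_max) {
			/* expand_shrinker_info(id) with the fix applied */
			int new_nr_max = round_up(id + 1, BITS_PER_LONG);

			map_nr_max = new_nr_max;	/* new->map_nr_max */
			shrinker_nr_max = new_nr_max;
		}
		printf("shrinker #%d: map_nr_max = %d\n", id, map_nr_max);
	}
	/*
	 * Prints 64 for both shrinkers: for_each_set_bit(i, info->map,
	 * info->map_nr_max) now reaches bit 1 (and any id up to 63)
	 * without needing another expand.
	 */
	return 0;
}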

Kirill

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-25  8:08           ` Qi Zheng
@ 2023-02-25 15:30             ` Kirill Tkhai
  2023-02-25 15:57               ` Qi Zheng
  0 siblings, 1 reply; 33+ messages in thread
From: Kirill Tkhai @ 2023-02-25 15:30 UTC (permalink / raw)
  To: Qi Zheng, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel

On 25.02.2023 11:08, Qi Zheng wrote:
> 
> 
> On 2023/2/25 05:14, Kirill Tkhai wrote:
>> On 25.02.2023 00:02, Kirill Tkhai wrote:
>>> On 24.02.2023 07:00, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>>>>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>>>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>>>>> it is easy to cause blocking in the following cases:
>>>>>>
>>>>>> a. the write lock of shrinker_rwsem was held for too long.
>>>>>>      For example, there are many memcgs in the system, which
>>>>>>      causes some paths to hold locks and traverse it for too
>>>>>>      long. (e.g. expand_shrinker_info())
>>>>>> b. the read lock of shrinker_rwsem was held for too long,
>>>>>>      and a writer came at this time. Then this writer will be
>>>>>>      forced to wait and block all subsequent readers.
>>>>>>      For example:
>>>>>>      - be scheduled when the read lock of shrinker_rwsem is
>>>>>>        held in do_shrink_slab()
>>>>>>      - some shrinker are blocked for too long. Like the case
>>>>>>        mentioned in the patchset[1].
>>>>>>
>>>>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>>>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>>>>> but they all gave up because SRCU was not unconditionally
>>>>>> enabled.
>>>>>>
>>>>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>>>>> the SRCU is unconditionally enabled. So it's time to use
>>>>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>>>>
>>>>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>>>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>>>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>>>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>>>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>>>>
>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>> ---
>>>>>>    mm/vmscan.c | 27 +++++++++++----------------
>>>>>>    1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 9f895ca6216c..02987a6f95d1 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>>      LIST_HEAD(shrinker_list);
>>>>>>    DECLARE_RWSEM(shrinker_rwsem);
>>>>>> +DEFINE_SRCU(shrinker_srcu);
>>>>>>      #ifdef CONFIG_MEMCG
>>>>>>    static int shrinker_nr_max;
>>>>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>>>>    void register_shrinker_prepared(struct shrinker *shrinker)
>>>>>>    {
>>>>>>        down_write(&shrinker_rwsem);
>>>>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>>>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>>>        shrinker->flags |= SHRINKER_REGISTERED;
>>>>>>        shrinker_debugfs_add(shrinker);
>>>>>>        up_write(&shrinker_rwsem);
>>>>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>>            return;
>>>>>>          down_write(&shrinker_rwsem);
>>>>>> -    list_del(&shrinker->list);
>>>>>> +    list_del_rcu(&shrinker->list);
>>>>>>        shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>>>        if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>>>            unregister_memcg_shrinker(shrinker);
>>>>>>        debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>>        up_write(&shrinker_rwsem);
>>>>>>    +    synchronize_srcu(&shrinker_srcu);
>>>>>> +
>>>>>>        debugfs_remove_recursive(debugfs_entry);
>>>>>>          kfree(shrinker->nr_deferred);
>>>>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>>>>    {
>>>>>>        down_write(&shrinker_rwsem);
>>>>>>        up_write(&shrinker_rwsem);
>>>>>> +    synchronize_srcu(&shrinker_srcu);
>>>>>>    }
>>>>>>    EXPORT_SYMBOL(synchronize_shrinkers);
>>>>>>    @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>    {
>>>>>>        unsigned long ret, freed = 0;
>>>>>>        struct shrinker *shrinker;
>>>>>> +    int srcu_idx;
>>>>>>          /*
>>>>>>         * The root memcg might be allocated even though memcg is disabled
>>>>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>        if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>>            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>>    -    if (!down_read_trylock(&shrinker_rwsem))
>>>>>> -        goto out;
>>>>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>>    -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>>>>            struct shrink_control sc = {
>>>>>>                .gfp_mask = gfp_mask,
>>>>>>                .nid = nid,
>>>>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>            if (ret == SHRINK_EMPTY)
>>>>>>                ret = 0;
>>>>>>            freed += ret;
>>>>>> -        /*
>>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>>> -         * prevent the registration from being stalled for long periods
>>>>>> -         * by parallel ongoing shrinking.
>>>>>> -         */
>>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>>> -            freed = freed ? : 1;
>>>>>> -            break;
>>>>>> -        }
>>>>>>        }
>>>>>>    -    up_read(&shrinker_rwsem);
>>>>>> -out:
>>>>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>>        cond_resched();
>>>>>>        return freed;
>>>>>>    }
>>>>>> -- 
>>>>>> 2.20.1
>>>>>>
>>>>>>
>>>>>
>>>>> Hi Qi,
>>>>>
>>>>> A different problem I realized after my old attempt to use SRCU was that the
>>>>> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
>>>>> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
>>>>> these days, and SRCU is too unfair to the unregister path IMO.
>>>>
>>>> Hi Sultan,
>>>>
>>>> IIUC, for unregister_shrinker(), the wait time is hardly longer with
>>>> SRCU than with shrinker_rwsem before.
>>>>
>>>> And I just did a simple test. After using the script in the cover letter to
>>>> increase the shrink_slab hotspot, I did umount 1k times at the same
>>>> time, and then I used bpftrace to measure the time consumption of
>>>> unregister_shrinker() as follows:
>>>>
>>>> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; } kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>>>>
>>>> @ns[umount]:
>>>> [16K, 32K)             3 |      |
>>>> [32K, 64K)            66 |@@@@@@@@@@      |
>>>> [64K, 128K)           32 |@@@@@      |
>>>> [128K, 256K)          22 |@@@      |
>>>> [256K, 512K)          48 |@@@@@@@      |
>>>> [512K, 1M)            19 |@@@      |
>>>> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
>>>> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
>>>> [8M, 16M)             55 |@@@@@@@@@
>>>>
>>>> I see that the worst-case latency of unregister_shrinker() is between 8ms and 16ms, which feels tolerable?
> 
> Hi Kirill,
> 
>>>
>>> The fundamental difference is that before the patchset this for_each_set_bit() iteration could be interrupted between
>>> two do_shrink_slab() calls, while after the patchset we can leave for_each_set_bit() only after visiting all set bits.
> 
> After looking at the git log[1], I saw that we originally introduced
> rwsem_is_contended() here to avoid blocking register_shrinker(),
> not unregister_shrinker().
> 
> So I am curious, do we really care about the speed of
> unregister_shrinker()?

My opinion is that, for general reasons, we should avoid long unbreakable actions in the kernel, especially when they may be called
synchronously from userspace.

We even have this as a generic rule. See check_hung_task().

Before, the longest sleep in unregister_shrinker() was a sleep waiting for the single longest do_shrink_slab().

After the patch, the longest sleep will be a sleep waiting for all do_shrink_slab() calls (all set bits in shrinker_info).

> And after using SRCU, register_shrinker() will not be blocked by slab
> shrink at all.
> 
> [1]. https://github.com/torvalds/linux/commit/e496612
> 
>>>
>>> Using only synchronize_srcu_expedited() won't help here.
>>>
>>> My opinion is we should restore a check similar to the rwsem_is_contended() check that we had before. Something like
> 
> If we really care about the speed of unregister_shrinker() like
> register_shrinker(), I think this is a good idea. This guarantees
> at least that the speed of unregister_shrinker() does not deteriorate. :)
> 
>>> the below on top of your patchset merged into appropriate patch:
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 27ef9946ae8a..50e7812468ec 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>   LIST_HEAD(shrinker_list);
>>>   DEFINE_MUTEX(shrinker_mutex);
>>>   DEFINE_SRCU(shrinker_srcu);
>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>     #ifdef CONFIG_MEMCG
>>>   static int shrinker_nr_max;
>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>       debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>       mutex_unlock(&shrinker_mutex);
>>>   +    atomic_inc(&shrinker_srcu_generation);
>>>       synchronize_srcu(&shrinker_srcu);
>>>         debugfs_remove_recursive(debugfs_entry);
>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>    */
>>>   void synchronize_shrinkers(void)
>>>   {
>>> +    atomic_inc(&shrinker_srcu_generation);
>>>       synchronize_srcu(&shrinker_srcu);
>>>   }
>>>   EXPORT_SYMBOL(synchronize_shrinkers);
>>> @@ -908,7 +911,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>   {
>>>       struct shrinker_info *info;
>>>       unsigned long ret, freed = 0;
>>> -    int srcu_idx;
>>> +    int srcu_idx, generation;
>>>       int i;
>>>         if (!mem_cgroup_online(memcg))
>>> @@ -919,6 +922,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>       if (unlikely(!info))
>>>           goto unlock;
>>>   +    generation = atomic_read(&shrinker_srcu_generation);
>>>       for_each_set_bit(i, info->map, info->map_nr_max) {
>>>           struct shrink_control sc = {
>>>               .gfp_mask = gfp_mask,
>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>                   set_shrinker_bit(memcg, nid, i);
>>>           }
>>>           freed += ret;
>>> +
>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>> +            freed = freed ? : 1;
>>> +            break;
>>> +        }
>>>       }
>>>   unlock:
>>>       srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>   {
>>>       unsigned long ret, freed = 0;
>>>       struct shrinker *shrinker;
>>> -    int srcu_idx;
>>> +    int srcu_idx, generation;
>>>         /*
>>>        * The root memcg might be allocated even though memcg is disabled
>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>         srcu_idx = srcu_read_lock(&shrinker_srcu);
>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>         list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>                    srcu_read_lock_held(&shrinker_srcu)) {
>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>           if (ret == SHRINK_EMPTY)
>>>               ret = 0;
>>>           freed += ret;
>>> +
>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>> +            freed = freed ? : 1;
>>> +            break;
>>> +        }
>>>       }
>>>         srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>
>> Even more, for memcg shrinkers we may unlock SRCU and continue iterations from the same shrinker id:
> 
> Maybe we can also do this for global slab shrink? Like below:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffddbd204259..9d8c53075298 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1012,7 +1012,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>                                  int priority)
>  {
>         unsigned long ret, freed = 0;
> -       struct shrinker *shrinker;
> +       struct shrinker *shrinker = NULL;
>         int srcu_idx, generation;
> 
>         /*
> @@ -1025,11 +1025,15 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>         if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>                 return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> 
> +again:
>         srcu_idx = srcu_read_lock(&shrinker_srcu);
> 
>         generation = atomic_read(&shrinker_srcu_generation);
> -       list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> -                                srcu_read_lock_held(&shrinker_srcu)) {
> +       if (!shrinker)
> +               shrinker = list_entry_rcu(shrinker_list.next, struct shrinker, list);
> +       else
> +               shrinker = list_entry_rcu(shrinker->list.next, struct shrinker, list);
> +       list_for_each_entry_from_rcu(shrinker, &shrinker_list, list) {
>                 struct shrink_control sc = {
>                         .gfp_mask = gfp_mask,
>                         .nid = nid,
> @@ -1042,8 +1046,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>                 freed += ret;
> 
>                 if (atomic_read(&shrinker_srcu_generation) != generation) {
> -                       freed = freed ? : 1;
> -                       break;
> +                       srcu_read_unlock(&shrinker_srcu, srcu_idx);
> +                       cond_resched();
> +                       goto again;
>                 }
>         }
> 
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 27ef9946ae8a..0b197bba1257 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>   LIST_HEAD(shrinker_list);
>>   DEFINE_MUTEX(shrinker_mutex);
>>   DEFINE_SRCU(shrinker_srcu);
>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>     #ifdef CONFIG_MEMCG
>>   static int shrinker_nr_max;
>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>       debugfs_entry = shrinker_debugfs_remove(shrinker);
>>       mutex_unlock(&shrinker_mutex);
>>   +    atomic_inc(&shrinker_srcu_generation);
>>       synchronize_srcu(&shrinker_srcu);
>>         debugfs_remove_recursive(debugfs_entry);
>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>    */
>>   void synchronize_shrinkers(void)
>>   {
>> +    atomic_inc(&shrinker_srcu_generation);
>>       synchronize_srcu(&shrinker_srcu);
>>   }
>>   EXPORT_SYMBOL(synchronize_shrinkers);
>> @@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>   {
>>       struct shrinker_info *info;
>>       unsigned long ret, freed = 0;
>> -    int srcu_idx;
>> -    int i;
>> +    int srcu_idx, generation;
>> +    int i = 0;
>>         if (!mem_cgroup_online(memcg))
>>           return 0;
>> -
>> +again:
>>       srcu_idx = srcu_read_lock(&shrinker_srcu);
>>       info = shrinker_info_srcu(memcg, nid);
>>       if (unlikely(!info))
>>           goto unlock;
>>   -    for_each_set_bit(i, info->map, info->map_nr_max) {
>> +    generation = atomic_read(&shrinker_srcu_generation);
>> +    for_each_set_bit_from(i, info->map, info->map_nr_max) {
>>           struct shrink_control sc = {
>>               .gfp_mask = gfp_mask,
>>               .nid = nid,
>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>                   set_shrinker_bit(memcg, nid, i);
>>           }
>>           freed += ret;
>> +
>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>> +            srcu_read_unlock(&shrinker_srcu, srcu_idx);
> 
> Maybe we can add the following code here, so as to avoid repeating the
> current id and avoid triggering a softlockup:
> 
>             i++;
>             cond_resched();
> 
> Thanks,
> Qi
> 
>> +            goto again;
>> +        }
>>       }
>>   unlock:
>>       srcu_read_unlock(&shrinker_srcu, srcu_idx);
>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   {
>>       unsigned long ret, freed = 0;
>>       struct shrinker *shrinker;
>> -    int srcu_idx;
>> +    int srcu_idx, generation;
>>         /*
>>        * The root memcg might be allocated even though memcg is disabled
>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>         srcu_idx = srcu_read_lock(&shrinker_srcu);
>> +    generation = atomic_read(&shrinker_srcu_generation);
>>         list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>                    srcu_read_lock_held(&shrinker_srcu)) {
>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>           if (ret == SHRINK_EMPTY)
>>               ret = 0;
>>           freed += ret;
>> +
>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>> +            freed = freed ? : 1;
>> +            break;
>> +        }
>>       }
>>         srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>
>>
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-02-25 15:14     ` Kirill Tkhai
@ 2023-02-25 15:52       ` Qi Zheng
  2023-02-26 13:54       ` Qi Zheng
  1 sibling, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-25 15:52 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi



On 2023/2/25 23:14, Kirill Tkhai wrote:
> Hi Qi,
> 
> On 25.02.2023 11:18, Qi Zheng wrote:
>>
>>
>> On 2023/2/23 21:27, Qi Zheng wrote:
>>> To prepare for the subsequent lockless memcg slab shrink,
>>> add a map_nr_max field to struct shrinker_info to record
>>> its own real shrinker_nr_max.
>>>
>>> No functional changes.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> I missed a Suggested-by here. Hi Kirill, can I add it?
>>
>> Suggested-by: Kirill Tkhai <tkhai@ya.ru>
> 
> Yes, feel free to add this tag.

Thanks.

> 
> There is a comment below.
> 
>>> ---
>>>    include/linux/memcontrol.h |  1 +
>>>    mm/vmscan.c                | 29 ++++++++++++++++++-----------
>>>    2 files changed, 19 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index b6eda2ab205d..aa69ea98e2d8 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>>        struct rcu_head rcu;
>>>        atomic_long_t *nr_deferred;
>>>        unsigned long *map;
>>> +    int map_nr_max;
>>>    };
>>>      struct lruvec_stats_percpu {
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9c1c5e8b24b8..9f895ca6216c 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>>>                         lockdep_is_held(&shrinker_rwsem));
>>>    }
>>>    +static inline bool need_expand(int new_nr_max, int old_nr_max)
>>> +{
>>> +    return round_up(new_nr_max, BITS_PER_LONG) >
>>> +           round_up(old_nr_max, BITS_PER_LONG);
>>> +}
>>> +
>>>    static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>                        int map_size, int defer_size,
>>> -                    int old_map_size, int old_defer_size)
>>> +                    int old_map_size, int old_defer_size,
>>> +                    int new_nr_max)
>>>    {
>>>        struct shrinker_info *new, *old;
>>>        struct mem_cgroup_per_node *pn;
>>> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>            if (!old)
>>>                return 0;
>>>    +        if (!need_expand(new_nr_max, old->map_nr_max))
>>> +            return 0;
>>> +
>>>            new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>>            if (!new)
>>>                return -ENOMEM;
>>>              new->nr_deferred = (atomic_long_t *)(new + 1);
>>>            new->map = (void *)new->nr_deferred + defer_size;
>>> +        new->map_nr_max = new_nr_max;
>>>              /* map: set all old bits, clear all new bits */
>>>            memset(new->map, (int)0xff, old_map_size);
>>> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>            }
>>>            info->nr_deferred = (atomic_long_t *)(info + 1);
>>>            info->map = (void *)info->nr_deferred + defer_size;
>>> +        info->map_nr_max = shrinker_nr_max;
>>>            rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>>        }
>>>        up_write(&shrinker_rwsem);
>>> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>        return ret;
>>>    }
>>>    -static inline bool need_expand(int nr_max)
>>> -{
>>> -    return round_up(nr_max, BITS_PER_LONG) >
>>> -           round_up(shrinker_nr_max, BITS_PER_LONG);
>>> -}
>>> -
>>>    static int expand_shrinker_info(int new_id)
>>>    {
>>>        int ret = 0;
>>> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
>>>        int old_map_size, old_defer_size = 0;
>>>        struct mem_cgroup *memcg;
>>>    -    if (!need_expand(new_nr_max))
>>> +    if (!need_expand(new_nr_max, shrinker_nr_max))
>>>            goto out;
>>>          if (!root_mem_cgroup)
>>> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
>>>        memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>>        do {
>>>            ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>>> -                           old_map_size, old_defer_size);
>>> +                           old_map_size, old_defer_size,
>>> +                           new_nr_max);
>>>            if (ret) {
>>>                mem_cgroup_iter_break(NULL, memcg);
>>>                goto out;
>>> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>>        for_each_node(nid) {
>>>            child_info = shrinker_info_protected(memcg, nid);
>>>            parent_info = shrinker_info_protected(parent, nid);
>>> -        for (i = 0; i < shrinker_nr_max; i++) {
>>> +        for (i = 0; i < child_info->map_nr_max; i++) {
>>>                nr = atomic_long_read(&child_info->nr_deferred[i]);
>>>                atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>>            }
>>> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>        if (unlikely(!info))
>>>            goto unlock;
>>>    -    for_each_set_bit(i, info->map, shrinker_nr_max) {
>>> +    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>            struct shrink_control sc = {
>>>                .gfp_mask = gfp_mask,
>>>                .nid = nid,
> 
> The patch as a whole won't work as expected. It won't ever call shrinkers with ids from [round_down(shrinker_nr_max, sizeof(unsigned long)) + 1, shrinker_nr_max - 1].
> 
> Just replay the sequence in which we add new shrinkers:
> 
> 1)We add shrinker #0:
>     shrinker_nr_max = 0;
> 
>     prealloc_memcg_shrinker()
>        id = 0;
>        expand_shrinker_info(0)
>          new_nr_max = 1;
>          expand_one_shrinker_info(new_nr_max = 1)
>            new->map_nr_max = 1;
>          shrinker_nr_max = 1;
> 
> 2)We add shrinker #1:
>     prealloc_memcg_shrinker()
>       id = 1;
>       expand_shrinker_info(1)
>         new_nr_max = 2;
>         need_expand(2, 1) => false => ignore expand
>         shrinker_nr_max = 2;
> 
> 3)Then we call shrinker:
>    shrink_slab_memcg()
>      for_each_set_bit(i, info->map, 1/* info->map_nr_max */ ) {
>      } => ignore shrinker #1

Oh, I got it.

> 
> I'd fix this patch with something like the below:

The fix below looks good to me; I will add it to the next version. :)

Thanks,
Qi

> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9f895ca6216c..bb617a3871f1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,12 +224,6 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>   					 lockdep_is_held(&shrinker_rwsem));
>   }
>   
> -static inline bool need_expand(int new_nr_max, int old_nr_max)
> -{
> -	return round_up(new_nr_max, BITS_PER_LONG) >
> -	       round_up(old_nr_max, BITS_PER_LONG);
> -}
> -
>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>   				    int map_size, int defer_size,
>   				    int old_map_size, int old_defer_size,
> @@ -247,9 +241,6 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>   		if (!old)
>   			return 0;
>   
> -		if (!need_expand(new_nr_max, old->map_nr_max))
> -			return 0;
> -
>   		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>   		if (!new)
>   			return -ENOMEM;
> @@ -317,14 +308,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>   static int expand_shrinker_info(int new_id)
>   {
>   	int ret = 0;
> -	int new_nr_max = new_id + 1;
> +	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
>   	int map_size, defer_size = 0;
>   	int old_map_size, old_defer_size = 0;
>   	struct mem_cgroup *memcg;
>   
> -	if (!need_expand(new_nr_max, shrinker_nr_max))
> -		goto out;
> -
>   	if (!root_mem_cgroup)
>   		goto out;
>   
> @@ -359,9 +347,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>   
>   		rcu_read_lock();
>   		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> -		/* Pairs with smp mb in shrink_slab() */
> -		smp_mb__before_atomic();
> -		set_bit(shrinker_id, info->map);
> +		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> +			/* Pairs with smp mb in shrink_slab() */
> +			smp_mb__before_atomic();
> +			set_bit(shrinker_id, info->map);
> +		}
>   		rcu_read_unlock();
>   	}
>   }
> 
> (I also added a new check into set_shrinker_bit() for safety).
> 
> Kirill

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-25 15:30             ` Kirill Tkhai
@ 2023-02-25 15:57               ` Qi Zheng
  2023-02-25 16:17                 ` Kirill Tkhai
  0 siblings, 1 reply; 33+ messages in thread
From: Qi Zheng @ 2023-02-25 15:57 UTC (permalink / raw)
  To: Kirill Tkhai, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel



On 2023/2/25 23:30, Kirill Tkhai wrote:
> On 25.02.2023 11:08, Qi Zheng wrote:
>>
>>
>> On 2023/2/25 05:14, Kirill Tkhai wrote:
>>> On 25.02.2023 00:02, Kirill Tkhai wrote:
>>>> On 24.02.2023 07:00, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>>>>>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>>>>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>>>>>> it is easy to cause blocking in the following cases:
>>>>>>>
>>>>>>> a. the write lock of shrinker_rwsem was held for too long.
>>>>>>>       For example, there are many memcgs in the system, which
>>>>>>>       causes some paths to hold locks and traverse it for too
>>>>>>>       long. (e.g. expand_shrinker_info())
>>>>>>> b. the read lock of shrinker_rwsem was held for too long,
>>>>>>>       and a writer came at this time. Then this writer will be
>>>>>>>       forced to wait and block all subsequent readers.
>>>>>>>       For example:
>>>>>>>       - be scheduled when the read lock of shrinker_rwsem is
>>>>>>>         held in do_shrink_slab()
>>>>>>>       - some shrinker are blocked for too long. Like the case
>>>>>>>         mentioned in the patchset[1].
>>>>>>>
>>>>>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>>>>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>>>>>> but they all gave up because SRCU was not unconditionally
>>>>>>> enabled.
>>>>>>>
>>>>>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>>>>>> the SRCU is unconditionally enabled. So it's time to use
>>>>>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>>>>>
>>>>>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>>>>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>>>>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>>>>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>>>>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>>>>>
>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>> ---
>>>>>>>     mm/vmscan.c | 27 +++++++++++----------------
>>>>>>>     1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index 9f895ca6216c..02987a6f95d1 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>>>       LIST_HEAD(shrinker_list);
>>>>>>>     DECLARE_RWSEM(shrinker_rwsem);
>>>>>>> +DEFINE_SRCU(shrinker_srcu);
>>>>>>>       #ifdef CONFIG_MEMCG
>>>>>>>     static int shrinker_nr_max;
>>>>>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>>>>>     void register_shrinker_prepared(struct shrinker *shrinker)
>>>>>>>     {
>>>>>>>         down_write(&shrinker_rwsem);
>>>>>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>>>>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>>>>         shrinker->flags |= SHRINKER_REGISTERED;
>>>>>>>         shrinker_debugfs_add(shrinker);
>>>>>>>         up_write(&shrinker_rwsem);
>>>>>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>>>             return;
>>>>>>>           down_write(&shrinker_rwsem);
>>>>>>> -    list_del(&shrinker->list);
>>>>>>> +    list_del_rcu(&shrinker->list);
>>>>>>>         shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>>>>         if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>>>>             unregister_memcg_shrinker(shrinker);
>>>>>>>         debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>>>         up_write(&shrinker_rwsem);
>>>>>>>     +    synchronize_srcu(&shrinker_srcu);
>>>>>>> +
>>>>>>>         debugfs_remove_recursive(debugfs_entry);
>>>>>>>           kfree(shrinker->nr_deferred);
>>>>>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>>>>>     {
>>>>>>>         down_write(&shrinker_rwsem);
>>>>>>>         up_write(&shrinker_rwsem);
>>>>>>> +    synchronize_srcu(&shrinker_srcu);
>>>>>>>     }
>>>>>>>     EXPORT_SYMBOL(synchronize_shrinkers);
>>>>>>>     @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>     {
>>>>>>>         unsigned long ret, freed = 0;
>>>>>>>         struct shrinker *shrinker;
>>>>>>> +    int srcu_idx;
>>>>>>>           /*
>>>>>>>          * The root memcg might be allocated even though memcg is disabled
>>>>>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>         if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>>>             return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>>>     -    if (!down_read_trylock(&shrinker_rwsem))
>>>>>>> -        goto out;
>>>>>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>>>     -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>>>>>             struct shrink_control sc = {
>>>>>>>                 .gfp_mask = gfp_mask,
>>>>>>>                 .nid = nid,
>>>>>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>             if (ret == SHRINK_EMPTY)
>>>>>>>                 ret = 0;
>>>>>>>             freed += ret;
>>>>>>> -        /*
>>>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>>>> -         * prevent the registration from being stalled for long periods
>>>>>>> -         * by parallel ongoing shrinking.
>>>>>>> -         */
>>>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>>>> -            freed = freed ? : 1;
>>>>>>> -            break;
>>>>>>> -        }
>>>>>>>         }
>>>>>>>     -    up_read(&shrinker_rwsem);
>>>>>>> -out:
>>>>>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>>>         cond_resched();
>>>>>>>         return freed;
>>>>>>>     }
>>>>>>> -- 
>>>>>>> 2.20.1
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Hi Qi,
>>>>>>
>>>>>> A different problem I realized after my old attempt to use SRCU was that the
>>>>>> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
>>>>>> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
>>>>>> these days, and SRCU is too unfair to the unregister path IMO.
>>>>>
>>>>> Hi Sultan,
>>>>>
>>>>> IIUC, for unregister_shrinker(), the wait time is hardly longer with
>>>>> SRCU than with shrinker_rwsem before.
>>>>>
>>>>> And I just did a simple test. After using the script in the cover letter to
>>>>> increase the shrink_slab hotspot, I did umount 1k times at the same
>>>>> time, and then I used bpftrace to measure the time consumption of
>>>>> unregister_shrinker() as follows:
>>>>>
>>>>> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; } kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>>>>>
>>>>> @ns[umount]:
>>>>> [16K, 32K)             3 |      |
>>>>> [32K, 64K)            66 |@@@@@@@@@@      |
>>>>> [64K, 128K)           32 |@@@@@      |
>>>>> [128K, 256K)          22 |@@@      |
>>>>> [256K, 512K)          48 |@@@@@@@      |
>>>>> [512K, 1M)            19 |@@@      |
>>>>> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@      |
>>>>> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>>> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
>>>>> [8M, 16M)             55 |@@@@@@@@@
>>>>>
>>>>> I see that the worst-case latency of unregister_shrinker() is between 8ms and 16ms, which feels tolerable?
>>
>> Hi Kirill,
>>
>>>>
>>>> The fundamental difference is that before the patchset this for_each_set_bit() iteration could be interrupted between
>>>> two do_shrink_slab() calls, while after the patchset we can leave for_each_set_bit() only after visiting all set bits.
>>
>> After looking at the git log[1], I saw that we originally introduced
>> rwsem_is_contended() here to avoid blocking register_shrinker(),
>> not unregister_shrinker().
>>
>> So I am curious, do we really care about the speed of
>> unregister_shrinker()?
> 
> My opinion is that, for general reasons, we should avoid long unbreakable actions in the kernel, especially when they may be called
> synchronously from userspace.

Got it.

And maybe you missed the previous comments below.

> 
> We even have this as a generic rule. See check_hung_task().
> 
> Before, the longest sleep in unregister_shrinker() was a sleep waiting for the single longest do_shrink_slab().
> 
> After the patch, the longest sleep will be a sleep waiting for all do_shrink_slab() calls (all set bits in shrinker_info).
> 
>> And after using SRCU, register_shrinker() will not be blocked by slab
>> shrink at all.
>>
>> [1]. https://github.com/torvalds/linux/commit/e496612
>>
>>>>
>>>> Using only synchronize_srcu_expedited() won't help here.
>>>>
>>>> My opinion is we should restore a check similar to the rwsem_is_contended() check that we had before. Something like
>>
>> If we really care about the speed of unregister_shrinker() like
>> register_shrinker(), I think this is a good idea. This guarantees
>> at least that the speed of unregister_shrinker() does not deteriorate. :)
>>
>>>> the below on top of your patchset merged into appropriate patch:
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 27ef9946ae8a..50e7812468ec 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>    LIST_HEAD(shrinker_list);
>>>>    DEFINE_MUTEX(shrinker_mutex);
>>>>    DEFINE_SRCU(shrinker_srcu);
>>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>>      #ifdef CONFIG_MEMCG
>>>>    static int shrinker_nr_max;
>>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>        debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>        mutex_unlock(&shrinker_mutex);
>>>>    +    atomic_inc(&shrinker_srcu_generation);
>>>>        synchronize_srcu(&shrinker_srcu);
>>>>          debugfs_remove_recursive(debugfs_entry);
>>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>     */
>>>>    void synchronize_shrinkers(void)
>>>>    {
>>>> +    atomic_inc(&shrinker_srcu_generation);
>>>>        synchronize_srcu(&shrinker_srcu);
>>>>    }
>>>>    EXPORT_SYMBOL(synchronize_shrinkers);
>>>> @@ -908,7 +911,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>    {
>>>>        struct shrinker_info *info;
>>>>        unsigned long ret, freed = 0;
>>>> -    int srcu_idx;
>>>> +    int srcu_idx, generation;
>>>>        int i;
>>>>          if (!mem_cgroup_online(memcg))
>>>> @@ -919,6 +922,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>        if (unlikely(!info))
>>>>            goto unlock;
>>>>    +    generation = atomic_read(&shrinker_srcu_generation);
>>>>        for_each_set_bit(i, info->map, info->map_nr_max) {
>>>>            struct shrink_control sc = {
>>>>                .gfp_mask = gfp_mask,
>>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>                    set_shrinker_bit(memcg, nid, i);
>>>>            }
>>>>            freed += ret;
>>>> +
>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>> +            freed = freed ? : 1;
>>>> +            break;
>>>> +        }
>>>>        }
>>>>    unlock:
>>>>        srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>    {
>>>>        unsigned long ret, freed = 0;
>>>>        struct shrinker *shrinker;
>>>> -    int srcu_idx;
>>>> +    int srcu_idx, generation;
>>>>          /*
>>>>         * The root memcg might be allocated even though memcg is disabled
>>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>          srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>          list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>                     srcu_read_lock_held(&shrinker_srcu)) {
>>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>            if (ret == SHRINK_EMPTY)
>>>>                ret = 0;
>>>>            freed += ret;
>>>> +
>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>> +            freed = freed ? : 1;
>>>> +            break;
>>>> +        }
>>>>        }
>>>>          srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>
>>> Even more, for memcg shrinkers we may unlock SRCU and continue iterations from the same shrinker id:
>>
>> Maybe we can also do this for global slab shrink? Like below:

How about this?

>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index ffddbd204259..9d8c53075298 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1012,7 +1012,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>                                   int priority)
>>   {
>>          unsigned long ret, freed = 0;
>> -       struct shrinker *shrinker;
>> +       struct shrinker *shrinker = NULL;
>>          int srcu_idx, generation;
>>
>>          /*
>> @@ -1025,11 +1025,15 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>          if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>                  return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>
>> +again:
>>          srcu_idx = srcu_read_lock(&shrinker_srcu);
>>
>>          generation = atomic_read(&shrinker_srcu_generation);
>> -       list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>> -                                srcu_read_lock_held(&shrinker_srcu)) {
>> +       if (!shrinker)
>> +               shrinker = list_entry_rcu(shrinker_list.next, struct shrinker, list);
>> +       else
>> +               shrinker = list_entry_rcu(shrinker->list.next, struct shrinker, list);
>> +       list_for_each_entry_from_rcu(shrinker, &shrinker_list, list) {
>>                  struct shrink_control sc = {
>>                          .gfp_mask = gfp_mask,
>>                          .nid = nid,
>> @@ -1042,8 +1046,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>                  freed += ret;
>>
>>                  if (atomic_read(&shrinker_srcu_generation) != generation) {
>> -                       freed = freed ? : 1;
>> -                       break;
>> +                       srcu_read_unlock(&shrinker_srcu, srcu_idx);
>> +                       cond_resched();
>> +                       goto again;
>>                  }
>>          }
>>
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 27ef9946ae8a..0b197bba1257 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>    LIST_HEAD(shrinker_list);
>>>    DEFINE_MUTEX(shrinker_mutex);
>>>    DEFINE_SRCU(shrinker_srcu);
>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>      #ifdef CONFIG_MEMCG
>>>    static int shrinker_nr_max;
>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>        debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>        mutex_unlock(&shrinker_mutex);
>>>    +    atomic_inc(&shrinker_srcu_generation);
>>>        synchronize_srcu(&shrinker_srcu);
>>>          debugfs_remove_recursive(debugfs_entry);
>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>     */
>>>    void synchronize_shrinkers(void)
>>>    {
>>> +    atomic_inc(&shrinker_srcu_generation);
>>>        synchronize_srcu(&shrinker_srcu);
>>>    }
>>>    EXPORT_SYMBOL(synchronize_shrinkers);
>>> @@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>    {
>>>        struct shrinker_info *info;
>>>        unsigned long ret, freed = 0;
>>> -    int srcu_idx;
>>> -    int i;
>>> +    int srcu_idx, generation;
>>> +    int i = 0;
>>>          if (!mem_cgroup_online(memcg))
>>>            return 0;
>>> -
>>> +again:
>>>        srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>        info = shrinker_info_srcu(memcg, nid);
>>>        if (unlikely(!info))
>>>            goto unlock;
>>>    -    for_each_set_bit(i, info->map, info->map_nr_max) {
>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>> +    for_each_set_bit_from(i, info->map, info->map_nr_max) {
>>>            struct shrink_control sc = {
>>>                .gfp_mask = gfp_mask,
>>>                .nid = nid,
>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>                    set_shrinker_bit(memcg, nid, i);
>>>            }
>>>            freed += ret;
>>> +
>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>> +            srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>
>> Maybe we can add the following code here, so as to avoid repeating the
>> current id and avoid triggering a softlockup:
>>
>>              i++;
>>              cond_resched();

And this. :)

Thanks,
Qi

>>
>> Thanks,
>> Qi
>>
>>> +            goto again;
>>> +        }
>>>        }
>>>    unlock:
>>>        srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>    {
>>>        unsigned long ret, freed = 0;
>>>        struct shrinker *shrinker;
>>> -    int srcu_idx;
>>> +    int srcu_idx, generation;
>>>          /*
>>>         * The root memcg might be allocated even though memcg is disabled
>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>          srcu_idx = srcu_read_lock(&shrinker_srcu);
>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>          list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>                     srcu_read_lock_held(&shrinker_srcu)) {
>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>            if (ret == SHRINK_EMPTY)
>>>                ret = 0;
>>>            freed += ret;
>>> +
>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>> +            freed = freed ? : 1;
>>> +            break;
>>> +        }
>>>        }
>>>          srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>
>>>
>>
> 

-- 
Thanks,
Qi

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-25 15:57               ` Qi Zheng
@ 2023-02-25 16:17                 ` Kirill Tkhai
  2023-02-25 16:37                   ` Qi Zheng
  0 siblings, 1 reply; 33+ messages in thread
From: Kirill Tkhai @ 2023-02-25 16:17 UTC (permalink / raw)
  To: Qi Zheng, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel

On 25.02.2023 18:57, Qi Zheng wrote:
> 
> 
> On 2023/2/25 23:30, Kirill Tkhai wrote:
>> On 25.02.2023 11:08, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/25 05:14, Kirill Tkhai wrote:
>>>> On 25.02.2023 00:02, Kirill Tkhai wrote:
>>>>> On 24.02.2023 07:00, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>>>>>>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>>>>>>> The shrinker_rwsem is a global lock in shrinkers subsystem,
>>>>>>>> it is easy to cause blocking in the following cases:
>>>>>>>>
>>>>>>>> a. the write lock of shrinker_rwsem was held for too long.
>>>>>>>>       For example, there are many memcgs in the system, which
>>>>>>>>       causes some paths to hold locks and traverse it for too
>>>>>>>>       long. (e.g. expand_shrinker_info())
>>>>>>>> b. the read lock of shrinker_rwsem was held for too long,
>>>>>>>>       and a writer came at this time. Then this writer will be
>>>>>>>>       forced to wait and block all subsequent readers.
>>>>>>>>       For example:
>>>>>>>>       - be scheduled when the read lock of shrinker_rwsem is
>>>>>>>>         held in do_shrink_slab()
>>>>>>>>       - some shrinker are blocked for too long. Like the case
>>>>>>>>         mentioned in the patchset[1].
>>>>>>>>
>>>>>>>> Therefore, many times in history ([2],[3],[4],[5]), some
>>>>>>>> people wanted to replace shrinker_rwsem reader with SRCU,
>>>>>>>> but they all gave up because SRCU was not unconditionally
>>>>>>>> enabled.
>>>>>>>>
>>>>>>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>>>>>>> the SRCU is unconditionally enabled. So it's time to use
>>>>>>>> SRCU to protect readers who previously held shrinker_rwsem.
>>>>>>>>
>>>>>>>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>>>>>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>>>>>>> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>>>>>>> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>>>>>>> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>>>>>>
>>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>> ---
>>>>>>>>     mm/vmscan.c | 27 +++++++++++----------------
>>>>>>>>     1 file changed, 11 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index 9f895ca6216c..02987a6f95d1 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>>>>       LIST_HEAD(shrinker_list);
>>>>>>>>     DECLARE_RWSEM(shrinker_rwsem);
>>>>>>>> +DEFINE_SRCU(shrinker_srcu);
>>>>>>>>       #ifdef CONFIG_MEMCG
>>>>>>>>     static int shrinker_nr_max;
>>>>>>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>>>>>>     void register_shrinker_prepared(struct shrinker *shrinker)
>>>>>>>>     {
>>>>>>>>         down_write(&shrinker_rwsem);
>>>>>>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>>>>>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>>>>>         shrinker->flags |= SHRINKER_REGISTERED;
>>>>>>>>         shrinker_debugfs_add(shrinker);
>>>>>>>>         up_write(&shrinker_rwsem);
>>>>>>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>>>>             return;
>>>>>>>>           down_write(&shrinker_rwsem);
>>>>>>>> -    list_del(&shrinker->list);
>>>>>>>> +    list_del_rcu(&shrinker->list);
>>>>>>>>         shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>>>>>         if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>>>>>             unregister_memcg_shrinker(shrinker);
>>>>>>>>         debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>>>>         up_write(&shrinker_rwsem);
>>>>>>>>     +    synchronize_srcu(&shrinker_srcu);
>>>>>>>> +
>>>>>>>>         debugfs_remove_recursive(debugfs_entry);
>>>>>>>>           kfree(shrinker->nr_deferred);
>>>>>>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>>>>>>     {
>>>>>>>>         down_write(&shrinker_rwsem);
>>>>>>>>         up_write(&shrinker_rwsem);
>>>>>>>> +    synchronize_srcu(&shrinker_srcu);
>>>>>>>>     }
>>>>>>>>     EXPORT_SYMBOL(synchronize_shrinkers);
>>>>>>>>     @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>>     {
>>>>>>>>         unsigned long ret, freed = 0;
>>>>>>>>         struct shrinker *shrinker;
>>>>>>>> +    int srcu_idx;
>>>>>>>>           /*
>>>>>>>>          * The root memcg might be allocated even though memcg is disabled
>>>>>>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>>         if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>>>>             return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>>>>     -    if (!down_read_trylock(&shrinker_rwsem))
>>>>>>>> -        goto out;
>>>>>>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>>>>     -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>>>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>>>>>>             struct shrink_control sc = {
>>>>>>>>                 .gfp_mask = gfp_mask,
>>>>>>>>                 .nid = nid,
>>>>>>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>>             if (ret == SHRINK_EMPTY)
>>>>>>>>                 ret = 0;
>>>>>>>>             freed += ret;
>>>>>>>> -        /*
>>>>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>>>>> -         * prevent the registration from being stalled for long periods
>>>>>>>> -         * by parallel ongoing shrinking.
>>>>>>>> -         */
>>>>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>>>>> -            freed = freed ? : 1;
>>>>>>>> -            break;
>>>>>>>> -        }
>>>>>>>>         }
>>>>>>>>     -    up_read(&shrinker_rwsem);
>>>>>>>> -out:
>>>>>>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>>>>         cond_resched();
>>>>>>>>         return freed;
>>>>>>>>     }
>>>>>>>> -- 
>>>>>>>> 2.20.1
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Hi Qi,
>>>>>>>
>>>>>>> A different problem I realized after my old attempt to use SRCU was that the
>>>>>>> unregister_shrinker() path became quite slow due to the heavy synchronize_srcu()
>>>>>>> call. Both register_shrinker() *and* unregister_shrinker() are called frequently
>>>>>>> these days, and SRCU is too unfair to the unregister path IMO.
>>>>>>
>>>>>> Hi Sultan,
>>>>>>
>>>>>> IIUC, for unregister_shrinker(), the wait time is hardly longer with
>>>>>> SRCU than with shrinker_rwsem before.
>>>>>>
>>>>>> And I just did a simple test. After using the script in the cover letter to
>>>>>> increase the shrink_slab hotspot, I ran umount 1k times at the same
>>>>>> time, and then used bpftrace to measure the time consumption of
>>>>>> unregister_shrinker() as follows:
>>>>>>
>>>>>> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; } kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
>>>>>>
>>>>>> @ns[umount]:
>>>>>> [16K, 32K)             3 |                                                    |
>>>>>> [32K, 64K)            66 |@@@@@@@@@@                                          |
>>>>>> [64K, 128K)           32 |@@@@@                                               |
>>>>>> [128K, 256K)          22 |@@@                                                 |
>>>>>> [256K, 512K)          48 |@@@@@@@                                             |
>>>>>> [512K, 1M)            19 |@@@                                                 |
>>>>>> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@                               |
>>>>>> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>>>> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
>>>>>> [8M, 16M)             55 |@@@@@@@@@                                           |
>>>>>>
>>>>>> I see that the worst-case latency of unregister_shrinker() is between 8ms and 16ms, which feels tolerable?
>>>
>>> Hi Kirill,
>>>
>>>>>
>>>>> The fundamental difference is that before the patchset this for_each_set_bit() iteration could be interrupted between
>>>>> two do_shrink_slab() calls, while after the patchset we can only leave for_each_set_bit() after visiting all set bits.
>>>
>>> After looking at the git log[1], I saw that we originally introduced
>>> rwsem_is_contended() here to avoid blocking register_shrinker(),
>>> not unregister_shrinker().
>>>
>>> So I am curious, do we really care about the speed of
>>> unregister_shrinker()?
>>
>> My opinion is that for general reasons we should avoid long unbreakable actions in the kernel, especially when they may be called
>> synchronously from userspace.
> 
> Got it.
> 
> And maybe you missed the previous comments below.

Oh, I really missed them!

>>
>> We even have this as a generic rule. See check_hung_task().
>>
>> Before, the longest sleep in unregister_shrinker() was a sleep waiting for the single longest do_shrink_slab() call.
>>
>> After the patch, the longest sleep will be a sleep waiting for all do_shrink_slab() calls (all set bits in shrinker_info).
>>
>>> And after using SRCU, register_shrinker() will not be blocked by slab
>>> shrink at all.
>>>
>>> [1]. https://github.com/torvalds/linux/commit/e496612
>>>
>>>>>
>>>>> Using only synchronize_srcu_expedited() won't help here.
>>>>>
>>>>> My opinion is we should restore a check similar to the rwsem_is_contended() check that we had before. Something like
>>>
>>> If we really care about the speed of unregister_shrinker(), like we do for
>>> register_shrinker(), I think this is a good idea. It at least guarantees
>>> that the speed of unregister_shrinker() does not deteriorate. :)
>>>
>>>>> the below on top of your patchset merged into appropriate patch:
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 27ef9946ae8a..50e7812468ec 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>    LIST_HEAD(shrinker_list);
>>>>>    DEFINE_MUTEX(shrinker_mutex);
>>>>>    DEFINE_SRCU(shrinker_srcu);
>>>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>>>      #ifdef CONFIG_MEMCG
>>>>>    static int shrinker_nr_max;
>>>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>        debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>        mutex_unlock(&shrinker_mutex);
>>>>>    +    atomic_inc(&shrinker_srcu_generation);
>>>>>        synchronize_srcu(&shrinker_srcu);
>>>>>          debugfs_remove_recursive(debugfs_entry);
>>>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>>     */
>>>>>    void synchronize_shrinkers(void)
>>>>>    {
>>>>> +    atomic_inc(&shrinker_srcu_generation);
>>>>>        synchronize_srcu(&shrinker_srcu);
>>>>>    }
>>>>>    EXPORT_SYMBOL(synchronize_shrinkers);
>>>>> @@ -908,7 +911,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>    {
>>>>>        struct shrinker_info *info;
>>>>>        unsigned long ret, freed = 0;
>>>>> -    int srcu_idx;
>>>>> +    int srcu_idx, generation;
>>>>>        int i;
>>>>>          if (!mem_cgroup_online(memcg))
>>>>> @@ -919,6 +922,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>        if (unlikely(!info))
>>>>>            goto unlock;
>>>>>    +    generation = atomic_read(&shrinker_srcu_generation);
>>>>>        for_each_set_bit(i, info->map, info->map_nr_max) {
>>>>>            struct shrink_control sc = {
>>>>>                .gfp_mask = gfp_mask,
>>>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>                    set_shrinker_bit(memcg, nid, i);
>>>>>            }
>>>>>            freed += ret;
>>>>> +
>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>> +            freed = freed ? : 1;
>>>>> +            break;
>>>>> +        }
>>>>>        }
>>>>>    unlock:
>>>>>        srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>    {
>>>>>        unsigned long ret, freed = 0;
>>>>>        struct shrinker *shrinker;
>>>>> -    int srcu_idx;
>>>>> +    int srcu_idx, generation;
>>>>>          /*
>>>>>         * The root memcg might be allocated even though memcg is disabled
>>>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>          srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>>          list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>                     srcu_read_lock_held(&shrinker_srcu)) {
>>>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>            if (ret == SHRINK_EMPTY)
>>>>>                ret = 0;
>>>>>            freed += ret;
>>>>> +
>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>> +            freed = freed ? : 1;
>>>>> +            break;
>>>>> +        }
>>>>>        }
>>>>>          srcu_read_unlock(&shrinker_srcu, srcu_idx);
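(A rough timeline of why the generation counter helps, assuming every reader
re-checks the counter after each do_shrink_slab() call, as in the diff above:

    reclaim path                               unregister_shrinker()
    ------------                               ---------------------
    srcu_read_lock(&shrinker_srcu)
    generation = atomic_read(...)   /* G */
    do_shrink_slab(shrinker A)
                                               atomic_inc(&shrinker_srcu_generation)   /* G + 1 */
                                               synchronize_srcu(&shrinker_srcu)        /* starts waiting */
    do_shrink_slab(shrinker B)
    generation check fails -> break
    srcu_read_unlock(&shrinker_srcu, idx)
                                               synchronize_srcu(&shrinker_srcu)        /* returns */

so the unregistering task waits for at most one in-flight do_shrink_slab()
call instead of the whole remaining iteration, which is roughly the behaviour
the old rwsem_is_contended() bail-out provided.)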
>>>>
>>>> Even more, for memcg shrinkers we may unlock SRCU and continue the iteration from the same shrinker id:
>>>
>>> Maybe we can also do this for global slab shrink? Like below:
> 
> How about this?
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index ffddbd204259..9d8c53075298 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1012,7 +1012,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>                                   int priority)
>>>   {
>>>          unsigned long ret, freed = 0;
>>> -       struct shrinker *shrinker;
>>> +       struct shrinker *shrinker = NULL;
>>>          int srcu_idx, generation;
>>>
>>>          /*
>>> @@ -1025,11 +1025,15 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>          if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>                  return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>
>>> +again:
>>>          srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>
>>>          generation = atomic_read(&shrinker_srcu_generation);
>>> -       list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>> -                                srcu_read_lock_held(&shrinker_srcu)) {
>>> +       if (!shrinker)
>>> +               shrinker = list_entry_rcu(shrinker_list.next, struct shrinker, list);
>>> +       else
>>> +               shrinker = list_entry_rcu(shrinker->list.next, struct shrinker, list);
>>> +       list_for_each_entry_from_rcu(shrinker, &shrinker_list, list) {
>>>                  struct shrink_control sc = {
>>>                          .gfp_mask = gfp_mask,
>>>                          .nid = nid,
>>> @@ -1042,8 +1046,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>                  freed += ret;
>>>
>>>                  if (atomic_read(&shrinker_srcu_generation) != generation) {
>>> -                       freed = freed ? : 1;
>>> -                       break;
>>> +                       srcu_read_unlock(&shrinker_srcu, srcu_idx);

After SRCU is unlocked we can't trust @shrinker anymore. So the above list_entry_rcu(shrinker->list.next)
dereferences some random (possibly already freed) memory.
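A hypothetical timeline of that use-after-free, just for illustration (not
taken from the patch):

    shrink_slab()                              unregister_shrinker()
    srcu_read_unlock(&shrinker_srcu, srcu_idx);
                                               list_del_rcu(&shrinker->list);
                                               atomic_inc(&shrinker_srcu_generation);
                                               synchronize_srcu(&shrinker_srcu);   /* no readers left -> returns */
                                               /* the owner frees the shrinker (or its container) */
    srcu_read_lock(&shrinker_srcu);            /* "again:" */
    list_entry_rcu(shrinker->list.next, ...);  /* @shrinker is already freed */

Taking the SRCU read lock again does not help, because the saved @shrinker was
not protected during the window in between.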

>>> +                       cond_resched();
>>> +                       goto again;
>>>                  }
>>>          }
>>>
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 27ef9946ae8a..0b197bba1257 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>    LIST_HEAD(shrinker_list);
>>>>    DEFINE_MUTEX(shrinker_mutex);
>>>>    DEFINE_SRCU(shrinker_srcu);
>>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>>      #ifdef CONFIG_MEMCG
>>>>    static int shrinker_nr_max;
>>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>        debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>        mutex_unlock(&shrinker_mutex);
>>>>    +    atomic_inc(&shrinker_srcu_generation);
>>>>        synchronize_srcu(&shrinker_srcu);
>>>>          debugfs_remove_recursive(debugfs_entry);
>>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>     */
>>>>    void synchronize_shrinkers(void)
>>>>    {
>>>> +    atomic_inc(&shrinker_srcu_generation);
>>>>        synchronize_srcu(&shrinker_srcu);
>>>>    }
>>>>    EXPORT_SYMBOL(synchronize_shrinkers);
>>>> @@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>    {
>>>>        struct shrinker_info *info;
>>>>        unsigned long ret, freed = 0;
>>>> -    int srcu_idx;
>>>> -    int i;
>>>> +    int srcu_idx, generation;
>>>> +    int i = 0;
>>>>          if (!mem_cgroup_online(memcg))
>>>>            return 0;
>>>> -
>>>> +again:
>>>>        srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>        info = shrinker_info_srcu(memcg, nid);
>>>>        if (unlikely(!info))
>>>>            goto unlock;
>>>>    -    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>> +    for_each_set_bit_from(i, info->map, info->map_nr_max) {
>>>>            struct shrink_control sc = {
>>>>                .gfp_mask = gfp_mask,
>>>>                .nid = nid,
>>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>                    set_shrinker_bit(memcg, nid, i);
>>>>            }
>>>>            freed += ret;
>>>> +
>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>> +            srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>
>>> Maybe we can add the following code here, so as to avoid repeating the
>>> current id and avoid triggering softlockup:
>>>
>>>              i++;

This is OK.

>>>              cond_resched();

Possible, but the existing cond_resched() in do_shrink_slab() is enough.
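Putting the two points together, the bail-out in shrink_slab_memcg() would then
presumably look like this (just a sketch combining the quoted diff with the i++
suggestion, not the posted v3):

		if (atomic_read(&shrinker_srcu_generation) != generation) {
			srcu_read_unlock(&shrinker_srcu, srcu_idx);
			i++;		/* resume after the current id, don't repeat it */
			goto again;	/* no extra cond_resched(): do_shrink_slab() already did it */
		}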

> And this. :)
> 
> Thanks,
> Qi
> 
>>>
>>> Thanks,
>>> Qi
>>>
>>>> +            goto again;
>>>> +        }
>>>>        }
>>>>    unlock:
>>>>        srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>    {
>>>>        unsigned long ret, freed = 0;
>>>>        struct shrinker *shrinker;
>>>> -    int srcu_idx;
>>>> +    int srcu_idx, generation;
>>>>          /*
>>>>         * The root memcg might be allocated even though memcg is disabled
>>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>            return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>          srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>          list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>                     srcu_read_lock_held(&shrinker_srcu)) {
>>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>            if (ret == SHRINK_EMPTY)
>>>>                ret = 0;
>>>>            freed += ret;
>>>> +
>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>> +            freed = freed ? : 1;
>>>> +            break;
>>>> +        }
>>>>        }
>>>>          srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>
>>>>
>>>
>>
> 


* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-25 16:17                 ` Kirill Tkhai
@ 2023-02-25 16:37                   ` Qi Zheng
  2023-02-25 21:28                     ` Kirill Tkhai
  0 siblings, 1 reply; 33+ messages in thread
From: Qi Zheng @ 2023-02-25 16:37 UTC (permalink / raw)
  To: Kirill Tkhai, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel



On 2023/2/26 00:17, Kirill Tkhai wrote:
> On 25.02.2023 18:57, Qi Zheng wrote:
>>
<...>
>> How about this?
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index ffddbd204259..9d8c53075298 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1012,7 +1012,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>                                    int priority)
>>>>    {
>>>>           unsigned long ret, freed = 0;
>>>> -       struct shrinker *shrinker;
>>>> +       struct shrinker *shrinker = NULL;
>>>>           int srcu_idx, generation;
>>>>
>>>>           /*
>>>> @@ -1025,11 +1025,15 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>           if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>                   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>
>>>> +again:
>>>>           srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>
>>>>           generation = atomic_read(&shrinker_srcu_generation);
>>>> -       list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>> -                                srcu_read_lock_held(&shrinker_srcu)) {
>>>> +       if (!shrinker)
>>>> +               shrinker = list_entry_rcu(shrinker_list.next, struct shrinker, list);
>>>> +       else
>>>> +               shrinker = list_entry_rcu(shrinker->list.next, struct shrinker, list);
>>>> +       list_for_each_entry_from_rcu(shrinker, &shrinker_list, list) {
>>>>                   struct shrink_control sc = {
>>>>                           .gfp_mask = gfp_mask,
>>>>                           .nid = nid,
>>>> @@ -1042,8 +1046,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>                   freed += ret;
>>>>
>>>>                   if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>> -                       freed = freed ? : 1;
>>>> -                       break;
>>>> +                       srcu_read_unlock(&shrinker_srcu, srcu_idx);
> 
> After SRCU in unlocked we can't believe @shrinker anymore. So, above list_entry_rcu(shrinker->list.next)
> dereferences some random memory.

Indeed.

> 
>>>> +                       cond_resched();
>>>> +                       goto again;
>>>>                   }
>>>>           }
>>>>
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 27ef9946ae8a..0b197bba1257 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>     LIST_HEAD(shrinker_list);
>>>>>     DEFINE_MUTEX(shrinker_mutex);
>>>>>     DEFINE_SRCU(shrinker_srcu);
>>>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>>>       #ifdef CONFIG_MEMCG
>>>>>     static int shrinker_nr_max;
>>>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>         debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>         mutex_unlock(&shrinker_mutex);
>>>>>     +    atomic_inc(&shrinker_srcu_generation);
>>>>>         synchronize_srcu(&shrinker_srcu);
>>>>>           debugfs_remove_recursive(debugfs_entry);
>>>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>>      */
>>>>>     void synchronize_shrinkers(void)
>>>>>     {
>>>>> +    atomic_inc(&shrinker_srcu_generation);
>>>>>         synchronize_srcu(&shrinker_srcu);
>>>>>     }
>>>>>     EXPORT_SYMBOL(synchronize_shrinkers);
>>>>> @@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>     {
>>>>>         struct shrinker_info *info;
>>>>>         unsigned long ret, freed = 0;
>>>>> -    int srcu_idx;
>>>>> -    int i;
>>>>> +    int srcu_idx, generation;
>>>>> +    int i = 0;
>>>>>           if (!mem_cgroup_online(memcg))
>>>>>             return 0;
>>>>> -
>>>>> +again:
>>>>>         srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>         info = shrinker_info_srcu(memcg, nid);
>>>>>         if (unlikely(!info))
>>>>>             goto unlock;
>>>>>     -    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>> +    for_each_set_bit_from(i, info->map, info->map_nr_max) {
>>>>>             struct shrink_control sc = {
>>>>>                 .gfp_mask = gfp_mask,
>>>>>                 .nid = nid,
>>>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>                     set_shrinker_bit(memcg, nid, i);
>>>>>             }
>>>>>             freed += ret;
>>>>> +
>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>> +            srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>
>>>> Maybe we can add the following code here, so as to avoid repeating the
>>>> current id and avoid triggering softlockup:
>>>>
>>>>               i++;
> 
> This is OK.
> 
>>>>               cond_resched();
> 
> Possible, existing cond_resched() in do_shrink_slab() is enough.

Yeah.

I will add this patch in the next version. May I mark you as the author
of this patch?

Thanks,
Qi

> 
>> And this. :)
>>
>> Thanks,
>> Qi
>>
>>>>
>>>> Thanks,
>>>> Qi
>>>>
>>>>> +            goto again;
>>>>> +        }
>>>>>         }
>>>>>     unlock:
>>>>>         srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>     {
>>>>>         unsigned long ret, freed = 0;
>>>>>         struct shrinker *shrinker;
>>>>> -    int srcu_idx;
>>>>> +    int srcu_idx, generation;
>>>>>           /*
>>>>>          * The root memcg might be allocated even though memcg is disabled
>>>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>             return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>           srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>>           list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>                      srcu_read_lock_held(&shrinker_srcu)) {
>>>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>             if (ret == SHRINK_EMPTY)
>>>>>                 ret = 0;
>>>>>             freed += ret;
>>>>> +
>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>> +            freed = freed ? : 1;
>>>>> +            break;
>>>>> +        }
>>>>>         }
>>>>>           srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>
>>>>>
>>>>
>>>
>>
> 

-- 
Thanks,
Qi

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-25 16:37                   ` Qi Zheng
@ 2023-02-25 21:28                     ` Kirill Tkhai
  2023-02-26 13:56                       ` Qi Zheng
  0 siblings, 1 reply; 33+ messages in thread
From: Kirill Tkhai @ 2023-02-25 21:28 UTC (permalink / raw)
  To: Qi Zheng, Sultan Alsawaf
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel

On 25.02.2023 19:37, Qi Zheng wrote:
> 
> 
> On 2023/2/26 00:17, Kirill Tkhai wrote:
>> On 25.02.2023 18:57, Qi Zheng wrote:
>>>
> <...>
>>> How about this?
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index ffddbd204259..9d8c53075298 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -1012,7 +1012,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>                                    int priority)
>>>>>    {
>>>>>           unsigned long ret, freed = 0;
>>>>> -       struct shrinker *shrinker;
>>>>> +       struct shrinker *shrinker = NULL;
>>>>>           int srcu_idx, generation;
>>>>>
>>>>>           /*
>>>>> @@ -1025,11 +1025,15 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>           if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>                   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>
>>>>> +again:
>>>>>           srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>
>>>>>           generation = atomic_read(&shrinker_srcu_generation);
>>>>> -       list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>> -                                srcu_read_lock_held(&shrinker_srcu)) {
>>>>> +       if (!shrinker)
>>>>> +               shrinker = list_entry_rcu(shrinker_list.next, struct shrinker, list);
>>>>> +       else
>>>>> +               shrinker = list_entry_rcu(shrinker->list.next, struct shrinker, list);
>>>>> +       list_for_each_entry_from_rcu(shrinker, &shrinker_list, list) {
>>>>>                   struct shrink_control sc = {
>>>>>                           .gfp_mask = gfp_mask,
>>>>>                           .nid = nid,
>>>>> @@ -1042,8 +1046,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>                   freed += ret;
>>>>>
>>>>>                   if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>> -                       freed = freed ? : 1;
>>>>> -                       break;
>>>>> +                       srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>
>> After SRCU in unlocked we can't believe @shrinker anymore. So, above list_entry_rcu(shrinker->list.next)
>> dereferences some random memory.
> 
> Indeed.
> 
>>
>>>>> +                       cond_resched();
>>>>> +                       goto again;
>>>>>                   }
>>>>>           }
>>>>>
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 27ef9946ae8a..0b197bba1257 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>>     LIST_HEAD(shrinker_list);
>>>>>>     DEFINE_MUTEX(shrinker_mutex);
>>>>>>     DEFINE_SRCU(shrinker_srcu);
>>>>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>>>>       #ifdef CONFIG_MEMCG
>>>>>>     static int shrinker_nr_max;
>>>>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>>         debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>>         mutex_unlock(&shrinker_mutex);
>>>>>>     +    atomic_inc(&shrinker_srcu_generation);
>>>>>>         synchronize_srcu(&shrinker_srcu);
>>>>>>           debugfs_remove_recursive(debugfs_entry);
>>>>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>>>      */
>>>>>>     void synchronize_shrinkers(void)
>>>>>>     {
>>>>>> +    atomic_inc(&shrinker_srcu_generation);
>>>>>>         synchronize_srcu(&shrinker_srcu);
>>>>>>     }
>>>>>>     EXPORT_SYMBOL(synchronize_shrinkers);
>>>>>> @@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>>     {
>>>>>>         struct shrinker_info *info;
>>>>>>         unsigned long ret, freed = 0;
>>>>>> -    int srcu_idx;
>>>>>> -    int i;
>>>>>> +    int srcu_idx, generation;
>>>>>> +    int i = 0;
>>>>>>           if (!mem_cgroup_online(memcg))
>>>>>>             return 0;
>>>>>> -
>>>>>> +again:
>>>>>>         srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>>         info = shrinker_info_srcu(memcg, nid);
>>>>>>         if (unlikely(!info))
>>>>>>             goto unlock;
>>>>>>     -    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>>> +    for_each_set_bit_from(i, info->map, info->map_nr_max) {
>>>>>>             struct shrink_control sc = {
>>>>>>                 .gfp_mask = gfp_mask,
>>>>>>                 .nid = nid,
>>>>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>>                     set_shrinker_bit(memcg, nid, i);
>>>>>>             }
>>>>>>             freed += ret;
>>>>>> +
>>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>>> +            srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>
>>>>> Maybe we can add the following code here, so as to avoid repeating the
>>>>> current id and avoid triggering softlockup:
>>>>>
>>>>>               i++;
>>
>> This is OK.
>>
>>>>>               cond_resched();
>>
>> Possible, existing cond_resched() in do_shrink_slab() is enough.
> 
> Yeah.
> 
> I will add this patch in the next version. May I mark you as the author
> of this patch?

I think, yes

>>
>>> And this. :)
>>>
>>> Thanks,
>>> Qi
>>>
>>>>>
>>>>> Thanks,
>>>>> Qi
>>>>>
>>>>>> +            goto again;
>>>>>> +        }
>>>>>>         }
>>>>>>     unlock:
>>>>>>         srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>     {
>>>>>>         unsigned long ret, freed = 0;
>>>>>>         struct shrinker *shrinker;
>>>>>> -    int srcu_idx;
>>>>>> +    int srcu_idx, generation;
>>>>>>           /*
>>>>>>          * The root memcg might be allocated even though memcg is disabled
>>>>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>             return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>>           srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>>>           list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>>                      srcu_read_lock_held(&shrinker_srcu)) {
>>>>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>             if (ret == SHRINK_EMPTY)
>>>>>>                 ret = 0;
>>>>>>             freed += ret;
>>>>>> +
>>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>>> +            freed = freed ? : 1;
>>>>>> +            break;
>>>>>> +        }
>>>>>>         }
>>>>>>           srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


* Re: [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-02-25 15:14     ` Kirill Tkhai
  2023-02-25 15:52       ` Qi Zheng
@ 2023-02-26 13:54       ` Qi Zheng
  1 sibling, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-26 13:54 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
	Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi



On 2023/2/25 23:14, Kirill Tkhai wrote:
> Hi Qi,
> 
> On 25.02.2023 11:18, Qi Zheng wrote:
>>
>>
>> On 2023/2/23 21:27, Qi Zheng wrote:
>>> To prepare for the subsequent lockless memcg slab shrink,
>>> add a map_nr_max field to struct shrinker_info to record
>>> its own real shrinker_nr_max.
>>>
>>> No functional changes.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> I missed Suggested-by here, hi Kirill, can I add it?
>>
>> Suggested-by: Kirill Tkhai <tkhai@ya.ru>
> 
> Yes, feel free to add this tag.
> 
> There is a comment below.
> 
>>> ---
>>>    include/linux/memcontrol.h |  1 +
>>>    mm/vmscan.c                | 29 ++++++++++++++++++-----------
>>>    2 files changed, 19 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index b6eda2ab205d..aa69ea98e2d8 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>>        struct rcu_head rcu;
>>>        atomic_long_t *nr_deferred;
>>>        unsigned long *map;
>>> +    int map_nr_max;
>>>    };
>>>      struct lruvec_stats_percpu {
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9c1c5e8b24b8..9f895ca6216c 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>>>                         lockdep_is_held(&shrinker_rwsem));
>>>    }
>>>    +static inline bool need_expand(int new_nr_max, int old_nr_max)
>>> +{
>>> +    return round_up(new_nr_max, BITS_PER_LONG) >
>>> +           round_up(old_nr_max, BITS_PER_LONG);
>>> +}
>>> +
>>>    static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>                        int map_size, int defer_size,
>>> -                    int old_map_size, int old_defer_size)
>>> +                    int old_map_size, int old_defer_size,
>>> +                    int new_nr_max)
>>>    {
>>>        struct shrinker_info *new, *old;
>>>        struct mem_cgroup_per_node *pn;
>>> @@ -240,12 +247,16 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>            if (!old)
>>>                return 0;
>>>    +        if (!need_expand(new_nr_max, old->map_nr_max))
>>> +            return 0;
>>> +
>>>            new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>>            if (!new)
>>>                return -ENOMEM;
>>>              new->nr_deferred = (atomic_long_t *)(new + 1);
>>>            new->map = (void *)new->nr_deferred + defer_size;
>>> +        new->map_nr_max = new_nr_max;
>>>              /* map: set all old bits, clear all new bits */
>>>            memset(new->map, (int)0xff, old_map_size);
>>> @@ -295,6 +306,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>            }
>>>            info->nr_deferred = (atomic_long_t *)(info + 1);
>>>            info->map = (void *)info->nr_deferred + defer_size;
>>> +        info->map_nr_max = shrinker_nr_max;
>>>            rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>>        }
>>>        up_write(&shrinker_rwsem);
>>> @@ -302,12 +314,6 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>>        return ret;
>>>    }
>>>    -static inline bool need_expand(int nr_max)
>>> -{
>>> -    return round_up(nr_max, BITS_PER_LONG) >
>>> -           round_up(shrinker_nr_max, BITS_PER_LONG);
>>> -}
>>> -
>>>    static int expand_shrinker_info(int new_id)
>>>    {
>>>        int ret = 0;
>>> @@ -316,7 +322,7 @@ static int expand_shrinker_info(int new_id)
>>>        int old_map_size, old_defer_size = 0;
>>>        struct mem_cgroup *memcg;
>>>    -    if (!need_expand(new_nr_max))
>>> +    if (!need_expand(new_nr_max, shrinker_nr_max))
>>>            goto out;
>>>          if (!root_mem_cgroup)
>>> @@ -332,7 +338,8 @@ static int expand_shrinker_info(int new_id)
>>>        memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>>        do {
>>>            ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>>> -                           old_map_size, old_defer_size);
>>> +                           old_map_size, old_defer_size,
>>> +                           new_nr_max);
>>>            if (ret) {
>>>                mem_cgroup_iter_break(NULL, memcg);
>>>                goto out;
>>> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>>        for_each_node(nid) {
>>>            child_info = shrinker_info_protected(memcg, nid);
>>>            parent_info = shrinker_info_protected(parent, nid);
>>> -        for (i = 0; i < shrinker_nr_max; i++) {
>>> +        for (i = 0; i < child_info->map_nr_max; i++) {
>>>                nr = atomic_long_read(&child_info->nr_deferred[i]);
>>>                atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>>            }
>>> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>        if (unlikely(!info))
>>>            goto unlock;
>>>    -    for_each_set_bit(i, info->map, shrinker_nr_max) {
>>> +    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>            struct shrink_control sc = {
>>>                .gfp_mask = gfp_mask,
>>>                .nid = nid,
> 
> The patch as a whole won't work as expected. It won't ever call shrinkers with ids in [round_down(shrinker_nr_max, BITS_PER_LONG) + 1, shrinker_nr_max - 1].
> 
> Just replay the sequence in which we add new shrinkers:
> 
> 1)We add shrinker #0:
>     shrinker_nr_max = 0;
> 
>     prealloc_memcg_shrinker()
>        id = 0;
>        expand_shrinker_info(0)
>          new_nr_max = 1;
>          expand_one_shrinker_info(new_nr_max = 1)
>            new->map_nr_max = 1;
>          shrinker_nr_max = 1;
> 
> 2)We add shrinker #1:
>     prealloc_memcg_shrinker()
>       id = 1;
>       expand_shrinker_info(1)
>         new_nr_max = 2;
>         need_expand(2, 1) => false => ignore expand
>         shrinker_nr_max = 2;
> 
> 3)Then we call shrinker:
>    shrink_slab_memcg()
>      for_each_set_bit(i, info->map, 1/* info->map_nr_max */ ) {
>      } => ignore shrinker #1
> 
> I'd fix this patch with something like the below:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9f895ca6216c..bb617a3871f1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,12 +224,6 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>   					 lockdep_is_held(&shrinker_rwsem));
>   }
>   
> -static inline bool need_expand(int new_nr_max, int old_nr_max)
> -{
> -	return round_up(new_nr_max, BITS_PER_LONG) >
> -	       round_up(old_nr_max, BITS_PER_LONG);
> -}
> -
>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>   				    int map_size, int defer_size,
>   				    int old_map_size, int old_defer_size,
> @@ -247,9 +241,6 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>   		if (!old)
>   			return 0;
>   
> -		if (!need_expand(new_nr_max, old->map_nr_max))
> -			return 0;
> -

Maybe we can keep this. For example, if kvmalloc_node() failed partway
through the last expansion, some shrinker_info structures may already have
been expanded, and those do not need to be expanded again.
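A hypothetical sequence showing that case (assuming the expansion is retried
after the earlier -ENOMEM):

    expand_shrinker_info(new_id)               /* new_nr_max = N */
        memcg A: expand_one_shrinker_info()    -> ok, A's map_nr_max becomes N
        memcg B: kvmalloc_node() fails         -> -ENOMEM, the whole expansion fails
    /* retried later for the same (or a larger) new_nr_max */
    expand_shrinker_info(new_id')
        memcg A: old->map_nr_max already covers N -> the per-memcg check skips it
        memcg B: gets expanded now

so keeping the per-memcg need_expand() check avoids redoing the allocations
that already succeeded.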

>   		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>   		if (!new)
>   			return -ENOMEM;
> @@ -317,14 +308,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>   static int expand_shrinker_info(int new_id)
>   {
>   	int ret = 0;
> -	int new_nr_max = new_id + 1;
> +	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
>   	int map_size, defer_size = 0;
>   	int old_map_size, old_defer_size = 0;
>   	struct mem_cgroup *memcg;
>   
> -	if (!need_expand(new_nr_max, shrinker_nr_max))
> -		goto out;
> -
>   	if (!root_mem_cgroup)
>   		goto out;
>   
> @@ -359,9 +347,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>   
>   		rcu_read_lock();
>   		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> -		/* Pairs with smp mb in shrink_slab() */
> -		smp_mb__before_atomic();
> -		set_bit(shrinker_id, info->map);
> +		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> +			/* Pairs with smp mb in shrink_slab() */
> +			smp_mb__before_atomic();
> +			set_bit(shrinker_id, info->map);
> +		}
>   		rcu_read_unlock();
>   	}
>   }
> 
> (I also added a new check into set_shrinker_bit() for safety).
> 
> Kirill
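With the rounding moved into expand_shrinker_info() (and the per-memcg check
kept, as suggested above), the replay would presumably become, on a 64-bit
machine:

    1) We add shrinker #0:
         expand_shrinker_info(0)
           new_nr_max = round_up(1, BITS_PER_LONG) = 64
           expand_one_shrinker_info(new_nr_max = 64)
             new->map_nr_max = 64;
           shrinker_nr_max = 64;

    2) We add shrinker #1:
         expand_shrinker_info(1)
           new_nr_max = 64; existing maps already cover it => nothing to reallocate
           shrinker_nr_max = 64;

    3) Then we call the shrinkers:
        shrink_slab_memcg()
          for_each_set_bit(i, info->map, 64 /* info->map_nr_max */ ) {
          } => shrinker #1 is visited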

-- 
Thanks,
Qi

* Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
  2023-02-25 21:28                     ` Kirill Tkhai
@ 2023-02-26 13:56                       ` Qi Zheng
  0 siblings, 0 replies; 33+ messages in thread
From: Qi Zheng @ 2023-02-26 13:56 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, dave, penguin-kernel, paulmck, linux-mm,
	linux-kernel, Sultan Alsawaf



On 2023/2/26 05:28, Kirill Tkhai wrote:
> On 25.02.2023 19:37, Qi Zheng wrote:
>>
>>
>> On 2023/2/26 00:17, Kirill Tkhai wrote:
>>> On 25.02.2023 18:57, Qi Zheng wrote:
>>>>
>> <...>
>>>> How about this?
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index ffddbd204259..9d8c53075298 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -1012,7 +1012,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>                                     int priority)
>>>>>>     {
>>>>>>            unsigned long ret, freed = 0;
>>>>>> -       struct shrinker *shrinker;
>>>>>> +       struct shrinker *shrinker = NULL;
>>>>>>            int srcu_idx, generation;
>>>>>>
>>>>>>            /*
>>>>>> @@ -1025,11 +1025,15 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>            if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>>                    return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>>
>>>>>> +again:
>>>>>>            srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>>
>>>>>>            generation = atomic_read(&shrinker_srcu_generation);
>>>>>> -       list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>> -                                srcu_read_lock_held(&shrinker_srcu)) {
>>>>>> +       if (!shrinker)
>>>>>> +               shrinker = list_entry_rcu(shrinker_list.next, struct shrinker, list);
>>>>>> +       else
>>>>>> +               shrinker = list_entry_rcu(shrinker->list.next, struct shrinker, list);
>>>>>> +       list_for_each_entry_from_rcu(shrinker, &shrinker_list, list) {
>>>>>>                    struct shrink_control sc = {
>>>>>>                            .gfp_mask = gfp_mask,
>>>>>>                            .nid = nid,
>>>>>> @@ -1042,8 +1046,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>                    freed += ret;
>>>>>>
>>>>>>                    if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>>> -                       freed = freed ? : 1;
>>>>>> -                       break;
>>>>>> +                       srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>
>>> After SRCU in unlocked we can't believe @shrinker anymore. So, above list_entry_rcu(shrinker->list.next)
>>> dereferences some random memory.
>>
>> Indeed.
>>
>>>
>>>>>> +                       cond_resched();
>>>>>> +                       goto again;
>>>>>>                    }
>>>>>>            }
>>>>>>
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index 27ef9946ae8a..0b197bba1257 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>>>>>      LIST_HEAD(shrinker_list);
>>>>>>>      DEFINE_MUTEX(shrinker_mutex);
>>>>>>>      DEFINE_SRCU(shrinker_srcu);
>>>>>>> +static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>>>>>>>        #ifdef CONFIG_MEMCG
>>>>>>>      static int shrinker_nr_max;
>>>>>>> @@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>>>>          debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>>>>>          mutex_unlock(&shrinker_mutex);
>>>>>>>      +    atomic_inc(&shrinker_srcu_generation);
>>>>>>>          synchronize_srcu(&shrinker_srcu);
>>>>>>>            debugfs_remove_recursive(debugfs_entry);
>>>>>>> @@ -799,6 +801,7 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>>>>       */
>>>>>>>      void synchronize_shrinkers(void)
>>>>>>>      {
>>>>>>> +    atomic_inc(&shrinker_srcu_generation);
>>>>>>>          synchronize_srcu(&shrinker_srcu);
>>>>>>>      }
>>>>>>>      EXPORT_SYMBOL(synchronize_shrinkers);
>>>>>>> @@ -908,18 +911,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>>>      {
>>>>>>>          struct shrinker_info *info;
>>>>>>>          unsigned long ret, freed = 0;
>>>>>>> -    int srcu_idx;
>>>>>>> -    int i;
>>>>>>> +    int srcu_idx, generation;
>>>>>>> +    int i = 0;
>>>>>>>            if (!mem_cgroup_online(memcg))
>>>>>>>              return 0;
>>>>>>> -
>>>>>>> +again:
>>>>>>>          srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>>>          info = shrinker_info_srcu(memcg, nid);
>>>>>>>          if (unlikely(!info))
>>>>>>>              goto unlock;
>>>>>>>      -    for_each_set_bit(i, info->map, info->map_nr_max) {
>>>>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>>>> +    for_each_set_bit_from(i, info->map, info->map_nr_max) {
>>>>>>>              struct shrink_control sc = {
>>>>>>>                  .gfp_mask = gfp_mask,
>>>>>>>                  .nid = nid,
>>>>>>> @@ -965,6 +969,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>>>>                      set_shrinker_bit(memcg, nid, i);
>>>>>>>              }
>>>>>>>              freed += ret;
>>>>>>> +
>>>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>>>> +            srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>>
>>>>>> Maybe we can add the following code here, so as to avoid repeating the
>>>>>> current id and avoid triggering softlockup:
>>>>>>
>>>>>>                i++;
>>>
>>> This is OK.
>>>
>>>>>>                cond_resched();
>>>
>>> Possible, existing cond_resched() in do_shrink_slab() is enough.
>>
>> Yeah.
>>
>> I will add this patch in the next version. May I mark you as the author
>> of this patch?
> 
> I think, yes

Thanks. :)

Qi

> 
>>>
>>>> And this. :)
>>>>
>>>> Thanks,
>>>> Qi
>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Qi
>>>>>>
>>>>>>> +            goto again;
>>>>>>> +        }
>>>>>>>          }
>>>>>>>      unlock:
>>>>>>>          srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>>> @@ -1004,7 +1013,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>      {
>>>>>>>          unsigned long ret, freed = 0;
>>>>>>>          struct shrinker *shrinker;
>>>>>>> -    int srcu_idx;
>>>>>>> +    int srcu_idx, generation;
>>>>>>>            /*
>>>>>>>           * The root memcg might be allocated even though memcg is disabled
>>>>>>> @@ -1017,6 +1026,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>              return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>>>>            srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>>>> +    generation = atomic_read(&shrinker_srcu_generation);
>>>>>>>            list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>>>>>>                       srcu_read_lock_held(&shrinker_srcu)) {
>>>>>>> @@ -1030,6 +1040,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>>>              if (ret == SHRINK_EMPTY)
>>>>>>>                  ret = 0;
>>>>>>>              freed += ret;
>>>>>>> +
>>>>>>> +        if (atomic_read(&shrinker_srcu_generation) != generation) {
>>>>>>> +            freed = freed ? : 1;
>>>>>>> +            break;
>>>>>>> +        }
>>>>>>>          }
>>>>>>>            srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 

-- 
Thanks,
Qi

Thread overview: 33+ messages
2023-02-23 13:27 [PATCH v2 0/7] make slab shrink lockless Qi Zheng
2023-02-23 13:27 ` [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
2023-02-25  8:18   ` Qi Zheng
2023-02-25 15:14     ` Kirill Tkhai
2023-02-25 15:52       ` Qi Zheng
2023-02-26 13:54       ` Qi Zheng
2023-02-23 13:27 ` [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless Qi Zheng
2023-02-23 15:26   ` Rafael Aquini
2023-02-23 15:37     ` Rafael Aquini
2023-02-24  4:09       ` Qi Zheng
2023-02-23 18:24   ` Sultan Alsawaf
2023-02-23 18:39     ` Paul E. McKenney
2023-02-23 19:18       ` Sultan Alsawaf
2023-02-24  4:00     ` Qi Zheng
2023-02-24  4:16       ` Qi Zheng
2023-02-24  8:20       ` Sultan Alsawaf
2023-02-24 10:12         ` Qi Zheng
2023-02-24 21:02       ` Kirill Tkhai
2023-02-24 21:14         ` Kirill Tkhai
2023-02-25  8:08           ` Qi Zheng
2023-02-25 15:30             ` Kirill Tkhai
2023-02-25 15:57               ` Qi Zheng
2023-02-25 16:17                 ` Kirill Tkhai
2023-02-25 16:37                   ` Qi Zheng
2023-02-25 21:28                     ` Kirill Tkhai
2023-02-26 13:56                       ` Qi Zheng
2023-02-23 13:27 ` [PATCH v2 3/7] mm: vmscan: make memcg " Qi Zheng
2023-02-23 13:27 ` [PATCH v2 4/7] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
2023-02-23 13:27 ` [PATCH v2 5/7] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-02-23 13:27 ` [PATCH v2 6/7] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
2023-02-23 13:27 ` [PATCH v2 7/7] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
2023-02-23 18:19 ` [PATCH v2 0/7] make slab shrink lockless Paul E. McKenney
2023-02-24  4:08   ` Qi Zheng
