* [PATCH v4 0/8] make slab shrink lockless
@ 2023-03-07  6:55 Qi Zheng
  2023-03-07  6:55 ` [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:55 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Hi all,

This patch series aims to make slab shrink lockless.

1. Background
=============

On our servers, we often find the following system cpu hotspots:

  52.22% [kernel]        [k] down_read_trylock
  19.60% [kernel]        [k] up_read
   8.86% [kernel]        [k] shrink_slab
   2.44% [kernel]        [k] idr_find
   1.25% [kernel]        [k] count_shadow_nodes
   1.18% [kernel]        [k] shrink_lruvec
   0.71% [kernel]        [k] mem_cgroup_iter
   0.71% [kernel]        [k] shrink_node
   0.55% [kernel]        [k] find_next_bit

We then used bpftrace to capture the call stacks as follows:

@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    pagecache_get_page+255
    filemap_fault+1361
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 1161690
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    balance_pgdat+690
    kswapd+389
    kthread+246
    ret_from_fork+31
]: 8424884
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    __do_page_cache_readahead+244
    filemap_fault+1674
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 20917631
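
Such stacks are bpftrace's kernel-stack count-map output; a probe along the
following lines can reproduce this kind of capture (the exact invocation is
not shown here, so treat this one as an assumed reconstruction):

```
# assumed reconstruction -- count kernel stacks that reach down_read_trylock()
bpftrace -e 'kprobe:down_read_trylock { @[kstack] = count(); }'
```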

We can see that down_read_trylock() of shrinker_rwsem is being called at high
frequency at that time. Because of the poor multicore scalability of atomic
operations, this can lead to a significant drop in IPC (instructions per cycle).

Moreover, shrinker_rwsem is a global read-write lock in the shrinker subsystem
which protects most operations, such as slab shrink and the registration and
unregistration of shrinkers. This can easily cause problems in the following
cases.

1) When memory pressure is high and many filesystems are mounted or unmounted
   at the same time, slab shrink will be affected (down_read_trylock() fails).

   Such as the real workload mentioned by Kirill Tkhai:

   ```
   One of the real workloads from my experience is start of an overcommitted
   node containing many starting containers after node crash (or many resuming
   containers after reboot for kernel update). In these cases memory pressure is
   huge, and the node goes round in long reclaim.
   ```

2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer
   comes in (such as mounting a filesystem), then this writer will be blocked
   and all subsequent shrinker-related operations will be blocked as well.

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/

All the above cases can be solved by replacing the shrinker_rwsem trylocks with
SRCU.
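
A condensed sketch of the reader-side change (the full diffs in patches 2-4
also handle memcg shrinkers and add a generation check):

```
/* Before: readers serialize against writers via shrinker_rwsem */
if (!down_read_trylock(&shrinker_rwsem))
	goto out;
list_for_each_entry(shrinker, &shrinker_list, list)
	freed += do_shrink_slab(&sc, shrinker, priority);
up_read(&shrinker_rwsem);

/* After: readers only enter an SRCU read-side critical section */
srcu_idx = srcu_read_lock(&shrinker_srcu);
list_for_each_entry_srcu(shrinker, &shrinker_list, list,
			 srcu_read_lock_held(&shrinker_srcu))
	freed += do_shrink_slab(&sc, shrinker, priority);
srcu_read_unlock(&shrinker_srcu, srcu_idx);
```

Writers (shrinker registration and unregistration) still serialize among
themselves, and unregister_shrinker() calls synchronize_srcu() before freeing,
so readers never see a half-torn-down shrinker.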

2. Survey
=========

Before writing the implementation, I found that there had already been many
similar submissions in the community:

a. Davidlohr Bueso submitted a patch in 2015.
   Subject: [PATCH -next v2] mm: srcu-ify shrinkers
   Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
   Result: It was finally merged into the linux-next branch, but failed on arm
           allnoconfig (without CONFIG_SRCU)

b. Tetsuo Handa submitted a patchset in 2017.
   Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
   Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
   Result: The current simple approach (break when rwsem_is_contended()) was
           chosen in the end. Christoph Hellwig suggested using SRCU, but SRCU
           was not unconditionally enabled at the time.

c. Kirill Tkhai submitted a patchset in 2018.
   Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
   Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
   Result: At that time, SRCU was not unconditionally enabled, and there were
           some objections to enabling SRCU. Later, because Kirill's focus
           moved to other things, this patchset was not updated further.

d. Sultan Alsawaf submitted a patch in 2021.
   Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
   Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
   Result: Rejected because SRCU was not unconditionally enabled.

We can see that almost all of these historical submissions were abandoned
because SRCU was not unconditionally enabled. But SRCU was unconditionally
enabled by Paul E. McKenney in 2023 [2], so it's now time to replace the
shrinker_rwsem trylocks with SRCU.

[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/

3. Reproduction and testing
===========================

We can reproduce the down_read_trylock() hotspot through the following script:

```
#!/bin/bash

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```
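
For example, assuming the script is saved as test.sh (the filename is an
assumption), the two modes are driven like this:

```
./test.sh test            # create 4000 memcgs, mount 4000 tmpfs instances, touch 3000 files
./test.sh touch 0 3000    # re-touch the files to trigger reclaim again
```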

After running the test and touch commands as shown above, we can use the
following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  32.31%  [kernel]           [k] down_read_trylock
  19.40%  [kernel]           [k] pv_native_safe_halt
  16.24%  [kernel]           [k] up_read
  15.70%  [kernel]           [k] shrink_slab
   4.69%  [kernel]           [k] _find_next_bit
   2.62%  [kernel]           [k] shrink_node
   1.78%  [kernel]           [k] shrink_lruvec
   0.76%  [kernel]           [k] do_shrink_slab

2) After applying this patchset:

  27.83%  [kernel]           [k] _find_next_bit
  16.97%  [kernel]           [k] shrink_slab
  15.82%  [kernel]           [k] pv_native_safe_halt
   9.58%  [kernel]           [k] shrink_node
   8.31%  [kernel]           [k] shrink_lruvec
   5.64%  [kernel]           [k] do_shrink_slab
   3.88%  [kernel]           [k] mem_cgroup_iter

At the same time, we use the following perf command to capture IPC information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      454187219766      cycles                    test                    ( +-  1.84% )
       78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )

        10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
      527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)

          10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops severely when down_read_trylock() is called at high
frequency. After switching to SRCU, the IPC returns to a normal level.
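
(As a quick sanity check, the IPC values follow directly from the counters:
78896433101 / 454187219766 ≈ 0.17 before, versus 527258677936 / 841954709443
≈ 0.63 after, roughly a 3.7x improvement in instructions per cycle.)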

This series is based on next-20230306.

Comments and suggestions are welcome.

Thanks,
Qi.

Changelog in v3 -> v4:
 - fix bug in [PATCH v3 1/7]
 - revise commit messages
 - rebase onto the next-20230306

Changelog in v2 -> v3:
 - fix bug in [PATCH v2 1/7] (per Kirill)
 - add Kirill's patch which restores a check similar to the rwsem_is_contended()
   check by adding shrinker_srcu_generation

Changelog in v1 -> v2:
 - add a map_nr_max field to shrinker_info (suggested by Kirill)
 - use shrinker_mutex in reparent_shrinker_deferred() (pointed out by Kirill)

Kirill Tkhai (1):
  mm: vmscan: add shrinker_srcu_generation

Qi Zheng (7):
  mm: vmscan: add a map_nr_max field to shrinker_info
  mm: vmscan: make global slab shrink lockless
  mm: vmscan: make memcg slab shrink lockless
  mm: shrinkers: make count and scan in shrinker debugfs lockless
  mm: vmscan: hold write lock to reparent shrinker nr_deferred
  mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  mm: shrinkers: convert shrinker_rwsem to mutex

 drivers/md/dm-cache-metadata.c |   2 +-
 drivers/md/dm-thin-metadata.c  |   2 +-
 fs/super.c                     |   2 +-
 include/linux/memcontrol.h     |   1 +
 mm/shrinker_debug.c            |  38 +++-----
 mm/vmscan.c                    | 166 +++++++++++++++++++--------------
 6 files changed, 112 insertions(+), 99 deletions(-)

-- 
2.20.1



* [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
@ 2023-03-07  6:55 ` Qi Zheng
  2023-03-08 14:40   ` Vlastimil Babka
  2023-03-08 22:13   ` Kirill Tkhai
  2023-03-07  6:55 ` [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless Qi Zheng
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:55 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

To prepare for the subsequent lockless memcg slab shrink,
add a map_nr_max field to struct shrinker_info to record
its own real shrinker_nr_max.

Suggested-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/memcontrol.h |  1 +
 mm/vmscan.c                | 41 ++++++++++++++++++++++----------------
 2 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..aa69ea98e2d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,7 @@ struct shrinker_info {
 	struct rcu_head rcu;
 	atomic_long_t *nr_deferred;
 	unsigned long *map;
+	int map_nr_max;
 };
 
 struct lruvec_stats_percpu {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9414226218f0..2dcc01682026 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 					 lockdep_is_held(&shrinker_rwsem));
 }
 
+static inline bool need_expand(int new_nr_max, int old_nr_max)
+{
+	return round_up(new_nr_max, BITS_PER_LONG) >
+	       round_up(old_nr_max, BITS_PER_LONG);
+}
+
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 				    int map_size, int defer_size,
-				    int old_map_size, int old_defer_size)
+				    int old_map_size, int old_defer_size,
+				    int new_nr_max)
 {
 	struct shrinker_info *new, *old;
 	struct mem_cgroup_per_node *pn;
@@ -240,12 +247,17 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 		if (!old)
 			return 0;
 
+		/* Already expanded this shrinker_info */
+		if (!need_expand(new_nr_max, old->map_nr_max))
+			continue;
+
 		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
 		if (!new)
 			return -ENOMEM;
 
 		new->nr_deferred = (atomic_long_t *)(new + 1);
 		new->map = (void *)new->nr_deferred + defer_size;
+		new->map_nr_max = new_nr_max;
 
 		/* map: set all old bits, clear all new bits */
 		memset(new->map, (int)0xff, old_map_size);
@@ -295,6 +307,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 		}
 		info->nr_deferred = (atomic_long_t *)(info + 1);
 		info->map = (void *)info->nr_deferred + defer_size;
+		info->map_nr_max = shrinker_nr_max;
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
 	}
 	up_write(&shrinker_rwsem);
@@ -302,23 +315,14 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 	return ret;
 }
 
-static inline bool need_expand(int nr_max)
-{
-	return round_up(nr_max, BITS_PER_LONG) >
-	       round_up(shrinker_nr_max, BITS_PER_LONG);
-}
-
 static int expand_shrinker_info(int new_id)
 {
 	int ret = 0;
-	int new_nr_max = new_id + 1;
+	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
 	int map_size, defer_size = 0;
 	int old_map_size, old_defer_size = 0;
 	struct mem_cgroup *memcg;
 
-	if (!need_expand(new_nr_max))
-		goto out;
-
 	if (!root_mem_cgroup)
 		goto out;
 
@@ -332,7 +336,8 @@ static int expand_shrinker_info(int new_id)
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
-					       old_map_size, old_defer_size);
+					       old_map_size, old_defer_size,
+					       new_nr_max);
 		if (ret) {
 			mem_cgroup_iter_break(NULL, memcg);
 			goto out;
@@ -352,9 +357,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 
 		rcu_read_lock();
 		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
-		/* Pairs with smp mb in shrink_slab() */
-		smp_mb__before_atomic();
-		set_bit(shrinker_id, info->map);
+		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
+			/* Pairs with smp mb in shrink_slab() */
+			smp_mb__before_atomic();
+			set_bit(shrinker_id, info->map);
+		}
 		rcu_read_unlock();
 	}
 }
@@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
-		for (i = 0; i < shrinker_nr_max; i++) {
+		for (i = 0; i < child_info->map_nr_max; i++) {
 			nr = atomic_long_read(&child_info->nr_deferred[i]);
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
@@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (unlikely(!info))
 		goto unlock;
 
-	for_each_set_bit(i, info->map, shrinker_nr_max) {
+	for_each_set_bit(i, info->map, info->map_nr_max) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
-- 
2.20.1



* [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
  2023-03-07  6:55 ` [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
@ 2023-03-07  6:55 ` Qi Zheng
  2023-03-08 15:02   ` Vlastimil Babka
  2023-03-08 22:18   ` Kirill Tkhai
  2023-03-07  6:56 ` [PATCH v4 3/8] mm: vmscan: make memcg " Qi Zheng
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:55 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

The shrinker_rwsem is a global read-write lock in
shrinkers subsystem, which protects most operations
such as slab shrink, registration and unregistration
of shrinkers, etc. This can easily cause problems in
the following cases.

1) When memory pressure is high and many filesystems
   are mounted or unmounted at the same time, slab
   shrink will be affected (down_read_trylock() fails).

   Such as the real workload mentioned by Kirill Tkhai:

   ```
   One of the real workloads from my experience is start
   of an overcommitted node containing many starting
   containers after node crash (or many resuming containers
   after reboot for kernel update). In these cases memory
   pressure is huge, and the node goes round in long reclaim.
   ```

2) If a shrinker is blocked (such as the case mentioned
   in [1]) and a writer comes in (such as mounting a
   filesystem), then this writer will be blocked and all
   subsequent shrinker-related operations will be blocked.

Even if there is no competitor when shrinking slab, there
may still be a problem. If we have a long shrinker list
and we do not reclaim enough memory with each shrinker,
then down_read_trylock() may be called at high
frequency. Because of the poor multicore scalability of
atomic operations, this can lead to a significant drop
in IPC (instructions per cycle).

Many times in history ([2],[3],[4],[5]), people have
wanted to replace the shrinker_rwsem trylock with SRCU
in the slab shrink path, but all these patches were
abandoned because SRCU was not unconditionally enabled.

But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
SRCU is unconditionally enabled. So it's time to use SRCU
to protect readers who previously held shrinker_rwsem.

This commit uses SRCU to make global slab shrink lockless;
the memcg slab shrink is handled in the subsequent patch.

[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
[3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
[4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
[5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2dcc01682026..8515ac40bcaf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 
 LIST_HEAD(shrinker_list);
 DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);
 
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
@@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 void register_shrinker_prepared(struct shrinker *shrinker)
 {
 	down_write(&shrinker_rwsem);
-	list_add_tail(&shrinker->list, &shrinker_list);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
 	up_write(&shrinker_rwsem);
@@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
 		return;
 
 	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
+	list_del_rcu(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
 	up_write(&shrinker_rwsem);
 
+	synchronize_srcu(&shrinker_srcu);
+
 	debugfs_remove_recursive(debugfs_entry);
 
 	kfree(shrinker->nr_deferred);
@@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
 {
 	down_write(&shrinker_rwsem);
 	up_write(&shrinker_rwsem);
+	synchronize_srcu(&shrinker_srcu);
 }
 EXPORT_SYMBOL(synchronize_shrinkers);
 
@@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
+	int srcu_idx;
 
 	/*
 	 * The root memcg might be allocated even though memcg is disabled
@@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		goto out;
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
 
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
+				 srcu_read_lock_held(&shrinker_srcu)) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
-		/*
-		 * Bail out if someone want to register a new shrinker to
-		 * prevent the registration from being stalled for long periods
-		 * by parallel ongoing shrinking.
-		 */
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
 	}
 
-	up_read(&shrinker_rwsem);
-out:
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	cond_resched();
 	return freed;
 }
-- 
2.20.1



* [PATCH v4 3/8] mm: vmscan: make memcg slab shrink lockless
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
  2023-03-07  6:55 ` [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
  2023-03-07  6:55 ` [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless Qi Zheng
@ 2023-03-07  6:56 ` Qi Zheng
  2023-03-08 22:23   ` Kirill Tkhai
  2023-03-08 22:46   ` Vlastimil Babka
  2023-03-07  6:56 ` [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation Qi Zheng
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:56 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Like global slab shrink, this commit also uses SRCU to make
memcg slab shrink lockless.

We can reproduce the down_read_trylock() hotspot through the
following script:

```

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run the test and touch commands.
Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  32.31%  [kernel]           [k] down_read_trylock
  19.40%  [kernel]           [k] pv_native_safe_halt
  16.24%  [kernel]           [k] up_read
  15.70%  [kernel]           [k] shrink_slab
   4.69%  [kernel]           [k] _find_next_bit
   2.62%  [kernel]           [k] shrink_node
   1.78%  [kernel]           [k] shrink_lruvec
   0.76%  [kernel]           [k] do_shrink_slab

2) After applying this patchset:

  27.83%  [kernel]           [k] _find_next_bit
  16.97%  [kernel]           [k] shrink_slab
  15.82%  [kernel]           [k] pv_native_safe_halt
   9.58%  [kernel]           [k] shrink_node
   8.31%  [kernel]           [k] shrink_lruvec
   5.64%  [kernel]           [k] do_shrink_slab
   3.88%  [kernel]           [k] mem_cgroup_iter

At the same time, we use the following perf command to capture
IPC information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      454187219766      cycles                    test                    ( +-  1.84% )
       78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )

        10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
      527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)

          10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops severely when down_read_trylock()
is called at high frequency. After switching to SRCU,
the IPC returns to a normal level.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8515ac40bcaf..1de9bc3e5aa2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/srcu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
 static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 						     int nid)
 {
-	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
-					 lockdep_is_held(&shrinker_rwsem));
+	return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
+				      &shrinker_srcu,
+				      lockdep_is_held(&shrinker_rwsem));
+}
+
+static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
+						     int nid)
+{
+	return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
+				&shrinker_srcu);
+}
+
+static void free_shrinker_info_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct shrinker_info, rcu));
 }
 
 static inline bool need_expand(int new_nr_max, int old_nr_max)
@@ -269,7 +283,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 		       defer_size - old_defer_size);
 
 		rcu_assign_pointer(pn->shrinker_info, new);
-		kvfree_rcu(old, rcu);
+		call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
 	}
 
 	return 0;
@@ -355,15 +369,16 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
 		struct shrinker_info *info;
+		int srcu_idx;
 
-		rcu_read_lock();
-		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+		srcu_idx = srcu_read_lock(&shrinker_srcu);
+		info = shrinker_info_srcu(memcg, nid);
 		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
 			/* Pairs with smp mb in shrink_slab() */
 			smp_mb__before_atomic();
 			set_bit(shrinker_id, info->map);
 		}
-		rcu_read_unlock();
+		srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	}
 }
 
@@ -377,7 +392,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -411,7 +425,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 
-	info = shrinker_info_protected(memcg, nid);
+	info = shrinker_info_srcu(memcg, nid);
 	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
 }
 
@@ -420,7 +434,7 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 
-	info = shrinker_info_protected(memcg, nid);
+	info = shrinker_info_srcu(memcg, nid);
 	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
 }
 
@@ -898,15 +912,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
+	int srcu_idx;
 	int i;
 
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
+	info = shrinker_info_srcu(memcg, nid);
 	if (unlikely(!info))
 		goto unlock;
 
@@ -956,14 +969,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 				set_shrinker_bit(memcg, nid, i);
 		}
 		freed += ret;
-
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	return freed;
 }
 #else /* CONFIG_MEMCG */
-- 
2.20.1



* [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
                   ` (2 preceding siblings ...)
  2023-03-07  6:56 ` [PATCH v4 3/8] mm: vmscan: make memcg " Qi Zheng
@ 2023-03-07  6:56 ` Qi Zheng
  2023-03-09  9:23   ` Vlastimil Babka
  2023-03-07  6:56 ` [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:56 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

After we make slab shrink lockless with SRCU, the longest
sleep in unregister_shrinker() will be the wait for all
in-flight do_shrink_slab() calls to complete.

To avoid a long unbreakable wait in unregister_shrinker(),
add shrinker_srcu_generation to restore a check similar to
the rwsem_is_contended() check that we had before.

And for memcg slab shrink, we unlock SRCU and continue
the iteration from the next shrinker id.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1de9bc3e5aa2..9a5a3da5c8b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,6 +204,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 LIST_HEAD(shrinker_list);
 DECLARE_RWSEM(shrinker_rwsem);
 DEFINE_SRCU(shrinker_srcu);
+static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
 
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
@@ -782,6 +783,7 @@ void unregister_shrinker(struct shrinker *shrinker)
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
 	up_write(&shrinker_rwsem);
 
+	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 
 	debugfs_remove_recursive(debugfs_entry);
@@ -803,6 +805,7 @@ void synchronize_shrinkers(void)
 {
 	down_write(&shrinker_rwsem);
 	up_write(&shrinker_rwsem);
+	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 }
 EXPORT_SYMBOL(synchronize_shrinkers);
@@ -912,18 +915,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
-	int srcu_idx;
-	int i;
+	int srcu_idx, generation;
+	int i = 0;
 
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
+again:
 	srcu_idx = srcu_read_lock(&shrinker_srcu);
 	info = shrinker_info_srcu(memcg, nid);
 	if (unlikely(!info))
 		goto unlock;
 
-	for_each_set_bit(i, info->map, info->map_nr_max) {
+	generation = atomic_read(&shrinker_srcu_generation);
+	for_each_set_bit_from(i, info->map, info->map_nr_max) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -969,6 +974,11 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 				set_shrinker_bit(memcg, nid, i);
 		}
 		freed += ret;
+		if (atomic_read(&shrinker_srcu_generation) != generation) {
+			srcu_read_unlock(&shrinker_srcu, srcu_idx);
+			i++;
+			goto again;
+		}
 	}
 unlock:
 	srcu_read_unlock(&shrinker_srcu, srcu_idx);
@@ -1008,7 +1018,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 {
 	unsigned long ret, freed = 0;
 	struct shrinker *shrinker;
-	int srcu_idx;
+	int srcu_idx, generation;
 
 	/*
 	 * The root memcg might be allocated even though memcg is disabled
@@ -1022,6 +1032,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 
 	srcu_idx = srcu_read_lock(&shrinker_srcu);
 
+	generation = atomic_read(&shrinker_srcu_generation);
 	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
 				 srcu_read_lock_held(&shrinker_srcu)) {
 		struct shrink_control sc = {
@@ -1034,6 +1045,11 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
+
+		if (atomic_read(&shrinker_srcu_generation) != generation) {
+			freed = freed ? : 1;
+			break;
+		}
 	}
 
 	srcu_read_unlock(&shrinker_srcu, srcu_idx);
-- 
2.20.1



* [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
                   ` (3 preceding siblings ...)
  2023-03-07  6:56 ` [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation Qi Zheng
@ 2023-03-07  6:56 ` Qi Zheng
  2023-03-09  9:36   ` Vlastimil Babka
                     ` (2 more replies)
  2023-03-07  6:56 ` [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:56 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Like global and memcg slab shrink, also use SRCU to
make the count and scan operations in the shrinker
debugfs interface lockless.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/shrinker_debug.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 39c3491e28a3..6aa7a7ec69da 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -9,6 +9,7 @@
 /* defined in vmscan.c */
 extern struct rw_semaphore shrinker_rwsem;
 extern struct list_head shrinker_list;
+extern struct srcu_struct shrinker_srcu;
 
 static DEFINE_IDA(shrinker_debugfs_ida);
 static struct dentry *shrinker_debugfs_root;
@@ -49,18 +50,13 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg;
 	unsigned long total;
 	bool memcg_aware;
-	int ret, nid;
+	int ret = 0, nid, srcu_idx;
 
 	count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
 	if (!count_per_node)
 		return -ENOMEM;
 
-	ret = down_read_killable(&shrinker_rwsem);
-	if (ret) {
-		kfree(count_per_node);
-		return ret;
-	}
-	rcu_read_lock();
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
 
 	memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
 
@@ -91,8 +87,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
 		}
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 
-	rcu_read_unlock();
-	up_read(&shrinker_rwsem);
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 
 	kfree(count_per_node);
 	return ret;
@@ -115,9 +110,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 		.gfp_mask = GFP_KERNEL,
 	};
 	struct mem_cgroup *memcg = NULL;
-	int nid;
+	int nid, srcu_idx;
 	char kbuf[72];
-	ssize_t ret;
 
 	read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
 	if (copy_from_user(kbuf, buf, read_len))
@@ -146,11 +140,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 		return -EINVAL;
 	}
 
-	ret = down_read_killable(&shrinker_rwsem);
-	if (ret) {
-		mem_cgroup_put(memcg);
-		return ret;
-	}
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
 
 	sc.nid = nid;
 	sc.memcg = memcg;
@@ -159,7 +149,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 
 	shrinker->scan_objects(shrinker, &sc);
 
-	up_read(&shrinker_rwsem);
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	mem_cgroup_put(memcg);
 
 	return size;
-- 
2.20.1



* [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
                   ` (4 preceding siblings ...)
  2023-03-07  6:56 ` [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
@ 2023-03-07  6:56 ` Qi Zheng
  2023-03-09  9:36   ` Vlastimil Babka
  2023-03-09 19:32   ` Kirill Tkhai
  2023-03-07  6:56 ` [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:56 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

For now, reparent_shrinker_deferred() is the only holder
of the read lock of shrinker_rwsem. And it already holds
the global cgroup_mutex, so it will not be called in
parallel.

Therefore, in order to convert shrinker_rwsem to shrinker_mutex
later, change it here to hold the write lock of shrinker_rwsem
while reparenting.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a5a3da5c8b5..7aaf6f94ac1b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -451,7 +451,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/* Prevent from concurrent shrinker_info expand */
-	down_read(&shrinker_rwsem);
+	down_write(&shrinker_rwsem);
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
@@ -460,7 +460,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
 	}
-	up_read(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
-- 
2.20.1



* [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
                   ` (5 preceding siblings ...)
  2023-03-07  6:56 ` [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
@ 2023-03-07  6:56 ` Qi Zheng
  2023-03-08 22:39   ` Kirill Tkhai
                     ` (2 more replies)
  2023-03-07  6:56 ` [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
  2023-03-07 22:20 ` [PATCH v4 0/8] make slab shrink lockless Andrew Morton
  8 siblings, 3 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:56 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Now there are no readers of shrinker_rwsem, so
synchronize_shrinkers() does not need to hold the
write lock of shrinker_rwsem to wait for all running
shrinkers to complete; synchronize_srcu() is enough.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7aaf6f94ac1b..ac7ab4aa344f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
 /**
  * synchronize_shrinkers - Wait for all running shrinkers to complete.
  *
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory.
  */
 void synchronize_shrinkers(void)
 {
-	down_write(&shrinker_rwsem);
-	up_write(&shrinker_rwsem);
 	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 }
-- 
2.20.1



* [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
                   ` (6 preceding siblings ...)
  2023-03-07  6:56 ` [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
@ 2023-03-07  6:56 ` Qi Zheng
  2023-03-09  9:42   ` Vlastimil Babka
  2023-03-09 19:49   ` Kirill Tkhai
  2023-03-07 22:20 ` [PATCH v4 0/8] make slab shrink lockless Andrew Morton
  8 siblings, 2 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-07  6:56 UTC
  To: akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel, Qi Zheng

Now there are no readers of shrinker_rwsem, so we
can simply replace it with a mutex.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/md/dm-cache-metadata.c |  2 +-
 drivers/md/dm-thin-metadata.c  |  2 +-
 fs/super.c                     |  2 +-
 mm/shrinker_debug.c            | 14 +++++++-------
 mm/vmscan.c                    | 34 +++++++++++++++++-----------------
 5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs cmd root_lock).
-	 * - must take shrinker_rwsem without holding cmd->root_lock
+	 * - must take shrinker_mutex without holding cmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs pmd root_lock).
-	 * - must take shrinker_rwsem without holding pmd->root_lock
+	 * - must take shrinker_mutex without holding pmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 84332d5cb817..91a4037b1d95 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
  * If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
 static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 6aa7a7ec69da..b0f6aff372df 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
 #include <linux/memcontrol.h>
 
 /* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
 extern struct list_head shrinker_list;
 extern struct srcu_struct shrinker_srcu;
 
@@ -167,7 +167,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
 	char buf[128];
 	int id;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	/* debugfs isn't initialized yet, add debugfs entries later. */
 	if (!shrinker_debugfs_root)
@@ -210,7 +210,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
 	if (!new)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 
 	old = shrinker->name;
 	shrinker->name = new;
@@ -228,7 +228,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
 			shrinker->debugfs_entry = entry;
 	}
 
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	kfree_const(old);
 
@@ -240,7 +240,7 @@ struct dentry *shrinker_debugfs_remove(struct shrinker *shrinker)
 {
 	struct dentry *entry = shrinker->debugfs_entry;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	kfree_const(shrinker->name);
 	shrinker->name = NULL;
@@ -265,14 +265,14 @@ static int __init shrinker_debugfs_init(void)
 	shrinker_debugfs_root = dentry;
 
 	/* Create debugfs entries for shrinkers registered at boot */
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_for_each_entry(shrinker, &shrinker_list, list)
 		if (!shrinker->debugfs_entry) {
 			ret = shrinker_debugfs_add(shrinker);
 			if (ret)
 				break;
 		}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	return ret;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ac7ab4aa344f..c00302fabc3d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
 #include <linux/cpuset.h>
 #include <linux/compaction.h>
 #include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -202,7 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
 }
 
 LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);
 DEFINE_SRCU(shrinker_srcu);
 static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
 
@@ -225,7 +225,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 {
 	return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
 				      &shrinker_srcu,
-				      lockdep_is_held(&shrinker_rwsem));
+				      lockdep_is_held(&shrinker_mutex));
 }
 
 static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
@@ -310,7 +310,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 	int nid, size, ret = 0;
 	int map_size, defer_size = 0;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	map_size = shrinker_map_size(shrinker_nr_max);
 	defer_size = shrinker_defer_size(shrinker_nr_max);
 	size = map_size + defer_size;
@@ -326,7 +326,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 		info->map_nr_max = shrinker_nr_max;
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
 	}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	return ret;
 }
@@ -342,7 +342,7 @@ static int expand_shrinker_info(int new_id)
 	if (!root_mem_cgroup)
 		goto out;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	map_size = shrinker_map_size(new_nr_max);
 	defer_size = shrinker_defer_size(new_nr_max);
@@ -392,7 +392,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	if (mem_cgroup_disabled())
 		return -ENOSYS;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -406,7 +406,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	shrinker->id = id;
 	ret = 0;
 unlock:
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 	return ret;
 }
 
@@ -416,7 +416,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
 
 	BUG_ON(id < 0);
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	idr_remove(&shrinker_idr, id);
 }
@@ -451,7 +451,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/* Prevent from concurrent shrinker_info expand */
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
@@ -460,7 +460,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
 	}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
@@ -709,9 +709,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 	shrinker->name = NULL;
 #endif
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		down_write(&shrinker_rwsem);
+		mutex_lock(&shrinker_mutex);
 		unregister_memcg_shrinker(shrinker);
-		up_write(&shrinker_rwsem);
+		mutex_unlock(&shrinker_mutex);
 		return;
 	}
 
@@ -721,11 +721,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 
 void register_shrinker_prepared(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 }
 
 static int __register_shrinker(struct shrinker *shrinker)
@@ -775,13 +775,13 @@ void unregister_shrinker(struct shrinker *shrinker)
 	if (!(shrinker->flags & SHRINKER_REGISTERED))
 		return;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_del_rcu(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
 	debugfs_entry = shrinker_debugfs_remove(shrinker);
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
-- 
2.20.1



* Re: [PATCH v4 0/8] make slab shrink lockless
  2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
                   ` (7 preceding siblings ...)
  2023-03-07  6:56 ` [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
@ 2023-03-07 22:20 ` Andrew Morton
  2023-03-08 11:59   ` Qi Zheng
  8 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2023-03-07 22:20 UTC
  To: Qi Zheng
  Cc: tkhai, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, rppt, sultan, dave, penguin-kernel, paulmck,
	linux-mm, linux-kernel

On Tue,  7 Mar 2023 14:55:57 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:

> Hi all,
> 
> This patch series aims to make slab shrink lockless.

The v3 discussion did contain requests for some sort of measurements of
real-world workloads.  And Kirill did suggest a workload which could be
used for this measurement.

It's quite important that we have this info, please.  I mean, speeding
up real-world workloads is the entire point of the patchset and without
measurements, we don't know if the patchset achieves its primary
objective!


> 3. Reproduction and testing
> ===========================
> 
> We can reproduce the down_read_trylock() hotspot through the following script:
> 
> ```
> #!/bin/bash
> 
> DIR="/root/shrinker/memcg/mnt"
> 
> do_create()
> {
>     mkdir -p /sys/fs/cgroup/memory/test
>     mkdir -p /sys/fs/cgroup/perf_event/test
>     echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>     for i in `seq 0 $1`;
>     do
>         mkdir -p /sys/fs/cgroup/memory/test/$i;
>         echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>         echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>         mkdir -p $DIR/$i;
>     done
> }
> 
> do_mount()
> {
>     for i in `seq $1 $2`;
>     do
>         mount -t tmpfs $i $DIR/$i;
>     done
> }
> 
> do_touch()
> {
>     for i in `seq $1 $2`;
>     do
>         echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>         echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>             dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>     done
> }
> 
> case "$1" in
>   touch)
>     do_touch $2 $3
>     ;;
>   test)
>       do_create 4000
>     do_mount 0 4000
>     do_touch 0 3000
>     ;;
>   *)
>     exit 1
>     ;;
> esac
> ```
> 
> Save the above script, then run test and touch commands. Then we can use the
> following perf command to view hotspots:

Well.  Simply running

	time that-script

before and after and including the info in the changelog would be a start?


* Re: [PATCH v4 0/8] make slab shrink lockless
  2023-03-07 22:20 ` [PATCH v4 0/8] make slab shrink lockless Andrew Morton
@ 2023-03-08 11:59   ` Qi Zheng
  0 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-08 11:59 UTC
  To: Andrew Morton
  Cc: tkhai, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
	david, shy828301, rppt, sultan, dave, penguin-kernel, paulmck,
	linux-mm, linux-kernel


Hi Andrew,

On 2023/3/8 06:20, Andrew Morton wrote:
> On Tue,  7 Mar 2023 14:55:57 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
>> Hi all,
>>
>> This patch series aims to make slab shrink lockless.
> 
> The v3 discussion did contain requests for some sort of measurements of
> real-world workloads.  And Kirill did suggest a workload which could be
> used for this measurement.
> 
> It's quite important that we have this info, please.  I mean, speeding
> up real-world workloads is the entire point of the patchset and without
> measurements, we don't know if the patchset achieves its primary
> objective!

I agree with this.

For the down_read_trylock() hotspot problem I encountered, I
posted a reproduction program in the cover letter, and measured the
change in IPC before and after applying the patchset.

For the case mentioned by Kirill, theoretically there is no competition
between slab shrink and register_shrinker() after applying this
patchset. But I haven't found a way to reproduce it yet; I will
keep trying to do so.

> 
> 
>> 3. Reproduction and testing
>> ===========================
>>
>> We can reproduce the down_read_trylock() hotspot through the following script:
>>
>> ```
>> #!/bin/bash
>>
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>>      mkdir -p /sys/fs/cgroup/memory/test
>>      mkdir -p /sys/fs/cgroup/perf_event/test
>>      echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>>      for i in `seq 0 $1`;
>>      do
>>          mkdir -p /sys/fs/cgroup/memory/test/$i;
>>          echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>          echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>>          mkdir -p $DIR/$i;
>>      done
>> }
>>
>> do_mount()
>> {
>>      for i in `seq $1 $2`;
>>      do
>>          mount -t tmpfs $i $DIR/$i;
>>      done
>> }
>>
>> do_touch()
>> {
>>      for i in `seq $1 $2`;
>>      do
>>          echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>          echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>>              dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>>      done
>> }
>>
>> case "$1" in
>>    touch)
>>      do_touch $2 $3
>>      ;;
>>    test)
>>        do_create 4000
>>      do_mount 0 4000
>>      do_touch 0 3000
>>      ;;
>>    *)
>>      exit 1
>>      ;;
>> esac
>> ```
>>
>> Save the above script, then run test and touch commands. Then we can use the
>> following perf command to view hotspots:
> 
> Well.  Simply running
> 
> 	time that-script

I tried this, but the script process gets killed because of OOM, so
the measured time is not accurate.

I will continue to try to measure more data besides IPC.

Thanks,
Qi

> 
> before and after and including the info in the changelog would be a start?



* Re: [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-03-07  6:55 ` [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
@ 2023-03-08 14:40   ` Vlastimil Babka
  2023-03-08 22:13   ` Kirill Tkhai
  1 sibling, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-08 14:40 UTC
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel



On 3/7/23 07:55, Qi Zheng wrote:
> To prepare for the subsequent lockless memcg slab shrink,
> add a map_nr_max field to struct shrinker_info to record
> its own real shrinker_nr_max.
> 
> Suggested-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/memcontrol.h |  1 +
>  mm/vmscan.c                | 41 ++++++++++++++++++++++----------------
>  2 files changed, 25 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b6eda2ab205d..aa69ea98e2d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -97,6 +97,7 @@ struct shrinker_info {
>  	struct rcu_head rcu;
>  	atomic_long_t *nr_deferred;
>  	unsigned long *map;
> +	int map_nr_max;
>  };
>  
>  struct lruvec_stats_percpu {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9414226218f0..2dcc01682026 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>  					 lockdep_is_held(&shrinker_rwsem));
>  }
>  
> +static inline bool need_expand(int new_nr_max, int old_nr_max)
> +{
> +	return round_up(new_nr_max, BITS_PER_LONG) >
> +	       round_up(old_nr_max, BITS_PER_LONG);
> +}
> +
>  static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  				    int map_size, int defer_size,
> -				    int old_map_size, int old_defer_size)
> +				    int old_map_size, int old_defer_size,
> +				    int new_nr_max)
>  {
>  	struct shrinker_info *new, *old;
>  	struct mem_cgroup_per_node *pn;
> @@ -240,12 +247,17 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  		if (!old)
>  			return 0;
>  
> +		/* Already expanded this shrinker_info */
> +		if (!need_expand(new_nr_max, old->map_nr_max))
> +			continue;
> +
>  		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>  		if (!new)
>  			return -ENOMEM;
>  
>  		new->nr_deferred = (atomic_long_t *)(new + 1);
>  		new->map = (void *)new->nr_deferred + defer_size;
> +		new->map_nr_max = new_nr_max;
>  
>  		/* map: set all old bits, clear all new bits */
>  		memset(new->map, (int)0xff, old_map_size);
> @@ -295,6 +307,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  		}
>  		info->nr_deferred = (atomic_long_t *)(info + 1);
>  		info->map = (void *)info->nr_deferred + defer_size;
> +		info->map_nr_max = shrinker_nr_max;
>  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>  	}
>  	up_write(&shrinker_rwsem);
> @@ -302,23 +315,14 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> -static inline bool need_expand(int nr_max)
> -{
> -	return round_up(nr_max, BITS_PER_LONG) >
> -	       round_up(shrinker_nr_max, BITS_PER_LONG);
> -}
> -
>  static int expand_shrinker_info(int new_id)
>  {
>  	int ret = 0;
> -	int new_nr_max = new_id + 1;
> +	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
>  	int map_size, defer_size = 0;
>  	int old_map_size, old_defer_size = 0;
>  	struct mem_cgroup *memcg;
>  
> -	if (!need_expand(new_nr_max))
> -		goto out;
> -
>  	if (!root_mem_cgroup)
>  		goto out;
>  
> @@ -332,7 +336,8 @@ static int expand_shrinker_info(int new_id)
>  	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  	do {
>  		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> -					       old_map_size, old_defer_size);
> +					       old_map_size, old_defer_size,
> +					       new_nr_max);
>  		if (ret) {
>  			mem_cgroup_iter_break(NULL, memcg);
>  			goto out;
> @@ -352,9 +357,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>  
>  		rcu_read_lock();
>  		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> -		/* Pairs with smp mb in shrink_slab() */
> -		smp_mb__before_atomic();
> -		set_bit(shrinker_id, info->map);
> +		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> +			/* Pairs with smp mb in shrink_slab() */
> +			smp_mb__before_atomic();
> +			set_bit(shrinker_id, info->map);
> +		}
>  		rcu_read_unlock();
>  	}
>  }
> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  	for_each_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
> -		for (i = 0; i < shrinker_nr_max; i++) {
> +		for (i = 0; i < child_info->map_nr_max; i++) {
>  			nr = atomic_long_read(&child_info->nr_deferred[i]);
>  			atomic_long_add(nr, &parent_info->nr_deferred[i]);
>  		}
> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (unlikely(!info))
>  		goto unlock;
>  
> -	for_each_set_bit(i, info->map, shrinker_nr_max) {
> +	for_each_set_bit(i, info->map, info->map_nr_max) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless
  2023-03-07  6:55 ` [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless Qi Zheng
@ 2023-03-08 15:02   ` Vlastimil Babka
  2023-03-08 22:18   ` Kirill Tkhai
  1 sibling, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-08 15:02 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:55, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in
> shrinkers subsystem, which protects most operations
> such as slab shrink, registration and unregistration
> of shrinkers, etc. This can easily cause problems in
> the following cases.
> 
> 1) When the memory pressure is high and there are many
>    filesystems mounted or unmounted at the same time,
>    slab shrink will be affected (down_read_trylock()
>    failed).
> 
>    Such as the real workload mentioned by Kirill Tkhai:
> 
>    ```
>    One of the real workloads from my experience is start
>    of an overcommitted node containing many starting
>    containers after node crash (or many resuming containers
>    after reboot for kernel update). In these cases memory
>    pressure is huge, and the node goes round in long reclaim.
>    ```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>    in [1]) and a writer comes in (such as mount a fs),
>    then this writer will be blocked and cause all
>    subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there
> may still be a problem. If we have a long shrinker list
> and we do not reclaim enough memory with each shrinker,
> then the down_read_trylock() may be called with high
> frequency. Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop
> in IPC (instructions per cycle).
> 
> So many times in history ([2],[3],[4],[5]), some people
> wanted to replace shrinker_rwsem trylock with SRCU in
> the slab shrink, but all these patches were abandoned
> because SRCU was not unconditionally enabled.
> 
> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> the SRCU is unconditionally enabled. So it's time to use
> SRCU to protect readers who previously held shrinker_rwsem.
> 
> This commit uses SRCU to make global slab shrink lockless,
> the memcg slab shrink is handled in the subsequent patch.
> 
> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-03-07  6:55 ` [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
  2023-03-08 14:40   ` Vlastimil Babka
@ 2023-03-08 22:13   ` Kirill Tkhai
  2023-03-09  6:33     ` Qi Zheng
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-08 22:13 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

Hi,

On 07.03.2023 09:55, Qi Zheng wrote:
> To prepare for the subsequent lockless memcg slab shrink,
> add a map_nr_max field to struct shrinker_info to record
> its own real shrinker_nr_max.
> 
> Suggested-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/memcontrol.h |  1 +
>  mm/vmscan.c                | 41 ++++++++++++++++++++++----------------
>  2 files changed, 25 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b6eda2ab205d..aa69ea98e2d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -97,6 +97,7 @@ struct shrinker_info {
>  	struct rcu_head rcu;
>  	atomic_long_t *nr_deferred;
>  	unsigned long *map;
> +	int map_nr_max;
>  };
>  
>  struct lruvec_stats_percpu {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9414226218f0..2dcc01682026 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>  					 lockdep_is_held(&shrinker_rwsem));
>  }
>  
> +static inline bool need_expand(int new_nr_max, int old_nr_max)
> +{
> +	return round_up(new_nr_max, BITS_PER_LONG) >
> +	       round_up(old_nr_max, BITS_PER_LONG);
> +}
> +
>  static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  				    int map_size, int defer_size,
> -				    int old_map_size, int old_defer_size)
> +				    int old_map_size, int old_defer_size,
> +				    int new_nr_max)
>  {
>  	struct shrinker_info *new, *old;
>  	struct mem_cgroup_per_node *pn;
> @@ -240,12 +247,17 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  		if (!old)
>  			return 0;
>  
> +		/* Already expanded this shrinker_info */
> +		if (!need_expand(new_nr_max, old->map_nr_max))

need_expand() looks confusing here. It's strange that we round_up(old->map_nr_max),
even though old->map may never exceed old->map_nr_max.

Won't plain

	if (new_nr_max <= old->map_nr_max)

look clearer here?

The rest of the patch looks OK to me.
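
For illustration, a sketch of how the simplified check would read,
assuming (as the patch arranges in expand_shrinker_info()) that
new_nr_max is always pre-rounded to BITS_PER_LONG, so every stored
map_nr_max is such a rounded value:

```
		/* Already expanded this shrinker_info: both values are
		 * multiples of BITS_PER_LONG, so a plain comparison
		 * suffices and need_expand() can be dropped entirely.
		 */
		if (new_nr_max <= old->map_nr_max)
			continue;
```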

> +			continue;
> +
>  		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>  		if (!new)
>  			return -ENOMEM;
>  
>  		new->nr_deferred = (atomic_long_t *)(new + 1);
>  		new->map = (void *)new->nr_deferred + defer_size;
> +		new->map_nr_max = new_nr_max;
>  
>  		/* map: set all old bits, clear all new bits */
>  		memset(new->map, (int)0xff, old_map_size);
> @@ -295,6 +307,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  		}
>  		info->nr_deferred = (atomic_long_t *)(info + 1);
>  		info->map = (void *)info->nr_deferred + defer_size;
> +		info->map_nr_max = shrinker_nr_max;
>  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>  	}
>  	up_write(&shrinker_rwsem);
> @@ -302,23 +315,14 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> -static inline bool need_expand(int nr_max)
> -{
> -	return round_up(nr_max, BITS_PER_LONG) >
> -	       round_up(shrinker_nr_max, BITS_PER_LONG);
> -}
> -
>  static int expand_shrinker_info(int new_id)
>  {
>  	int ret = 0;
> -	int new_nr_max = new_id + 1;
> +	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
>  	int map_size, defer_size = 0;
>  	int old_map_size, old_defer_size = 0;
>  	struct mem_cgroup *memcg;
>  
> -	if (!need_expand(new_nr_max))
> -		goto out;
> -
>  	if (!root_mem_cgroup)
>  		goto out;
>  
> @@ -332,7 +336,8 @@ static int expand_shrinker_info(int new_id)
>  	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  	do {
>  		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> -					       old_map_size, old_defer_size);
> +					       old_map_size, old_defer_size,
> +					       new_nr_max);
>  		if (ret) {
>  			mem_cgroup_iter_break(NULL, memcg);
>  			goto out;
> @@ -352,9 +357,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>  
>  		rcu_read_lock();
>  		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> -		/* Pairs with smp mb in shrink_slab() */
> -		smp_mb__before_atomic();
> -		set_bit(shrinker_id, info->map);
> +		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
> +			/* Pairs with smp mb in shrink_slab() */
> +			smp_mb__before_atomic();
> +			set_bit(shrinker_id, info->map);
> +		}
>  		rcu_read_unlock();
>  	}
>  }
> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  	for_each_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
> -		for (i = 0; i < shrinker_nr_max; i++) {
> +		for (i = 0; i < child_info->map_nr_max; i++) {
>  			nr = atomic_long_read(&child_info->nr_deferred[i]);
>  			atomic_long_add(nr, &parent_info->nr_deferred[i]);
>  		}
> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (unlikely(!info))
>  		goto unlock;
>  
> -	for_each_set_bit(i, info->map, shrinker_nr_max) {
> +	for_each_set_bit(i, info->map, info->map_nr_max) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless
  2023-03-07  6:55 ` [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless Qi Zheng
  2023-03-08 15:02   ` Vlastimil Babka
@ 2023-03-08 22:18   ` Kirill Tkhai
  1 sibling, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-08 22:18 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 07.03.2023 09:55, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in
> shrinkers subsystem, which protects most operations
> such as slab shrink, registration and unregistration
> of shrinkers, etc. This can easily cause problems in
> the following cases.
> 
> 1) When the memory pressure is high and there are many
>    filesystems mounted or unmounted at the same time,
>    slab shrink will be affected (down_read_trylock()
>    failed).
> 
>    Such as the real workload mentioned by Kirill Tkhai:
> 
>    ```
>    One of the real workloads from my experience is start
>    of an overcommitted node containing many starting
>    containers after node crash (or many resuming containers
>    after reboot for kernel update). In these cases memory
>    pressure is huge, and the node goes round in long reclaim.
>    ```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>    in [1]) and a writer comes in (such as mount a fs),
>    then this writer will be blocked and cause all
>    subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there
> may still be a problem. If we have a long shrinker list
> and we do not reclaim enough memory with each shrinker,
> then the down_read_trylock() may be called with high
> frequency. Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop
> in IPC (instructions per cycle).
> 
> So many times in history ([2],[3],[4],[5]), some people
> wanted to replace shrinker_rwsem trylock with SRCU in
> the slab shrink, but all these patches were abandoned
> because SRCU was not unconditionally enabled.
> 
> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
> the SRCU is unconditionally enabled. So it's time to use
> SRCU to protect readers who previously held shrinker_rwsem.
> 
> This commit uses SRCU to make global slab shrink lockless,
> the memcg slab shrink is handled in the subsequent patch.
> 
> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
> [3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
> [4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
> [5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Kirill Tkhai <tkhai@ya.ru>
> ---
>  mm/vmscan.c | 27 +++++++++++----------------
>  1 file changed, 11 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2dcc01682026..8515ac40bcaf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>  
>  LIST_HEAD(shrinker_list);
>  DECLARE_RWSEM(shrinker_rwsem);
> +DEFINE_SRCU(shrinker_srcu);
>  
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>  void register_shrinker_prepared(struct shrinker *shrinker)
>  {
>  	down_write(&shrinker_rwsem);
> -	list_add_tail(&shrinker->list, &shrinker_list);
> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>  	shrinker->flags |= SHRINKER_REGISTERED;
>  	shrinker_debugfs_add(shrinker);
>  	up_write(&shrinker_rwsem);
> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>  		return;
>  
>  	down_write(&shrinker_rwsem);
> -	list_del(&shrinker->list);
> +	list_del_rcu(&shrinker->list);
>  	shrinker->flags &= ~SHRINKER_REGISTERED;
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		unregister_memcg_shrinker(shrinker);
>  	debugfs_entry = shrinker_debugfs_remove(shrinker);
>  	up_write(&shrinker_rwsem);
>  
> +	synchronize_srcu(&shrinker_srcu);
> +
>  	debugfs_remove_recursive(debugfs_entry);
>  
>  	kfree(shrinker->nr_deferred);
> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>  {
>  	down_write(&shrinker_rwsem);
>  	up_write(&shrinker_rwsem);
> +	synchronize_srcu(&shrinker_srcu);
>  }
>  EXPORT_SYMBOL(synchronize_shrinkers);
>  
> @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  {
>  	unsigned long ret, freed = 0;
>  	struct shrinker *shrinker;
> +	int srcu_idx;
>  
>  	/*
>  	 * The root memcg might be allocated even though memcg is disabled
> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>  
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
> +				 srcu_read_lock_held(&shrinker_srcu)) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> -		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> -		 */
> -		if (rwsem_is_contended(&shrinker_rwsem)) {
> -			freed = freed ? : 1;
> -			break;
> -		}
>  	}
>  
> -	up_read(&shrinker_rwsem);
> -out:
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  	cond_resched();
>  	return freed;
>  }


^ permalink raw reply	[flat|nested] 36+ messages in thread
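
Distilled from the diff above, the locking pattern patch 2/8 moves to
looks roughly like this; an illustrative sketch, not the literal kernel
code:

```
/* Reader side (shrink_slab()): replaces down_read_trylock(), so
 * reclaim can no longer skip all shrinkers just because the lock
 * is contended.
 */
int srcu_idx = srcu_read_lock(&shrinker_srcu);
list_for_each_entry_srcu(shrinker, &shrinker_list, list,
			 srcu_read_lock_held(&shrinker_srcu))
	freed += do_shrink_slab(&sc, shrinker, priority);
srcu_read_unlock(&shrinker_srcu, srcu_idx);

/* Updater side (unregister_shrinker()): list mutation is still
 * serialized by shrinker_rwsem; the synchronize_srcu() then waits
 * for every in-flight reader before the shrinker's memory is freed.
 */
down_write(&shrinker_rwsem);
list_del_rcu(&shrinker->list);
up_write(&shrinker_rwsem);
synchronize_srcu(&shrinker_srcu);
kfree(shrinker->nr_deferred);
```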

* Re: [PATCH v4 3/8] mm: vmscan: make memcg slab shrink lockless
  2023-03-07  6:56 ` [PATCH v4 3/8] mm: vmscan: make memcg " Qi Zheng
@ 2023-03-08 22:23   ` Kirill Tkhai
  2023-03-08 22:46   ` Vlastimil Babka
  1 sibling, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-08 22:23 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 07.03.2023 09:56, Qi Zheng wrote:
> Like global slab shrink, this commit also uses SRCU to make
> memcg slab shrink lockless.
> 
> We can reproduce the down_read_trylock() hotspot through the
> following script:
> 
> ```
> 
> DIR="/root/shrinker/memcg/mnt"
> 
> do_create()
> {
>     mkdir -p /sys/fs/cgroup/memory/test
>     mkdir -p /sys/fs/cgroup/perf_event/test
>     echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>     for i in `seq 0 $1`;
>     do
>         mkdir -p /sys/fs/cgroup/memory/test/$i;
>         echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>         echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>         mkdir -p $DIR/$i;
>     done
> }
> 
> do_mount()
> {
>     for i in `seq $1 $2`;
>     do
>         mount -t tmpfs $i $DIR/$i;
>     done
> }
> 
> do_touch()
> {
>     for i in `seq $1 $2`;
>     do
>         echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>         echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>         dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>     done
> }
> 
> case "$1" in
>   touch)
>     do_touch $2 $3
>     ;;
>   test)
>     do_create 4000
>     do_mount 0 4000
>     do_touch 0 3000
>     ;;
>   *)
>     exit 1
>     ;;
> esac
> ```
> 
> Save the above script, then run test and touch commands.
> Then we can use the following perf command to view hotspots:
> 
> perf top -U -F 999
> 
> 1) Before applying this patchset:
> 
>   32.31%  [kernel]           [k] down_read_trylock
>   19.40%  [kernel]           [k] pv_native_safe_halt
>   16.24%  [kernel]           [k] up_read
>   15.70%  [kernel]           [k] shrink_slab
>    4.69%  [kernel]           [k] _find_next_bit
>    2.62%  [kernel]           [k] shrink_node
>    1.78%  [kernel]           [k] shrink_lruvec
>    0.76%  [kernel]           [k] do_shrink_slab
> 
> 2) After applying this patchset:
> 
>   27.83%  [kernel]           [k] _find_next_bit
>   16.97%  [kernel]           [k] shrink_slab
>   15.82%  [kernel]           [k] pv_native_safe_halt
>    9.58%  [kernel]           [k] shrink_node
>    8.31%  [kernel]           [k] shrink_lruvec
>    5.64%  [kernel]           [k] do_shrink_slab
>    3.88%  [kernel]           [k] mem_cgroup_iter
> 
> At the same time, we use the following perf command to capture
> IPC information:
> 
> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
> 
> 1) Before applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>       454187219766      cycles                    test                    ( +-  1.84% )
>        78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
> 
>         10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
> 
> 2) After applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>       841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
>       527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
> 
>           10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
> 
> We can see that IPC drops severely when down_read_trylock()
> is called at high frequency. After switching to SRCU, the
> IPC returns to a normal level.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Kirill Tkhai <tkhai@ya.ru>

> ---
>  mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
>  1 file changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8515ac40bcaf..1de9bc3e5aa2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
> +#include <linux/srcu.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
>  static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>  						     int nid)
>  {
> -	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> -					 lockdep_is_held(&shrinker_rwsem));
> +	return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
> +				      &shrinker_srcu,
> +				      lockdep_is_held(&shrinker_rwsem));
> +}
> +
> +static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
> +						     int nid)
> +{
> +	return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
> +				&shrinker_srcu);
> +}
> +
> +static void free_shrinker_info_rcu(struct rcu_head *head)
> +{
> +	kvfree(container_of(head, struct shrinker_info, rcu));
>  }
>  
>  static inline bool need_expand(int new_nr_max, int old_nr_max)
> @@ -269,7 +283,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>  		       defer_size - old_defer_size);
>  
>  		rcu_assign_pointer(pn->shrinker_info, new);
> -		kvfree_rcu(old, rcu);
> +		call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
>  	}
>  
>  	return 0;
> @@ -355,15 +369,16 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>  {
>  	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
>  		struct shrinker_info *info;
> +		int srcu_idx;
>  
> -		rcu_read_lock();
> -		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> +		srcu_idx = srcu_read_lock(&shrinker_srcu);
> +		info = shrinker_info_srcu(memcg, nid);
>  		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
>  			/* Pairs with smp mb in shrink_slab() */
>  			smp_mb__before_atomic();
>  			set_bit(shrinker_id, info->map);
>  		}
> -		rcu_read_unlock();
> +		srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  	}
>  }
>  
> @@ -377,7 +392,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  		return -ENOSYS;
>  
>  	down_write(&shrinker_rwsem);
> -	/* This may call shrinker, so it must use down_read_trylock() */
>  	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
>  	if (id < 0)
>  		goto unlock;
> @@ -411,7 +425,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
>  {
>  	struct shrinker_info *info;
>  
> -	info = shrinker_info_protected(memcg, nid);
> +	info = shrinker_info_srcu(memcg, nid);
>  	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
>  }
>  
> @@ -420,7 +434,7 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
>  {
>  	struct shrinker_info *info;
>  
> -	info = shrinker_info_protected(memcg, nid);
> +	info = shrinker_info_srcu(memcg, nid);
>  	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
>  }
>  
> @@ -898,15 +912,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  {
>  	struct shrinker_info *info;
>  	unsigned long ret, freed = 0;
> +	int srcu_idx;
>  	int i;
>  
>  	if (!mem_cgroup_online(memcg))
>  		return 0;
>  
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		return 0;
> -
> -	info = shrinker_info_protected(memcg, nid);
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
> +	info = shrinker_info_srcu(memcg, nid);
>  	if (unlikely(!info))
>  		goto unlock;
>  
> @@ -956,14 +969,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  				set_shrinker_bit(memcg, nid, i);
>  		}
>  		freed += ret;
> -
> -		if (rwsem_is_contended(&shrinker_rwsem)) {
> -			freed = freed ? : 1;
> -			break;
> -		}
>  	}
>  unlock:
> -	up_read(&shrinker_rwsem);
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  	return freed;
>  }
>  #else /* CONFIG_MEMCG */


^ permalink raw reply	[flat|nested] 36+ messages in thread
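
The memcg side mirrors the global conversion; the two pieces that make
it safe, distilled from the diff above as a sketch rather than the
literal code:

```
/* 1) Lookups run under SRCU instead of shrinker_rwsem: */
static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
						int nid)
{
	return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
				&shrinker_srcu);
}

/* 2) When expand_one_shrinker_info() swaps in a bigger info, the old
 * one is freed only after an SRCU grace period, so a reader that
 * already fetched the old pointer can keep using its map and
 * nr_deferred arrays safely:
 */
rcu_assign_pointer(pn->shrinker_info, new);
call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
```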

* Re: [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-07  6:56 ` [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
@ 2023-03-08 22:39   ` Kirill Tkhai
  2023-03-09  7:06     ` Qi Zheng
  2023-03-09  9:40   ` Vlastimil Babka
  2023-03-09 19:34   ` Kirill Tkhai
  2 siblings, 1 reply; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-08 22:39 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 07.03.2023 09:56, Qi Zheng wrote:
> Now there are no readers of shrinker_rwsem, so
> synchronize_shrinkers() does not need to hold the
> writer of shrinker_rwsem to wait for all running
> shrinkers to complete; synchronize_srcu() is enough.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/vmscan.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7aaf6f94ac1b..ac7ab4aa344f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
>  /**
>   * synchronize_shrinkers - Wait for all running shrinkers to complete.
>   *
> - * This is equivalent to calling unregister_shrink() and register_shrinker(),
> - * but atomically and with less overhead. This is useful to guarantee that all
> - * shrinker invocations have seen an update, before freeing memory, similar to
> - * rcu.
> + * This is useful to guarantee that all shrinker invocations have seen an
> + * update, before freeing memory.
>   */
>  void synchronize_shrinkers(void)
>  {
> -	down_write(&shrinker_rwsem);
> -	up_write(&shrinker_rwsem);
>  	atomic_inc(&shrinker_srcu_generation);
>  	synchronize_srcu(&shrinker_srcu);
>  }

Just curious: don't callers of synchronize_shrinkers() want parallel register_shrinker() and unregister_shrinker() calls to have completed?
Here we should only wait for parallel shrink_slab(), correct?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 3/8] mm: vmscan: make memcg slab shrink lockless
  2023-03-07  6:56 ` [PATCH v4 3/8] mm: vmscan: make memcg " Qi Zheng
  2023-03-08 22:23   ` Kirill Tkhai
@ 2023-03-08 22:46   ` Vlastimil Babka
  2023-03-09  6:47     ` Qi Zheng
  1 sibling, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-08 22:46 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:56, Qi Zheng wrote:
> Like global slab shrink, this commit also uses SRCU to make
> memcg slab shrink lockless.
> 
> We can reproduce the down_read_trylock() hotspot through the
> following script:
> 
> ```
> 
> DIR="/root/shrinker/memcg/mnt"
> 
> do_create()
> {
>     mkdir -p /sys/fs/cgroup/memory/test
>     mkdir -p /sys/fs/cgroup/perf_event/test
>     echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>     for i in `seq 0 $1`;
>     do
>         mkdir -p /sys/fs/cgroup/memory/test/$i;
>         echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>         echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>         mkdir -p $DIR/$i;
>     done
> }
> 
> do_mount()
> {
>     for i in `seq $1 $2`;
>     do
>         mount -t tmpfs $i $DIR/$i;
>     done
> }
> 
> do_touch()
> {
>     for i in `seq $1 $2`;
>     do
>         echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>         echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>         dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>     done
> }
> 
> case "$1" in
>   touch)
>     do_touch $2 $3
>     ;;
>   test)
>     do_create 4000
>     do_mount 0 4000
>     do_touch 0 3000
>     ;;
>   *)
>     exit 1
>     ;;
> esac
> ```
> 
> Save the above script, then run test and touch commands.
> Then we can use the following perf command to view hotspots:
> 
> perf top -U -F 999
> 
> 1) Before applying this patchset:
> 
>   32.31%  [kernel]           [k] down_read_trylock
>   19.40%  [kernel]           [k] pv_native_safe_halt
>   16.24%  [kernel]           [k] up_read
>   15.70%  [kernel]           [k] shrink_slab
>    4.69%  [kernel]           [k] _find_next_bit
>    2.62%  [kernel]           [k] shrink_node
>    1.78%  [kernel]           [k] shrink_lruvec
>    0.76%  [kernel]           [k] do_shrink_slab
> 
> 2) After applying this patchset:
> 
>   27.83%  [kernel]           [k] _find_next_bit
>   16.97%  [kernel]           [k] shrink_slab
>   15.82%  [kernel]           [k] pv_native_safe_halt
>    9.58%  [kernel]           [k] shrink_node
>    8.31%  [kernel]           [k] shrink_lruvec
>    5.64%  [kernel]           [k] do_shrink_slab
>    3.88%  [kernel]           [k] mem_cgroup_iter
> 
> At the same time, we use the following perf command to capture
> IPC information:
> 
> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
> 
> 1) Before applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>       454187219766      cycles                    test                    ( +-  1.84% )
>        78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
> 
>         10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
> 
> 2) After applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>       841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
>       527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
> 
>           10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
> 
> We can see that IPC drops severely when down_read_trylock()
> is called at high frequency. After switching to SRCU, the
> IPC returns to a normal level.

The interpretation looks somewhat weird to me. I'd say the workload is
stalled a lot as it fails the trylock (there might be some optimistic
spinning perhaps) and then goes to sleep. See how "pv_native_safe_halt" is
also more prominent in the "before" profile. And because of that sleeping,
fewer instructions are executed in the same amount of cycles (it's a
system-wide collection, otherwise it wouldn't include the sleeping processes).

> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Other than that:

Acked-by: Vlastimil Babka <vbabka@suse.cz>

A small thing below:

> ---
>  mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
>  1 file changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8515ac40bcaf..1de9bc3e5aa2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
> +#include <linux/srcu.h>

I guess this should have been in patch 2/8 already? It may work accidentally
because some other header pulls it transitively...



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info
  2023-03-08 22:13   ` Kirill Tkhai
@ 2023-03-09  6:33     ` Qi Zheng
  0 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-09  6:33 UTC (permalink / raw)
  To: Kirill Tkhai, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel



On 2023/3/9 06:13, Kirill Tkhai wrote:
> Hi,
> 
> On 07.03.2023 09:55, Qi Zheng wrote:
>> To prepare for the subsequent lockless memcg slab shrink,
>> add a map_nr_max field to struct shrinker_info to record
>> its own real shrinker_nr_max.
>>
>> Suggested-by: Kirill Tkhai <tkhai@ya.ru>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/memcontrol.h |  1 +
>>   mm/vmscan.c                | 41 ++++++++++++++++++++++----------------
>>   2 files changed, 25 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index b6eda2ab205d..aa69ea98e2d8 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>   	struct rcu_head rcu;
>>   	atomic_long_t *nr_deferred;
>>   	unsigned long *map;
>> +	int map_nr_max;
>>   };
>>   
>>   struct lruvec_stats_percpu {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 9414226218f0..2dcc01682026 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -224,9 +224,16 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>>   					 lockdep_is_held(&shrinker_rwsem));
>>   }
>>   
>> +static inline bool need_expand(int new_nr_max, int old_nr_max)
>> +{
>> +	return round_up(new_nr_max, BITS_PER_LONG) >
>> +	       round_up(old_nr_max, BITS_PER_LONG);
>> +}
>> +
>>   static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>   				    int map_size, int defer_size,
>> -				    int old_map_size, int old_defer_size)
>> +				    int old_map_size, int old_defer_size,
>> +				    int new_nr_max)
>>   {
>>   	struct shrinker_info *new, *old;
>>   	struct mem_cgroup_per_node *pn;
>> @@ -240,12 +247,17 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>   		if (!old)
>>   			return 0;
>>   
>> +		/* Already expanded this shrinker_info */
>> +		if (!need_expand(new_nr_max, old->map_nr_max))
> 
> need_expand() looks confusing here. It's strange that we round_up(old->map_nr_max),
> despite old->map never may exceed old->map_nr_max.
> 
> Won't plain
> 
> 	if (new_nr_max <= old->map_nr_max)
> 
> look clearer here?

Yeah, will change to it.

> 
> The rest in patch looks OK for me.
> 
>> +			continue;
>> +
>>   		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
>>   		if (!new)
>>   			return -ENOMEM;
>>   
>>   		new->nr_deferred = (atomic_long_t *)(new + 1);
>>   		new->map = (void *)new->nr_deferred + defer_size;
>> +		new->map_nr_max = new_nr_max;
>>   
>>   		/* map: set all old bits, clear all new bits */
>>   		memset(new->map, (int)0xff, old_map_size);
>> @@ -295,6 +307,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>   		}
>>   		info->nr_deferred = (atomic_long_t *)(info + 1);
>>   		info->map = (void *)info->nr_deferred + defer_size;
>> +		info->map_nr_max = shrinker_nr_max;
>>   		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>   	}
>>   	up_write(&shrinker_rwsem);
>> @@ -302,23 +315,14 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>   	return ret;
>>   }
>>   
>> -static inline bool need_expand(int nr_max)
>> -{
>> -	return round_up(nr_max, BITS_PER_LONG) >
>> -	       round_up(shrinker_nr_max, BITS_PER_LONG);
>> -}
>> -
>>   static int expand_shrinker_info(int new_id)
>>   {
>>   	int ret = 0;
>> -	int new_nr_max = new_id + 1;
>> +	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
>>   	int map_size, defer_size = 0;
>>   	int old_map_size, old_defer_size = 0;
>>   	struct mem_cgroup *memcg;
>>   
>> -	if (!need_expand(new_nr_max))
>> -		goto out;
>> -
>>   	if (!root_mem_cgroup)
>>   		goto out;
>>   
>> @@ -332,7 +336,8 @@ static int expand_shrinker_info(int new_id)
>>   	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>   	do {
>>   		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>> -					       old_map_size, old_defer_size);
>> +					       old_map_size, old_defer_size,
>> +					       new_nr_max);
>>   		if (ret) {
>>   			mem_cgroup_iter_break(NULL, memcg);
>>   			goto out;
>> @@ -352,9 +357,11 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
>>   
>>   		rcu_read_lock();
>>   		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
>> -		/* Pairs with smp mb in shrink_slab() */
>> -		smp_mb__before_atomic();
>> -		set_bit(shrinker_id, info->map);
>> +		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
>> +			/* Pairs with smp mb in shrink_slab() */
>> +			smp_mb__before_atomic();
>> +			set_bit(shrinker_id, info->map);
>> +		}
>>   		rcu_read_unlock();
>>   	}
>>   }
>> @@ -432,7 +439,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>   	for_each_node(nid) {
>>   		child_info = shrinker_info_protected(memcg, nid);
>>   		parent_info = shrinker_info_protected(parent, nid);
>> -		for (i = 0; i < shrinker_nr_max; i++) {
>> +		for (i = 0; i < child_info->map_nr_max; i++) {
>>   			nr = atomic_long_read(&child_info->nr_deferred[i]);
>>   			atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>   		}
>> @@ -899,7 +906,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>   	if (unlikely(!info))
>>   		goto unlock;
>>   
>> -	for_each_set_bit(i, info->map, shrinker_nr_max) {
>> +	for_each_set_bit(i, info->map, info->map_nr_max) {
>>   		struct shrink_control sc = {
>>   			.gfp_mask = gfp_mask,
>>   			.nid = nid,
> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 3/8] mm: vmscan: make memcg slab shrink lockless
  2023-03-08 22:46   ` Vlastimil Babka
@ 2023-03-09  6:47     ` Qi Zheng
  0 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-09  6:47 UTC (permalink / raw)
  To: Vlastimil Babka, akpm, tkhai, hannes, shakeelb, mhocko,
	roman.gushchin, muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

Hi Vlastimil,

On 2023/3/9 06:46, Vlastimil Babka wrote:
> On 3/7/23 07:56, Qi Zheng wrote:
>> Like global slab shrink, this commit also uses SRCU to make
>> memcg slab shrink lockless.
>>
>> We can reproduce the down_read_trylock() hotspot through the
>> following script:
>>
>> ```
>>
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>>      mkdir -p /sys/fs/cgroup/memory/test
>>      mkdir -p /sys/fs/cgroup/perf_event/test
>>      echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>>      for i in `seq 0 $1`;
>>      do
>>          mkdir -p /sys/fs/cgroup/memory/test/$i;
>>          echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>          echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>>          mkdir -p $DIR/$i;
>>      done
>> }
>>
>> do_mount()
>> {
>>      for i in `seq $1 $2`;
>>      do
>>          mount -t tmpfs $i $DIR/$i;
>>      done
>> }
>>
>> do_touch()
>> {
>>      for i in `seq $1 $2`;
>>      do
>>          echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>          echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>>          dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>>      done
>> }
>>
>> case "$1" in
>>    touch)
>>      do_touch $2 $3
>>      ;;
>>    test)
>>      do_create 4000
>>      do_mount 0 4000
>>      do_touch 0 3000
>>      ;;
>>    *)
>>      exit 1
>>      ;;
>> esac
>> ```
>>
>> Save the above script, then run test and touch commands.
>> Then we can use the following perf command to view hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>>    32.31%  [kernel]           [k] down_read_trylock
>>    19.40%  [kernel]           [k] pv_native_safe_halt
>>    16.24%  [kernel]           [k] up_read
>>    15.70%  [kernel]           [k] shrink_slab
>>     4.69%  [kernel]           [k] _find_next_bit
>>     2.62%  [kernel]           [k] shrink_node
>>     1.78%  [kernel]           [k] shrink_lruvec
>>     0.76%  [kernel]           [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>>    27.83%  [kernel]           [k] _find_next_bit
>>    16.97%  [kernel]           [k] shrink_slab
>>    15.82%  [kernel]           [k] pv_native_safe_halt
>>     9.58%  [kernel]           [k] shrink_node
>>     8.31%  [kernel]           [k] shrink_lruvec
>>     5.64%  [kernel]           [k] do_shrink_slab
>>     3.88%  [kernel]           [k] mem_cgroup_iter
>>
>> At the same time, we use the following perf command to capture
>> IPC information:
>>
>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>
>> 1) Before applying this patchset:
>>
>>   Performance counter stats for 'system wide' (5 runs):
>>
>>        454187219766      cycles                    test                    ( +-  1.84% )
>>         78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
>>
>>          10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
>>
>> 2) After applying this patchset:
>>
>>   Performance counter stats for 'system wide' (5 runs):
>>
>>        841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
>>        527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
>>
>>            10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
>>
>> We can see that IPC drops severely when down_read_trylock()
>> is called at high frequency. After switching to SRCU, the
>> IPC returns to a normal level.
> 
> The interpretation looks somewhat weird to me. I'd say the workload is
> stalled a lot as it fails the trylock (there might be some optimistic
> spinning perhaps) and then goes to sleep. See how "pv_native_safe_halt" is
> also more prominent in the "before" profile. And because of that sleeping,
> fewer instructions are executed in the same amount of cycles (it's a
> system-wide collection, otherwise it wouldn't include the sleeping processes).

But in my tests, the trylock basically did not fail, so I think the drop
is caused by the high-frequency atomic operations themselves.

bpftrace -e 'kr:down_read_trylock {@[kstack, retval]=count();} interval:s:1 {exit();}'

Attaching 2 probes...

<...>
@[
     shrink_slab+288
     shrink_node+640
     do_try_to_free_pages+203
     try_to_free_mem_cgroup_pages+266
     try_charge_memcg+412
     charge_memcg+51
     __mem_cgroup_charge+44
     __handle_mm_fault+2119
     handle_mm_fault+272
     do_user_addr_fault+712
     exc_page_fault+124
     asm_exc_page_fault+38
     clear_user_erms+14
     read_zero+86
     vfs_read+173
     ksys_read+93
     do_syscall_64+56
     entry_SYSCALL_64_after_hwframe+99
, 1]: 617019
@[
     shrink_slab+288
     shrink_node+640
     do_try_to_free_pages+203
     try_to_free_mem_cgroup_pages+266
     try_charge_memcg+412
     charge_memcg+51
     __mem_cgroup_charge+44
     shmem_add_to_page_cache+545
     shmem_get_folio_gfp+621
     shmem_write_begin+95
     generic_perform_write+257
     __generic_file_write_iter+202
     generic_file_write_iter+97
     vfs_write+704
     ksys_write+93
     do_syscall_64+56
     entry_SYSCALL_64_after_hwframe+99
, 1]: 617065

> 
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Other than that:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.

> 
> A small thing below:
> 
>> ---
>>   mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
>>   1 file changed, 27 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8515ac40bcaf..1de9bc3e5aa2 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -57,6 +57,7 @@
>>   #include <linux/khugepaged.h>
>>   #include <linux/rculist_nulls.h>
>>   #include <linux/random.h>
>> +#include <linux/srcu.h>
> 
> I guess this should have been in patch 2/8 already? It may work accidentally
> because some other header pulls it transitively...

Yeah, in fact patch 3/8 also compiles successfully without srcu.h,
but it is better to include this header file explicitly; I will
add it in patch 2/8.

Thanks,
Qi

> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-08 22:39   ` Kirill Tkhai
@ 2023-03-09  7:06     ` Qi Zheng
  2023-03-09  8:11       ` Christian König
  0 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2023-03-09  7:06 UTC (permalink / raw)
  To: Kirill Tkhai, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
	christian.koenig

Hi Kirill,

On 2023/3/9 06:39, Kirill Tkhai wrote:
> On 07.03.2023 09:56, Qi Zheng wrote:
>> Now there are no readers of shrinker_rwsem, so
>> synchronize_shrinkers() does not need to hold the
>> writer of shrinker_rwsem to wait for all running
>> shrinkers to complete; synchronize_srcu() is enough.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/vmscan.c | 8 ++------
>>   1 file changed, 2 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 7aaf6f94ac1b..ac7ab4aa344f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
>>   /**
>>    * synchronize_shrinkers - Wait for all running shrinkers to complete.
>>    *
>> - * This is equivalent to calling unregister_shrink() and register_shrinker(),
>> - * but atomically and with less overhead. This is useful to guarantee that all
>> - * shrinker invocations have seen an update, before freeing memory, similar to
>> - * rcu.
>> + * This is useful to guarantee that all shrinker invocations have seen an
>> + * update, before freeing memory.
>>    */
>>   void synchronize_shrinkers(void)
>>   {
>> -	down_write(&shrinker_rwsem);
>> -	up_write(&shrinker_rwsem);
>>   	atomic_inc(&shrinker_srcu_generation);
>>   	synchronize_srcu(&shrinker_srcu);
>>   }
> 
> Just curious: don't callers of synchronize_shrinkers() want parallel register_shrinker() and unregister_shrinker() calls to have completed?
> Here we should only wait for parallel shrink_slab(), correct?

I think yes.

The synchronize_shrinkers() is currently only used by the TTM pool.

In the TTM pool, a shrinker named "drm-ttm_pool" is registered, and
its scan_objects callback picks an entry from the pool's own shrinker_list:

ttm_pool_shrink
--> spin_lock(&shrinker_lock);
     pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
     list_move_tail(&pt->shrinker_list, &shrinker_list);
     spin_unlock(&shrinker_lock);

These entries have been removed from shrinker_list before calling
synchronize_shrinkers():

ttm_pool_fini
--> ttm_pool_type_fini
     --> spin_lock(&shrinker_lock);
	list_del(&pt->shrinker_list);
	spin_unlock(&shrinker_lock);
     synchronize_shrinkers

So IIUC, we only need to wait for the parallel shrink_slab() here. Like
its comment says:

/* We removed the pool types from the LRU, but we need to also make sure
  * that no shrinker is concurrently freeing pages from the pool.
  */

+ CC: Christian König :)

Thanks,
Qi

^ permalink raw reply	[flat|nested] 36+ messages in thread
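
Condensed from the call traces quoted above, the TTM teardown relies on
this ordering; a sketch assembled from those fragments, not the verbatim
driver code:

```
/* ttm_pool_type_fini(), sketched: once the pool type is off the
 * driver-private shrinker_list, only a shrinker that has already
 * entered ttm_pool_shrink() can still touch it.
 */
spin_lock(&shrinker_lock);
list_del(&pt->shrinker_list);
spin_unlock(&shrinker_lock);

/* synchronize_shrinkers() -- after patch 7/8 just a bare
 * synchronize_srcu(&shrinker_srcu) -- waits out exactly those
 * in-flight do_shrink_slab() calls before the pool's pages go away.
 */
synchronize_shrinkers();
```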

* Re: [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-09  7:06     ` Qi Zheng
@ 2023-03-09  8:11       ` Christian König
  2023-03-09  8:32         ` Qi Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: Christian König @ 2023-03-09  8:11 UTC (permalink / raw)
  To: Qi Zheng, Kirill Tkhai, akpm, hannes, shakeelb, mhocko,
	roman.gushchin, muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

Am 09.03.23 um 08:06 schrieb Qi Zheng:
> Hi Kirill,
>
> On 2023/3/9 06:39, Kirill Tkhai wrote:
>> On 07.03.2023 09:56, Qi Zheng wrote:
>>> Now there are no readers of shrinker_rwsem, so
>>> synchronize_shrinkers() does not need to hold the
>>> writer of shrinker_rwsem to wait for all running
>>> shrinkers to complete; synchronize_srcu() is enough.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>   mm/vmscan.c | 8 ++------
>>>   1 file changed, 2 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 7aaf6f94ac1b..ac7ab4aa344f 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>   /**
>>>    * synchronize_shrinkers - Wait for all running shrinkers to 
>>> complete.
>>>    *
>>> - * This is equivalent to calling unregister_shrink() and 
>>> register_shrinker(),
>>> - * but atomically and with less overhead. This is useful to 
>>> guarantee that all
>>> - * shrinker invocations have seen an update, before freeing memory, 
>>> similar to
>>> - * rcu.
>>> + * This is useful to guarantee that all shrinker invocations have 
>>> seen an
>>> + * update, before freeing memory.
>>>    */
>>>   void synchronize_shrinkers(void)
>>>   {
>>> -    down_write(&shrinker_rwsem);
>>> -    up_write(&shrinker_rwsem);
>>>       atomic_inc(&shrinker_srcu_generation);
>>>       synchronize_srcu(&shrinker_srcu);
>>>   }
>>
>> Just curious: don't callers of synchronize_shrinkers() want parallel
>> register_shrinker() and unregister_shrinker() calls to have completed?
>> Here we should only wait for parallel shrink_slab(), correct?
>
> I think yes.
>
> The synchronize_shrinkers() is currently only used by TTM pool.
>
> In TTM pool, a shrinker named "drm-ttm_pool" is registered, and
> the scan_objects callback will pick an entry from its own shrinker_list:
>
> ttm_pool_shrink
> --> spin_lock(&shrinker_lock);
>     pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
>     list_move_tail(&pt->shrinker_list, &shrinker_list);
>     spin_unlock(&shrinker_lock);
>
> These entries have been removed from shrinker_list before calling
> synchronize_shrinkers():
>
> ttm_pool_fini
> --> ttm_pool_type_fini
>     --> spin_lock(&shrinker_lock);
>     list_del(&pt->shrinker_list);
>     spin_unlock(&shrinker_lock);
>     synchronize_shrinkers
>
> So IIUC, we only need to wait for the parallel shrink_slab() here. Like
> its comment says:
>
> /* We removed the pool types from the LRU, but we need to also make sure
>  * that no shrinker is concurrently freeing pages from the pool.
>  */

Yes, your analysis is completely correct.

I just didn't want to add another SRCU into the critical code paths of
the TTM pool just for device hot plug when I have that functionality
already.

We just make sure that no shrinker is running in parallel with
destruction of the pool. Registering and unregistering are harmless.

Regards,
Christian.

>
> + CC: Christian König :)
>
> Thanks,
> Qi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-09  8:11       ` Christian König
@ 2023-03-09  8:32         ` Qi Zheng
  2023-03-09 19:34           ` Kirill Tkhai
  0 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2023-03-09  8:32 UTC (permalink / raw)
  To: Christian König, Kirill Tkhai, akpm, hannes, shakeelb,
	mhocko, roman.gushchin, muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

Hi Christian,

On 2023/3/9 16:11, Christian König wrote:
> Am 09.03.23 um 08:06 schrieb Qi Zheng:
>> Hi Kirill,
>>
>> On 2023/3/9 06:39, Kirill Tkhai wrote:
>>> On 07.03.2023 09:56, Qi Zheng wrote:
>>>> Now there are no readers of shrinker_rwsem, so
>>>> synchronize_shrinkers() does not need to hold the
>>>> writer of shrinker_rwsem to wait for all running
>>>> shrinkers to complete; synchronize_srcu() is enough.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>>   mm/vmscan.c | 8 ++------
>>>>   1 file changed, 2 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 7aaf6f94ac1b..ac7ab4aa344f 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>   /**
>>>>    * synchronize_shrinkers - Wait for all running shrinkers to 
>>>> complete.
>>>>    *
>>>> - * This is equivalent to calling unregister_shrink() and 
>>>> register_shrinker(),
>>>> - * but atomically and with less overhead. This is useful to 
>>>> guarantee that all
>>>> - * shrinker invocations have seen an update, before freeing memory, 
>>>> similar to
>>>> - * rcu.
>>>> + * This is useful to guarantee that all shrinker invocations have 
>>>> seen an
>>>> + * update, before freeing memory.
>>>>    */
>>>>   void synchronize_shrinkers(void)
>>>>   {
>>>> -    down_write(&shrinker_rwsem);
>>>> -    up_write(&shrinker_rwsem);
>>>>       atomic_inc(&shrinker_srcu_generation);
>>>>       synchronize_srcu(&shrinker_srcu);
>>>>   }
>>>
>>> Just curious: don't callers of synchronize_shrinkers() want parallel
>>> register_shrinker() and unregister_shrinker() calls to have completed?
>>> Here we should only wait for parallel shrink_slab(), correct?
>>
>> I think yes.
>>
>> The synchronize_shrinkers() is currently only used by TTM pool.
>>
>> In TTM pool, a shrinker named "drm-ttm_pool" is registered, and
>> the scan_objects callback will pick an entry from its own shrinker_list:
>>
>> ttm_pool_shrink
>> --> spin_lock(&shrinker_lock);
>>     pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
>>     list_move_tail(&pt->shrinker_list, &shrinker_list);
>>     spin_unlock(&shrinker_lock);
>>
>> These entries have been removed from shrinker_list before calling
>> synchronize_shrinkers():
>>
>> ttm_pool_fini
>> --> ttm_pool_type_fini
>>     --> spin_lock(&shrinker_lock);
>>     list_del(&pt->shrinker_list);
>>     spin_unlock(&shrinker_lock);
>>     synchronize_shrinkers
>>
>> So IIUC, we only need to wait for the parallel shrink_slab() here. Like
>> its comment says:
>>
>> /* We removed the pool types from the LRU, but we need to also make sure
>>  * that no shrinker is concurrently freeing pages from the pool.
>>  */
> 
> Yes, your analysis is completely correct.
> 
> I just didn't want to add another SRCU into the critical code paths of 
> the TTM pool just for device hot plug when I have that functionality 
> already.
> 
> We just make sure that no shrinker is running in parallel with 
> destruction of the pool. Registering and unregistering are harmless.

That's great, thanks for confirming.

Thanks,
Qi

> 
> Regards,
> Christian.
> 
>>
>> + CC: Christian König :)
>>
>> Thanks,
>> Qi
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation
  2023-03-07  6:56 ` [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation Qi Zheng
@ 2023-03-09  9:23   ` Vlastimil Babka
  2023-03-09 10:12     ` Qi Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-09  9:23 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:56, Qi Zheng wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> After we make slab shrink lockless with SRCU, the longest
> sleep in unregister_shrinker() will be waiting for all
> running do_shrink_slab() calls to complete.
> 
> To aviod long unbreakable action in the unregister_shrinker(),

     ^ avoid

> add shrinker_srcu_generation to restore a check similar to the
> rwsem_is_contended() check that we had before.
> 
> And for memcg slab shrink, we unlock SRCU and continue
> iterations from the next shrinker id.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 36+ messages in thread
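
To make the mechanism concrete, here is a minimal sketch of the generation
check this patch describes. The names (shrinker_srcu, shrinker_srcu_generation)
follow the series, but the body is condensed from the global shrink_slab()
path and is illustrative, not the exact patch.

/*
 * Sketch: walk the shrinker list under SRCU, but bail out if a
 * concurrent unregister_shrinker() bumped the generation counter.
 */
static unsigned long shrink_slab_sketch(gfp_t gfp_mask, int nid,
					struct mem_cgroup *memcg, int priority)
{
	unsigned long freed = 0;
	struct shrinker *shrinker;
	int srcu_idx, generation;

	srcu_idx = srcu_read_lock(&shrinker_srcu);
	generation = atomic_read(&shrinker_srcu_generation);

	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu)) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		freed += do_shrink_slab(&sc, shrinker, priority);

		/*
		 * unregister_shrinker() bumped the generation: stop the
		 * walk so this reader does not stretch the SRCU grace
		 * period the unregistering side is waiting on.
		 */
		if (atomic_read(&shrinker_srcu_generation) != generation) {
			freed = freed ? : 1;
			break;
		}
	}

	srcu_read_unlock(&shrinker_srcu, srcu_idx);
	return freed;
}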

* Re: [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless
  2023-03-07  6:56 ` [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
@ 2023-03-09  9:36   ` Vlastimil Babka
  2023-03-09  9:39   ` Vlastimil Babka
  2023-03-09 19:30   ` Kirill Tkhai
  2 siblings, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-09  9:36 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:56, Qi Zheng wrote:
> Like global and memcg slab shrink, also use SRCU to
> make count and scan operations in memory shrinker
> debugfs lockless.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/shrinker_debug.c | 24 +++++++-----------------
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> index 39c3491e28a3..6aa7a7ec69da 100644
> --- a/mm/shrinker_debug.c
> +++ b/mm/shrinker_debug.c
> @@ -9,6 +9,7 @@
>  /* defined in vmscan.c */
>  extern struct rw_semaphore shrinker_rwsem;
>  extern struct list_head shrinker_list;
> +extern struct srcu_struct shrinker_srcu;
>  
>  static DEFINE_IDA(shrinker_debugfs_ida);
>  static struct dentry *shrinker_debugfs_root;
> @@ -49,18 +50,13 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
>  	struct mem_cgroup *memcg;
>  	unsigned long total;
>  	bool memcg_aware;
> -	int ret, nid;
> +	int ret = 0, nid, srcu_idx;
>  
>  	count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
>  	if (!count_per_node)
>  		return -ENOMEM;
>  
> -	ret = down_read_killable(&shrinker_rwsem);
> -	if (ret) {
> -		kfree(count_per_node);
> -		return ret;
> -	}
> -	rcu_read_lock();
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>  
>  	memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
>  
> @@ -91,8 +87,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
>  		}
>  	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>  
> -	rcu_read_unlock();
> -	up_read(&shrinker_rwsem);
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  
>  	kfree(count_per_node);
>  	return ret;
> @@ -115,9 +110,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
>  		.gfp_mask = GFP_KERNEL,
>  	};
>  	struct mem_cgroup *memcg = NULL;
> -	int nid;
> +	int nid, srcu_idx;
>  	char kbuf[72];
> -	ssize_t ret;
>  
>  	read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
>  	if (copy_from_user(kbuf, buf, read_len))
> @@ -146,11 +140,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
>  		return -EINVAL;
>  	}
>  
> -	ret = down_read_killable(&shrinker_rwsem);
> -	if (ret) {
> -		mem_cgroup_put(memcg);
> -		return ret;
> -	}
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>  
>  	sc.nid = nid;
>  	sc.memcg = memcg;
> @@ -159,7 +149,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
>  
>  	shrinker->scan_objects(shrinker, &sc);
>  
> -	up_read(&shrinker_rwsem);
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  	mem_cgroup_put(memcg);
>  
>  	return size;


^ permalink raw reply	[flat|nested] 36+ messages in thread
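
For readers less familiar with SRCU, the pairing this conversion relies on
looks like the following in isolation. The srcu_* and list_del_rcu() calls
are the real API from <linux/srcu.h> and <linux/rculist.h>; the surrounding
functions are placeholders, so this is a sketch rather than kernel source.

#include <linux/srcu.h>
#include <linux/rculist.h>

DEFINE_SRCU(shrinker_srcu);

/* Reader side: cheap to enter/exit and, unlike plain RCU, may sleep. */
static void reader_side(struct shrinker *shrinker, struct shrink_control *sc)
{
	int srcu_idx = srcu_read_lock(&shrinker_srcu);

	shrinker->scan_objects(shrinker, sc);	/* may sleep under SRCU */

	srcu_read_unlock(&shrinker_srcu, srcu_idx);
}

/* Writer side: unpublish, then wait out every reader already inside. */
static void writer_side(struct shrinker *shrinker)
{
	list_del_rcu(&shrinker->list);		/* new readers can't find it */
	synchronize_srcu(&shrinker_srcu);	/* old readers have left */
	/* Now safe to free @shrinker and whatever it scans. */
}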

* Re: [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred
  2023-03-07  6:56 ` [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
@ 2023-03-09  9:36   ` Vlastimil Babka
  2023-03-09 19:32   ` Kirill Tkhai
  1 sibling, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-09  9:36 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:56, Qi Zheng wrote:
> For now, reparent_shrinker_deferred() is the only holder
> of the read lock of shrinker_rwsem. And it already holds the
> global cgroup_mutex, so it will not be called in parallel.
> 
> Therefore, in order to convert shrinker_rwsem to shrinker_mutex
> later, change it to hold the write lock of shrinker_rwsem
> while reparenting.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmscan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9a5a3da5c8b5..7aaf6f94ac1b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -451,7 +451,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  		parent = root_mem_cgroup;
>  
>  	/* Prevent from concurrent shrinker_info expand */
> -	down_read(&shrinker_rwsem);
> +	down_write(&shrinker_rwsem);
>  	for_each_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
> @@ -460,7 +460,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  			atomic_long_add(nr, &parent_info->nr_deferred[i]);
>  		}
>  	}
> -	up_read(&shrinker_rwsem);
> +	up_write(&shrinker_rwsem);
>  }
>  
>  static bool cgroup_reclaim(struct scan_control *sc)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless
  2023-03-07  6:56 ` [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
  2023-03-09  9:36   ` Vlastimil Babka
@ 2023-03-09  9:39   ` Vlastimil Babka
  2023-03-09 10:14     ` Qi Zheng
  2023-03-09 19:30   ` Kirill Tkhai
  2 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-09  9:39 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:56, Qi Zheng wrote:
> Like global and memcg slab shrink, also use SRCU to
> make count and scan operations in memory shrinker
> debugfs lockless.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  mm/shrinker_debug.c | 24 +++++++-----------------
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> index 39c3491e28a3..6aa7a7ec69da 100644
> --- a/mm/shrinker_debug.c
> +++ b/mm/shrinker_debug.c

Forgot to mention that this file should now likely also #include <linux/srcu.h>

> @@ -9,6 +9,7 @@
>  /* defined in vmscan.c */
>  extern struct rw_semaphore shrinker_rwsem;
>  extern struct list_head shrinker_list;
> +extern struct srcu_struct shrinker_srcu;


^ permalink raw reply	[flat|nested] 36+ messages in thread
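
The fix is the one-liner Vlastimil suggests; at the top of
mm/shrinker_debug.c it would look roughly like this (a sketch — the
neighboring includes are assumed from the file's quoted context):

	#include <linux/shrinker.h>
	#include <linux/srcu.h>		/* srcu_read_lock()/srcu_read_unlock() */
	#include <linux/memcontrol.h>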

* Re: [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-07  6:56 ` [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
  2023-03-08 22:39   ` Kirill Tkhai
@ 2023-03-09  9:40   ` Vlastimil Babka
  2023-03-09 19:34   ` Kirill Tkhai
  2 siblings, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-09  9:40 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:56, Qi Zheng wrote:
> Now there are no readers of shrinker_rwsem, so
> synchronize_shrinkers() does not need to hold the
> write lock of shrinker_rwsem to wait for all running
> shrinkers to complete; synchronize_srcu() is enough.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/vmscan.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7aaf6f94ac1b..ac7ab4aa344f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
>  /**
>   * synchronize_shrinkers - Wait for all running shrinkers to complete.
>   *
> - * This is equivalent to calling unregister_shrink() and register_shrinker(),
> - * but atomically and with less overhead. This is useful to guarantee that all
> - * shrinker invocations have seen an update, before freeing memory, similar to
> - * rcu.
> + * This is useful to guarantee that all shrinker invocations have seen an
> + * update, before freeing memory.
>   */
>  void synchronize_shrinkers(void)
>  {
> -	down_write(&shrinker_rwsem);
> -	up_write(&shrinker_rwsem);
>  	atomic_inc(&shrinker_srcu_generation);
>  	synchronize_srcu(&shrinker_srcu);
>  }


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex
  2023-03-07  6:56 ` [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
@ 2023-03-09  9:42   ` Vlastimil Babka
  2023-03-09 19:49   ` Kirill Tkhai
  1 sibling, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2023-03-09  9:42 UTC (permalink / raw)
  To: Qi Zheng, akpm, tkhai, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 3/7/23 07:56, Qi Zheng wrote:
> Now there are no readers of shrinker_rwsem, so we
> can simply replace it with a mutex.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation
  2023-03-09  9:23   ` Vlastimil Babka
@ 2023-03-09 10:12     ` Qi Zheng
  0 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-09 10:12 UTC (permalink / raw)
  To: Vlastimil Babka, akpm, tkhai, hannes, shakeelb, mhocko,
	roman.gushchin, muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel



On 2023/3/9 17:23, Vlastimil Babka wrote:
> On 3/7/23 07:56, Qi Zheng wrote:
>> From: Kirill Tkhai <tkhai@ya.ru>
>>
>> After we make slab shrink lockless with SRCU, the longest
>> sleep in unregister_shrinker() will be waiting for all
>> running do_shrink_slab() calls to complete.
>>
>> To aviod long unbreakable action in the unregister_shrinker(),
> 
>       ^ avoid

will change.

> 
>> add shrinker_srcu_generation to restore a check similar to the
>> rwsem_is_contended() check that we had before.
>>
>> And for memcg slab shrink, we unlock SRCU and continue
>> iterations from the next shrinker id.
>>
>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.

> 
> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless
  2023-03-09  9:39   ` Vlastimil Babka
@ 2023-03-09 10:14     ` Qi Zheng
  0 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2023-03-09 10:14 UTC (permalink / raw)
  To: Vlastimil Babka, akpm, tkhai, hannes, shakeelb, mhocko,
	roman.gushchin, muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel



On 2023/3/9 17:39, Vlastimil Babka wrote:
> On 3/7/23 07:56, Qi Zheng wrote:
>> Like global and memcg slab shrink, also use SRCU to
>> make count and scan operations in memory shrinker
>> debugfs lockless.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/shrinker_debug.c | 24 +++++++-----------------
>>   1 file changed, 7 insertions(+), 17 deletions(-)
>>
>> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
>> index 39c3491e28a3..6aa7a7ec69da 100644
>> --- a/mm/shrinker_debug.c
>> +++ b/mm/shrinker_debug.c
> 
> Forgot to mention that this file should now likely also #include <linux/srcu.h>

Got it. Will do.

Thanks,
Qi

> 
>> @@ -9,6 +9,7 @@
>>   /* defined in vmscan.c */
>>   extern struct rw_semaphore shrinker_rwsem;
>>   extern struct list_head shrinker_list;
>> +extern struct srcu_struct shrinker_srcu;
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless
  2023-03-07  6:56 ` [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
  2023-03-09  9:36   ` Vlastimil Babka
  2023-03-09  9:39   ` Vlastimil Babka
@ 2023-03-09 19:30   ` Kirill Tkhai
  2 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-09 19:30 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 07.03.2023 09:56, Qi Zheng wrote:
> Like global and memcg slab shrink, also use SRCU to
> make count and scan operations in memory shrinker
> debugfs lockless.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Kirill Tkhai <tkhai@ya.ru>

> ---
>  mm/shrinker_debug.c | 24 +++++++-----------------
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> index 39c3491e28a3..6aa7a7ec69da 100644
> --- a/mm/shrinker_debug.c
> +++ b/mm/shrinker_debug.c
> @@ -9,6 +9,7 @@
>  /* defined in vmscan.c */
>  extern struct rw_semaphore shrinker_rwsem;
>  extern struct list_head shrinker_list;
> +extern struct srcu_struct shrinker_srcu;
>  
>  static DEFINE_IDA(shrinker_debugfs_ida);
>  static struct dentry *shrinker_debugfs_root;
> @@ -49,18 +50,13 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
>  	struct mem_cgroup *memcg;
>  	unsigned long total;
>  	bool memcg_aware;
> -	int ret, nid;
> +	int ret = 0, nid, srcu_idx;
>  
>  	count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
>  	if (!count_per_node)
>  		return -ENOMEM;
>  
> -	ret = down_read_killable(&shrinker_rwsem);
> -	if (ret) {
> -		kfree(count_per_node);
> -		return ret;
> -	}
> -	rcu_read_lock();
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>  
>  	memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
>  
> @@ -91,8 +87,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
>  		}
>  	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>  
> -	rcu_read_unlock();
> -	up_read(&shrinker_rwsem);
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  
>  	kfree(count_per_node);
>  	return ret;
> @@ -115,9 +110,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
>  		.gfp_mask = GFP_KERNEL,
>  	};
>  	struct mem_cgroup *memcg = NULL;
> -	int nid;
> +	int nid, srcu_idx;
>  	char kbuf[72];
> -	ssize_t ret;
>  
>  	read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
>  	if (copy_from_user(kbuf, buf, read_len))
> @@ -146,11 +140,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
>  		return -EINVAL;
>  	}
>  
> -	ret = down_read_killable(&shrinker_rwsem);
> -	if (ret) {
> -		mem_cgroup_put(memcg);
> -		return ret;
> -	}
> +	srcu_idx = srcu_read_lock(&shrinker_srcu);
>  
>  	sc.nid = nid;
>  	sc.memcg = memcg;
> @@ -159,7 +149,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
>  
>  	shrinker->scan_objects(shrinker, &sc);
>  
> -	up_read(&shrinker_rwsem);
> +	srcu_read_unlock(&shrinker_srcu, srcu_idx);
>  	mem_cgroup_put(memcg);
>  
>  	return size;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred
  2023-03-07  6:56 ` [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
  2023-03-09  9:36   ` Vlastimil Babka
@ 2023-03-09 19:32   ` Kirill Tkhai
  1 sibling, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-09 19:32 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 07.03.2023 09:56, Qi Zheng wrote:
> For now, reparent_shrinker_deferred() is the only holder
> of the read lock of shrinker_rwsem. And it already holds the
> global cgroup_mutex, so it will not be called in parallel.
> 
> Therefore, in order to convert shrinker_rwsem to shrinker_mutex
> later, change it to hold the write lock of shrinker_rwsem
> while reparenting.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Kirill Tkhai <tkhai@ya.ru>

> ---
>  mm/vmscan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9a5a3da5c8b5..7aaf6f94ac1b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -451,7 +451,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  		parent = root_mem_cgroup;
>  
>  	/* Prevent from concurrent shrinker_info expand */
> -	down_read(&shrinker_rwsem);
> +	down_write(&shrinker_rwsem);
>  	for_each_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
> @@ -460,7 +460,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  			atomic_long_add(nr, &parent_info->nr_deferred[i]);
>  		}
>  	}
> -	up_read(&shrinker_rwsem);
> +	up_write(&shrinker_rwsem);
>  }
>  
>  static bool cgroup_reclaim(struct scan_control *sc)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-09  8:32         ` Qi Zheng
@ 2023-03-09 19:34           ` Kirill Tkhai
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-09 19:34 UTC (permalink / raw)
  To: Qi Zheng, Christian König, akpm, hannes, shakeelb, mhocko,
	roman.gushchin, muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 09.03.2023 11:32, Qi Zheng wrote:
> Hi Christian,
> 
> On 2023/3/9 16:11, Christian König wrote:
>> On 09.03.23 08:06, Qi Zheng wrote:
>>> Hi Kirill,
>>>
>>> On 2023/3/9 06:39, Kirill Tkhai wrote:
>>>> On 07.03.2023 09:56, Qi Zheng wrote:
>>>>> Now there are no readers of shrinker_rwsem, so
>>>>> synchronize_shrinkers() does not need to hold the
>>>>> write lock of shrinker_rwsem to wait for all running
>>>>> shrinkers to complete; synchronize_srcu() is enough.
>>>>>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>>   mm/vmscan.c | 8 ++------
>>>>>   1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 7aaf6f94ac1b..ac7ab4aa344f 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
>>>>>   /**
>>>>>    * synchronize_shrinkers - Wait for all running shrinkers to complete.
>>>>>    *
>>>>> - * This is equivalent to calling unregister_shrink() and register_shrinker(),
>>>>> - * but atomically and with less overhead. This is useful to guarantee that all
>>>>> - * shrinker invocations have seen an update, before freeing memory, similar to
>>>>> - * rcu.
>>>>> + * This is useful to guarantee that all shrinker invocations have seen an
>>>>> + * update, before freeing memory.
>>>>>    */
>>>>>   void synchronize_shrinkers(void)
>>>>>   {
>>>>> -    down_write(&shrinker_rwsem);
>>>>> -    up_write(&shrinker_rwsem);
>>>>>       atomic_inc(&shrinker_srcu_generation);
>>>>>       synchronize_srcu(&shrinker_srcu);
>>>>>   }
>>>>
>>>> Just curious, callers of synchronize_shrinkers() don't need parallel register_shrinker() and unregister_shrinker() calls to have completed, do they?
>>>> Here we should only wait for parallel shrink_slab(), correct?
>>>
>>> I think yes.
>>>
>>> The synchronize_shrinkers() is currently only used by TTM pool.
>>>
>>> In TTM pool, a shrinker named "drm-ttm_pool" is registered, and
>>> the scan_objects callback will pick an entry from its own shrinker_list:
>>>
>>> ttm_pool_shrink
>>> --> spin_lock(&shrinker_lock);
>>>     pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
>>>     list_move_tail(&pt->shrinker_list, &shrinker_list);
>>>     spin_unlock(&shrinker_lock);
>>>
>>> These entries have been removed from shrinker_list before calling
>>> synchronize_shrinkers():
>>>
>>> ttm_pool_fini
>>> --> ttm_pool_type_fini
>>>     --> spin_lock(&shrinker_lock);
>>>     list_del(&pt->shrinker_list);
>>>     spin_unlock(&shrinker_lock);
>>>     synchronize_shrinkers
>>>
>>> So IIUC, we only need to wait for the parallel shrink_slab() here. Like
>>> its comment says:
>>>
>>> /* We removed the pool types from the LRU, but we need to also make sure
>>>  * that no shrinker is concurrently freeing pages from the pool.
>>>  */
>>
>> Yes, your analysis is completely correct.
>>
>> I just didn't want to add another SRCU into the critical code paths of the TTM pool just for device hot plug when I have that functionality already.
>>
>> We just make sure that no shrinker is running in parallel with destruction of the pool. Registering and unregistering are harmless.
> 
> That's great, thanks for confirming.
> 
> Thanks,
> Qi

Christian and Qi, thanks for the explanation.

>>
>> Regards,
>> Christian.
>>
>>>
>>> + CC: Christian König :)
>>>
>>> Thanks,
>>> Qi
>>
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
  2023-03-07  6:56 ` [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
  2023-03-08 22:39   ` Kirill Tkhai
  2023-03-09  9:40   ` Vlastimil Babka
@ 2023-03-09 19:34   ` Kirill Tkhai
  2 siblings, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-09 19:34 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 07.03.2023 09:56, Qi Zheng wrote:
> Now there are no readers of shrinker_rwsem, so
> synchronize_shrinkers() does not need to hold the
> write lock of shrinker_rwsem to wait for all running
> shrinkers to complete; synchronize_srcu() is enough.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Kirill Tkhai <tkhai@ya.ru>

> ---
>  mm/vmscan.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7aaf6f94ac1b..ac7ab4aa344f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -796,15 +796,11 @@ EXPORT_SYMBOL(unregister_shrinker);
>  /**
>   * synchronize_shrinkers - Wait for all running shrinkers to complete.
>   *
> - * This is equivalent to calling unregister_shrink() and register_shrinker(),
> - * but atomically and with less overhead. This is useful to guarantee that all
> - * shrinker invocations have seen an update, before freeing memory, similar to
> - * rcu.
> + * This is useful to guarantee that all shrinker invocations have seen an
> + * update, before freeing memory.
>   */
>  void synchronize_shrinkers(void)
>  {
> -	down_write(&shrinker_rwsem);
> -	up_write(&shrinker_rwsem);
>  	atomic_inc(&shrinker_srcu_generation);
>  	synchronize_srcu(&shrinker_srcu);
>  }


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex
  2023-03-07  6:56 ` [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
  2023-03-09  9:42   ` Vlastimil Babka
@ 2023-03-09 19:49   ` Kirill Tkhai
  1 sibling, 0 replies; 36+ messages in thread
From: Kirill Tkhai @ 2023-03-09 19:49 UTC (permalink / raw)
  To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
	muchun.song, david, shy828301, rppt
  Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel

On 07.03.2023 09:56, Qi Zheng wrote:
> Now there are no readers of shrinker_rwsem, so we
> can simply replace it with a mutex.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Acked-by: Kirill Tkhai <tkhai@ya.ru>

> ---
>  drivers/md/dm-cache-metadata.c |  2 +-
>  drivers/md/dm-thin-metadata.c  |  2 +-
>  fs/super.c                     |  2 +-
>  mm/shrinker_debug.c            | 14 +++++++-------
>  mm/vmscan.c                    | 34 +++++++++++++++++-----------------
>  5 files changed, 27 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
> index acffed750e3e..9e0c69958587 100644
> --- a/drivers/md/dm-cache-metadata.c
> +++ b/drivers/md/dm-cache-metadata.c
> @@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
>  	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
>  	 * cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
>  	 * shrinker associated with the block manager's bufio client vs cmd root_lock).
> -	 * - must take shrinker_rwsem without holding cmd->root_lock
> +	 * - must take shrinker_mutex without holding cmd->root_lock
>  	 */
>  	new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
>  					 CACHE_MAX_CONCURRENT_LOCKS);
> diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
> index fd464fb024c3..9f5cb52c5763 100644
> --- a/drivers/md/dm-thin-metadata.c
> +++ b/drivers/md/dm-thin-metadata.c
> @@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
>  	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
>  	 * pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
>  	 * shrinker associated with the block manager's bufio client vs pmd root_lock).
> -	 * - must take shrinker_rwsem without holding pmd->root_lock
> +	 * - must take shrinker_mutex without holding pmd->root_lock
>  	 */
>  	new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
>  					 THIN_MAX_CONCURRENT_LOCKS);
> diff --git a/fs/super.c b/fs/super.c
> index 84332d5cb817..91a4037b1d95 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
>   * One thing we have to be careful of with a per-sb shrinker is that we don't
>   * drop the last active reference to the superblock from within the shrinker.
>   * If that happens we could trigger unregistering the shrinker from within the
> - * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
> + * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
>   * take a passive reference to the superblock to avoid this from occurring.
>   */
>  static unsigned long super_cache_scan(struct shrinker *shrink,
> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> index 6aa7a7ec69da..b0f6aff372df 100644
> --- a/mm/shrinker_debug.c
> +++ b/mm/shrinker_debug.c
> @@ -7,7 +7,7 @@
>  #include <linux/memcontrol.h>
>  
>  /* defined in vmscan.c */
> -extern struct rw_semaphore shrinker_rwsem;
> +extern struct mutex shrinker_mutex;
>  extern struct list_head shrinker_list;
>  extern struct srcu_struct shrinker_srcu;
>  
> @@ -167,7 +167,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>  	char buf[128];
>  	int id;
>  
> -	lockdep_assert_held(&shrinker_rwsem);
> +	lockdep_assert_held(&shrinker_mutex);
>  
>  	/* debugfs isn't initialized yet, add debugfs entries later. */
>  	if (!shrinker_debugfs_root)
> @@ -210,7 +210,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
>  	if (!new)
>  		return -ENOMEM;
>  
> -	down_write(&shrinker_rwsem);
> +	mutex_lock(&shrinker_mutex);
>  
>  	old = shrinker->name;
>  	shrinker->name = new;
> @@ -228,7 +228,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
>  			shrinker->debugfs_entry = entry;
>  	}
>  
> -	up_write(&shrinker_rwsem);
> +	mutex_unlock(&shrinker_mutex);
>  
>  	kfree_const(old);
>  
> @@ -240,7 +240,7 @@ struct dentry *shrinker_debugfs_remove(struct shrinker *shrinker)
>  {
>  	struct dentry *entry = shrinker->debugfs_entry;
>  
> -	lockdep_assert_held(&shrinker_rwsem);
> +	lockdep_assert_held(&shrinker_mutex);
>  
>  	kfree_const(shrinker->name);
>  	shrinker->name = NULL;
> @@ -265,14 +265,14 @@ static int __init shrinker_debugfs_init(void)
>  	shrinker_debugfs_root = dentry;
>  
>  	/* Create debugfs entries for shrinkers registered at boot */
> -	down_write(&shrinker_rwsem);
> +	mutex_lock(&shrinker_mutex);
>  	list_for_each_entry(shrinker, &shrinker_list, list)
>  		if (!shrinker->debugfs_entry) {
>  			ret = shrinker_debugfs_add(shrinker);
>  			if (ret)
>  				break;
>  		}
> -	up_write(&shrinker_rwsem);
> +	mutex_unlock(&shrinker_mutex);
>  
>  	return ret;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ac7ab4aa344f..c00302fabc3d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -35,7 +35,7 @@
>  #include <linux/cpuset.h>
>  #include <linux/compaction.h>
>  #include <linux/notifier.h>
> -#include <linux/rwsem.h>
> +#include <linux/mutex.h>
>  #include <linux/delay.h>
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
> @@ -202,7 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>  }
>  
>  LIST_HEAD(shrinker_list);
> -DECLARE_RWSEM(shrinker_rwsem);
> +DEFINE_MUTEX(shrinker_mutex);
>  DEFINE_SRCU(shrinker_srcu);
>  static atomic_t shrinker_srcu_generation = ATOMIC_INIT(0);
>  
> @@ -225,7 +225,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
>  {
>  	return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
>  				      &shrinker_srcu,
> -				      lockdep_is_held(&shrinker_rwsem));
> +				      lockdep_is_held(&shrinker_mutex));
>  }
>  
>  static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
> @@ -310,7 +310,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  	int nid, size, ret = 0;
>  	int map_size, defer_size = 0;
>  
> -	down_write(&shrinker_rwsem);
> +	mutex_lock(&shrinker_mutex);
>  	map_size = shrinker_map_size(shrinker_nr_max);
>  	defer_size = shrinker_defer_size(shrinker_nr_max);
>  	size = map_size + defer_size;
> @@ -326,7 +326,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>  		info->map_nr_max = shrinker_nr_max;
>  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>  	}
> -	up_write(&shrinker_rwsem);
> +	mutex_unlock(&shrinker_mutex);
>  
>  	return ret;
>  }
> @@ -342,7 +342,7 @@ static int expand_shrinker_info(int new_id)
>  	if (!root_mem_cgroup)
>  		goto out;
>  
> -	lockdep_assert_held(&shrinker_rwsem);
> +	lockdep_assert_held(&shrinker_mutex);
>  
>  	map_size = shrinker_map_size(new_nr_max);
>  	defer_size = shrinker_defer_size(new_nr_max);
> @@ -392,7 +392,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  	if (mem_cgroup_disabled())
>  		return -ENOSYS;
>  
> -	down_write(&shrinker_rwsem);
> +	mutex_lock(&shrinker_mutex);
>  	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
>  	if (id < 0)
>  		goto unlock;
> @@ -406,7 +406,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  	shrinker->id = id;
>  	ret = 0;
>  unlock:
> -	up_write(&shrinker_rwsem);
> +	mutex_unlock(&shrinker_mutex);
>  	return ret;
>  }
>  
> @@ -416,7 +416,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>  
>  	BUG_ON(id < 0);
>  
> -	lockdep_assert_held(&shrinker_rwsem);
> +	lockdep_assert_held(&shrinker_mutex);
>  
>  	idr_remove(&shrinker_idr, id);
>  }
> @@ -451,7 +451,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  		parent = root_mem_cgroup;
>  
>  	/* Prevent from concurrent shrinker_info expand */
> -	down_write(&shrinker_rwsem);
> +	mutex_lock(&shrinker_mutex);
>  	for_each_node(nid) {
>  		child_info = shrinker_info_protected(memcg, nid);
>  		parent_info = shrinker_info_protected(parent, nid);
> @@ -460,7 +460,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>  			atomic_long_add(nr, &parent_info->nr_deferred[i]);
>  		}
>  	}
> -	up_write(&shrinker_rwsem);
> +	mutex_unlock(&shrinker_mutex);
>  }
>  
>  static bool cgroup_reclaim(struct scan_control *sc)
> @@ -709,9 +709,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>  	shrinker->name = NULL;
>  #endif
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> -		down_write(&shrinker_rwsem);
> +		mutex_lock(&shrinker_mutex);
>  		unregister_memcg_shrinker(shrinker);
> -		up_write(&shrinker_rwsem);
> +		mutex_unlock(&shrinker_mutex);
>  		return;
>  	}
>  
> @@ -721,11 +721,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>  
>  void register_shrinker_prepared(struct shrinker *shrinker)
>  {
> -	down_write(&shrinker_rwsem);
> +	mutex_lock(&shrinker_mutex);
>  	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>  	shrinker->flags |= SHRINKER_REGISTERED;
>  	shrinker_debugfs_add(shrinker);
> -	up_write(&shrinker_rwsem);
> +	mutex_unlock(&shrinker_mutex);
>  }
>  
>  static int __register_shrinker(struct shrinker *shrinker)
> @@ -775,13 +775,13 @@ void unregister_shrinker(struct shrinker *shrinker)
>  	if (!(shrinker->flags & SHRINKER_REGISTERED))
>  		return;
>  
> -	down_write(&shrinker_rwsem);
> +	mutex_lock(&shrinker_mutex);
>  	list_del_rcu(&shrinker->list);
>  	shrinker->flags &= ~SHRINKER_REGISTERED;
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		unregister_memcg_shrinker(shrinker);
>  	debugfs_entry = shrinker_debugfs_remove(shrinker);
> -	up_write(&shrinker_rwsem);
> +	mutex_unlock(&shrinker_mutex);
>  
>  	atomic_inc(&shrinker_srcu_generation);
>  	synchronize_srcu(&shrinker_srcu);


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2023-03-09 19:49 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
2023-03-07  6:55 ` [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
2023-03-08 14:40   ` Vlastimil Babka
2023-03-08 22:13   ` Kirill Tkhai
2023-03-09  6:33     ` Qi Zheng
2023-03-07  6:55 ` [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless Qi Zheng
2023-03-08 15:02   ` Vlastimil Babka
2023-03-08 22:18   ` Kirill Tkhai
2023-03-07  6:56 ` [PATCH v4 3/8] mm: vmscan: make memcg " Qi Zheng
2023-03-08 22:23   ` Kirill Tkhai
2023-03-08 22:46   ` Vlastimil Babka
2023-03-09  6:47     ` Qi Zheng
2023-03-07  6:56 ` [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation Qi Zheng
2023-03-09  9:23   ` Vlastimil Babka
2023-03-09 10:12     ` Qi Zheng
2023-03-07  6:56 ` [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
2023-03-09  9:36   ` Vlastimil Babka
2023-03-09  9:39   ` Vlastimil Babka
2023-03-09 10:14     ` Qi Zheng
2023-03-09 19:30   ` Kirill Tkhai
2023-03-07  6:56 ` [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-03-09  9:36   ` Vlastimil Babka
2023-03-09 19:32   ` Kirill Tkhai
2023-03-07  6:56 ` [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
2023-03-08 22:39   ` Kirill Tkhai
2023-03-09  7:06     ` Qi Zheng
2023-03-09  8:11       ` Christian König
2023-03-09  8:32         ` Qi Zheng
2023-03-09 19:34           ` Kirill Tkhai
2023-03-09  9:40   ` Vlastimil Babka
2023-03-09 19:34   ` Kirill Tkhai
2023-03-07  6:56 ` [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
2023-03-09  9:42   ` Vlastimil Babka
2023-03-09 19:49   ` Kirill Tkhai
2023-03-07 22:20 ` [PATCH v4 0/8] make slab shrink lockless Andrew Morton
2023-03-08 11:59   ` Qi Zheng
