* [PATCH 0/5] make slab shrink lockless
@ 2023-02-20 9:16 Qi Zheng
2023-02-20 9:16 ` [PATCH 1/5] mm: vmscan: make global " Qi Zheng
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Qi Zheng @ 2023-02-20 9:16 UTC (permalink / raw)
To: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
david, shy828301
Cc: tkhai, sultan, dave, penguin-kernel, paulmck, linux-mm,
linux-kernel, Qi Zheng
Hi all,
This patch series aims to make slab shrink lockless.
1. Background
=============
On our servers, we often find the following system cpu hotspots:
44.16% [kernel] [k] down_read_trylock
14.12% [kernel] [k] up_read
13.43% [kernel] [k] shrink_slab
5.25% [kernel] [k] count_shadow_nodes
3.42% [kernel] [k] idr_find
Then we used bpftrace to capture its calltrace as follows:
@[
down_read_trylock+5
shrink_slab+292
shrink_node+640
do_try_to_free_pages+211
try_to_free_mem_cgroup_pages+266
try_charge_memcg+386
charge_memcg+51
__mem_cgroup_charge+44
__handle_mm_fault+1416
handle_mm_fault+260
do_user_addr_fault+459
exc_page_fault+104
asm_exc_page_fault+38
clear_user_rep_good+18
read_zero+100
vfs_read+176
ksys_read+93
do_syscall_64+62
entry_SYSCALL_64_after_hwframe+114
]: 1868979
It is easy to see that this is caused by the frequent failure to obtain the
read lock of shrinker_rwsem when reclaiming slab memory.
Currently, the shrinker_rwsem is a global lock. And the following cases may
cause the above system cpu hotspots:
a. the write lock of shrinker_rwsem was held for too long. For example, there
are many memcgs in the system, which causes some paths to hold locks and
traverse it for too long. (e.g. expand_shrinker_info())
b. the read lock of shrinker_rwsem was held for too long, and a writer came at
this time. Then this writer will be forced to wait and block all subsequent
readers.
For example:
- be scheduled when the read lock of shrinker_rwsem is held in
do_shrink_slab()
- some shrinker are blocked for too long. Like the case mentioned in the
patchset[1].
[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
And all the down_read_trylock() hotspots caused by the above cases can be
solved by replacing the shrinker_rwsem trylocks with SRCU.
2. Survey
=========
Before doing the code implementation, I found that there were many similar
submissions in the community:
a. Davidlohr Bueso submitted a patch in 2015.
Subject: [PATCH -next v2] mm: srcu-ify shrinkers
Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
Result: It was finally merged into the linux-next branch, but failed on arm
allnoconfig (without CONFIG_SRCU)
b. Tetsuo Handa submitted a patchset in 2017.
Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
Result: Finally chose to use the current simple way (break when
rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
but SRCU was not unconditionally enabled at the time.
c. Kirill Tkhai submitted a patchset in 2018.
Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
Result: At that time, SRCU was not unconditionally enabled, and there were
some objections to enabling SRCU. Later, because Kirill's focus was
moved to other things, this patchset was not continued to be updated.
d. Sultan Alsawaf submitted a patch in 2021.
Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
Result: Rejected because SRCU was not unconditionally enabled.
We can find that almost all these historical commits were abandoned because SRCU
was not unconditionally enabled. But now SRCU has been unconditionally enable
by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
with SRCU.
[2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
3. Reproduction and testing
===========================
We can reproduce the down_read_trylock() hotspot through the following script:
```
#!/bin/bash
DIR="/root/shrinker/memcg/mnt"
do_create()
{
mkdir /sys/fs/cgroup/memory/test
echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
for i in `seq 0 $1`;
do
mkdir /sys/fs/cgroup/memory/test/$i;
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
mkdir -p $DIR/$i;
done
}
do_mount()
{
for i in `seq $1 $2`;
do
mount -t tmpfs $i $DIR/$i;
done
}
do_touch()
{
for i in `seq $1 $2`;
do
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
done
}
do_create 2000
do_mount 0 2000
do_touch 0 1000
```
Save the above script and execute it, we can get the following perf hotspots:
46.60% [kernel] [k] down_read_trylock
18.70% [kernel] [k] up_read
15.44% [kernel] [k] shrink_slab
4.37% [kernel] [k] _find_next_bit
2.75% [kernel] [k] xa_load
2.07% [kernel] [k] idr_find
1.73% [kernel] [k] do_shrink_slab
1.42% [kernel] [k] shrink_lruvec
0.74% [kernel] [k] shrink_node
0.60% [kernel] [k] list_lru_count_one
After applying this patchset, the hotspot becomes as follows:
19.53% [kernel] [k] _find_next_bit
14.63% [kernel] [k] do_shrink_slab
14.58% [kernel] [k] shrink_slab
11.83% [kernel] [k] shrink_lruvec
9.33% [kernel] [k] __blk_flush_plug
6.67% [kernel] [k] mem_cgroup_iter
3.73% [kernel] [k] list_lru_count_one
2.43% [kernel] [k] shrink_node
1.96% [kernel] [k] super_cache_count
1.78% [kernel] [k] __rcu_read_unlock
1.38% [kernel] [k] __srcu_read_lock
1.30% [kernel] [k] xas_descend
We can see that the slab reclaim is no longer blocked by shinker_rwsem trylock,
which realizes the lockless slab reclaim.
This series is based on next-20230217.
Comments and suggestions are welcome.
Thanks,
Qi.
Qi Zheng (5):
mm: vmscan: make global slab shrink lockless
mm: vmscan: make memcg slab shrink lockless
mm: shrinkers: make count and scan in shrinker debugfs lockless
mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
mm: shrinkers: convert shrinker_rwsem to mutex
drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
mm/shrinker_debug.c | 38 ++++-------
mm/vmscan.c | 119 ++++++++++++++++-----------------
5 files changed, 76 insertions(+), 87 deletions(-)
--
2.20.1
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/5] mm: vmscan: make global slab shrink lockless
2023-02-20 9:16 [PATCH 0/5] make slab shrink lockless Qi Zheng
@ 2023-02-20 9:16 ` Qi Zheng
2023-02-20 9:16 ` [PATCH 2/5] mm: vmscan: make memcg " Qi Zheng
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Qi Zheng @ 2023-02-20 9:16 UTC (permalink / raw)
To: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
david, shy828301
Cc: tkhai, sultan, dave, penguin-kernel, paulmck, linux-mm,
linux-kernel, Qi Zheng
The shrinker_rwsem is a global lock in shrinkers subsystem,
it is easy to cause blocking in the following cases:
a. the write lock of shrinker_rwsem was held for too long.
For example, there are many memcgs in the system, which
causes some paths to hold locks and traverse it for too
long. (e.g. expand_shrinker_info())
b. the read lock of shrinker_rwsem was held for too long,
and a writer came at this time. Then this writer will be
forced to wait and block all subsequent readers.
For example:
- be scheduled when the read lock of shrinker_rwsem is
held in do_shrink_slab()
- some shrinker are blocked for too long. Like the case
mentioned in the patchset[1].
Therefore, many times in history ([2],[3],[4],[5]), some
people wanted to replace shrinker_rwsem reader with SRCU,
but they all gave up because SRCU was not unconditionally
enabled.
But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
the SRCU is unconditionally enabled. So it's time to use
SRCU to protect readers who previously held shrinker_rwsem.
[1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
[2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
[3]. https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
[4]. https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
[5]. https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..95a3d6ddc6c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
LIST_HEAD(shrinker_list);
DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_SRCU(shrinker_srcu);
#ifdef CONFIG_MEMCG
static int shrinker_nr_max;
@@ -699,7 +700,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
- list_add_tail(&shrinker->list, &shrinker_list);
+ list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
up_write(&shrinker_rwsem);
@@ -753,13 +754,15 @@ void unregister_shrinker(struct shrinker *shrinker)
return;
down_write(&shrinker_rwsem);
- list_del(&shrinker->list);
+ list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_remove(shrinker);
up_write(&shrinker_rwsem);
+ synchronize_srcu(&shrinker_srcu);
+
debugfs_remove_recursive(debugfs_entry);
kfree(shrinker->nr_deferred);
@@ -779,6 +782,7 @@ void synchronize_shrinkers(void)
{
down_write(&shrinker_rwsem);
up_write(&shrinker_rwsem);
+ synchronize_srcu(&shrinker_srcu);
}
EXPORT_SYMBOL(synchronize_shrinkers);
@@ -989,6 +993,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
{
unsigned long ret, freed = 0;
struct shrinker *shrinker;
+ int srcu_idx;
/*
* The root memcg might be allocated even though memcg is disabled
@@ -1000,10 +1005,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
- if (!down_read_trylock(&shrinker_rwsem))
- goto out;
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
- list_for_each_entry(shrinker, &shrinker_list, list) {
+ list_for_each_entry_srcu(shrinker, &shrinker_list, list,
+ srcu_read_lock_held(&shrinker_srcu)) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -1014,19 +1019,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
- /*
- * Bail out if someone want to register a new shrinker to
- * prevent the registration from being stalled for long periods
- * by parallel ongoing shrinking.
- */
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
- up_read(&shrinker_rwsem);
-out:
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
cond_resched();
return freed;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-20 9:16 [PATCH 0/5] make slab shrink lockless Qi Zheng
2023-02-20 9:16 ` [PATCH 1/5] mm: vmscan: make global " Qi Zheng
@ 2023-02-20 9:16 ` Qi Zheng
2023-02-21 21:28 ` Kirill Tkhai
2023-02-21 21:43 ` Kirill Tkhai
2023-02-20 9:16 ` [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
` (2 subsequent siblings)
4 siblings, 2 replies; 16+ messages in thread
From: Qi Zheng @ 2023-02-20 9:16 UTC (permalink / raw)
To: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
david, shy828301
Cc: tkhai, sultan, dave, penguin-kernel, paulmck, linux-mm,
linux-kernel, Qi Zheng
Like global slab shrink, since commit 1cd0bd06093c
("rcu: Remove CONFIG_SRCU"), it's time to use SRCU
to protect readers who previously held shrinker_rwsem.
We can test with the following script:
```
DIR="/root/shrinker/memcg/mnt"
do_create()
{
mkdir /sys/fs/cgroup/memory/test
echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
for i in `seq 0 $1`;
do
mkdir /sys/fs/cgroup/memory/test/$i;
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
mkdir -p $DIR/$i;
done
}
do_mount()
{
for i in `seq $1 $2`;
do
mount -t tmpfs $i $DIR/$i;
done
}
do_touch()
{
for i in `seq $1 $2`;
do
echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
done
}
do_create 2000
do_mount 0 2000
do_touch 0 1000
```
Before applying:
46.60% [kernel] [k] down_read_trylock
18.70% [kernel] [k] up_read
15.44% [kernel] [k] shrink_slab
4.37% [kernel] [k] _find_next_bit
2.75% [kernel] [k] xa_load
2.07% [kernel] [k] idr_find
1.73% [kernel] [k] do_shrink_slab
1.42% [kernel] [k] shrink_lruvec
0.74% [kernel] [k] shrink_node
0.60% [kernel] [k] list_lru_count_one
After applying:
19.53% [kernel] [k] _find_next_bit
14.63% [kernel] [k] do_shrink_slab
14.58% [kernel] [k] shrink_slab
11.83% [kernel] [k] shrink_lruvec
9.33% [kernel] [k] __blk_flush_plug
6.67% [kernel] [k] mem_cgroup_iter
3.73% [kernel] [k] list_lru_count_one
2.43% [kernel] [k] shrink_node
1.96% [kernel] [k] super_cache_count
1.78% [kernel] [k] __rcu_read_unlock
1.38% [kernel] [k] __srcu_read_lock
1.30% [kernel] [k] xas_descend
We can see that the readers is no longer blocked.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 56 ++++++++++++++++++++++++++++++-----------------------
1 file changed, 32 insertions(+), 24 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 95a3d6ddc6c1..dc47396ecd0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/srcu.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
int nid)
{
- return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
- lockdep_is_held(&shrinker_rwsem));
+ return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
+ &shrinker_srcu,
+ lockdep_is_held(&shrinker_rwsem));
+}
+
+static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
+ int nid)
+{
+ return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
+ &shrinker_srcu);
+}
+
+static void free_shrinker_info_rcu(struct rcu_head *head)
+{
+ kvfree(container_of(head, struct shrinker_info, rcu));
}
static int expand_one_shrinker_info(struct mem_cgroup *memcg,
@@ -257,7 +271,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
defer_size - old_defer_size);
rcu_assign_pointer(pn->shrinker_info, new);
- kvfree_rcu(old, rcu);
+ call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
}
return 0;
@@ -350,13 +364,14 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
{
if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
struct shrinker_info *info;
+ int srcu_idx;
- rcu_read_lock();
- info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
+ info = shrinker_info_srcu(memcg, nid);
/* Pairs with smp mb in shrink_slab() */
smp_mb__before_atomic();
set_bit(shrinker_id, info->map);
- rcu_read_unlock();
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
}
}
@@ -370,7 +385,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
return -ENOSYS;
down_write(&shrinker_rwsem);
- /* This may call shrinker, so it must use down_read_trylock() */
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -404,7 +418,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
{
struct shrinker_info *info;
- info = shrinker_info_protected(memcg, nid);
+ info = shrinker_info_srcu(memcg, nid);
return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
}
@@ -413,13 +427,13 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
{
struct shrinker_info *info;
- info = shrinker_info_protected(memcg, nid);
+ info = shrinker_info_srcu(memcg, nid);
return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
}
void reparent_shrinker_deferred(struct mem_cgroup *memcg)
{
- int i, nid;
+ int i, nid, srcu_idx;
long nr;
struct mem_cgroup *parent;
struct shrinker_info *child_info, *parent_info;
@@ -429,16 +443,16 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
parent = root_mem_cgroup;
/* Prevent from concurrent shrinker_info expand */
- down_read(&shrinker_rwsem);
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
for_each_node(nid) {
- child_info = shrinker_info_protected(memcg, nid);
- parent_info = shrinker_info_protected(parent, nid);
+ child_info = shrinker_info_srcu(memcg, nid);
+ parent_info = shrinker_info_srcu(parent, nid);
for (i = 0; i < shrinker_nr_max; i++) {
nr = atomic_long_read(&child_info->nr_deferred[i]);
atomic_long_add(nr, &parent_info->nr_deferred[i]);
}
}
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
}
static bool cgroup_reclaim(struct scan_control *sc)
@@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
{
struct shrinker_info *info;
unsigned long ret, freed = 0;
+ int srcu_idx;
int i;
if (!mem_cgroup_online(memcg))
return 0;
- if (!down_read_trylock(&shrinker_rwsem))
- return 0;
-
- info = shrinker_info_protected(memcg, nid);
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
+ info = shrinker_info_srcu(memcg, nid);
if (unlikely(!info))
goto unlock;
@@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
set_shrinker_bit(memcg, nid, i);
}
freed += ret;
-
- if (rwsem_is_contended(&shrinker_rwsem)) {
- freed = freed ? : 1;
- break;
- }
}
unlock:
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
return freed;
}
#else /* CONFIG_MEMCG */
--
2.20.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless
2023-02-20 9:16 [PATCH 0/5] make slab shrink lockless Qi Zheng
2023-02-20 9:16 ` [PATCH 1/5] mm: vmscan: make global " Qi Zheng
2023-02-20 9:16 ` [PATCH 2/5] mm: vmscan: make memcg " Qi Zheng
@ 2023-02-20 9:16 ` Qi Zheng
2023-02-20 13:22 ` kernel test robot
2023-02-20 9:16 ` [PATCH 4/5] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
2023-02-20 9:16 ` [PATCH 5/5] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
4 siblings, 1 reply; 16+ messages in thread
From: Qi Zheng @ 2023-02-20 9:16 UTC (permalink / raw)
To: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
david, shy828301
Cc: tkhai, sultan, dave, penguin-kernel, paulmck, linux-mm,
linux-kernel, Qi Zheng
Like global and memcg slab shrink, also use SRCU to
make count and scan operations in memory shrinker
debugfs lockless.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/shrinker_debug.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 39c3491e28a3..6a26e8ac40aa 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -9,6 +9,7 @@
/* defined in vmscan.c */
extern struct rw_semaphore shrinker_rwsem;
extern struct list_head shrinker_list;
+extern struct srcu_struct shrinker_srcu;
static DEFINE_IDA(shrinker_debugfs_ida);
static struct dentry *shrinker_debugfs_root;
@@ -49,18 +50,13 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
struct mem_cgroup *memcg;
unsigned long total;
bool memcg_aware;
- int ret, nid;
+ int ret, nid, srcu_idx;
count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
if (!count_per_node)
return -ENOMEM;
- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
- kfree(count_per_node);
- return ret;
- }
- rcu_read_lock();
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
@@ -91,8 +87,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
}
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
- rcu_read_unlock();
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
kfree(count_per_node);
return ret;
@@ -115,9 +110,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
.gfp_mask = GFP_KERNEL,
};
struct mem_cgroup *memcg = NULL;
- int nid;
+ int nid, srcu_idx;
char kbuf[72];
- ssize_t ret;
read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1);
if (copy_from_user(kbuf, buf, read_len))
@@ -146,11 +140,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
return -EINVAL;
}
- ret = down_read_killable(&shrinker_rwsem);
- if (ret) {
- mem_cgroup_put(memcg);
- return ret;
- }
+ srcu_idx = srcu_read_lock(&shrinker_srcu);
sc.nid = nid;
sc.memcg = memcg;
@@ -159,7 +149,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
shrinker->scan_objects(shrinker, &sc);
- up_read(&shrinker_rwsem);
+ srcu_read_unlock(&shrinker_srcu, srcu_idx);
mem_cgroup_put(memcg);
return size;
--
2.20.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 4/5] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
2023-02-20 9:16 [PATCH 0/5] make slab shrink lockless Qi Zheng
` (2 preceding siblings ...)
2023-02-20 9:16 ` [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
@ 2023-02-20 9:16 ` Qi Zheng
2023-02-20 9:16 ` [PATCH 5/5] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
4 siblings, 0 replies; 16+ messages in thread
From: Qi Zheng @ 2023-02-20 9:16 UTC (permalink / raw)
To: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
david, shy828301
Cc: tkhai, sultan, dave, penguin-kernel, paulmck, linux-mm,
linux-kernel, Qi Zheng
Now there are no readers of shrinker_rwsem, so
synchronize_shrinkers() does not need to hold the
writer of shrinker_rwsem to wait for all running
shinkers to complete, synchronize_srcu() is enough.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc47396ecd0e..b8da56beba96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -787,15 +787,11 @@ EXPORT_SYMBOL(unregister_shrinker);
/**
* synchronize_shrinkers - Wait for all running shrinkers to complete.
*
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory.
*/
void synchronize_shrinkers(void)
{
- down_write(&shrinker_rwsem);
- up_write(&shrinker_rwsem);
synchronize_srcu(&shrinker_srcu);
}
EXPORT_SYMBOL(synchronize_shrinkers);
--
2.20.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 5/5] mm: shrinkers: convert shrinker_rwsem to mutex
2023-02-20 9:16 [PATCH 0/5] make slab shrink lockless Qi Zheng
` (3 preceding siblings ...)
2023-02-20 9:16 ` [PATCH 4/5] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
@ 2023-02-20 9:16 ` Qi Zheng
4 siblings, 0 replies; 16+ messages in thread
From: Qi Zheng @ 2023-02-20 9:16 UTC (permalink / raw)
To: akpm, hannes, shakeelb, mhocko, roman.gushchin, muchun.song,
david, shy828301
Cc: tkhai, sultan, dave, penguin-kernel, paulmck, linux-mm,
linux-kernel, Qi Zheng
Now there are no readers of shrinker_rwsem, so we
can simply replace it with mutex lock.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
drivers/md/dm-cache-metadata.c | 2 +-
drivers/md/dm-thin-metadata.c | 2 +-
fs/super.c | 2 +-
mm/shrinker_debug.c | 14 +++++++-------
mm/vmscan.c | 30 +++++++++++++++---------------
5 files changed, 25 insertions(+), 25 deletions(-)
diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs cmd root_lock).
- * - must take shrinker_rwsem without holding cmd->root_lock
+ * - must take shrinker_mutex without holding cmd->root_lock
*/
new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs pmd root_lock).
- * - must take shrinker_rwsem without holding pmd->root_lock
+ * - must take shrinker_mutex without holding pmd->root_lock
*/
new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 84332d5cb817..91a4037b1d95 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
* One thing we have to be careful of with a per-sb shrinker is that we don't
* drop the last active reference to the superblock from within the shrinker.
* If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 6a26e8ac40aa..4c3a5d9afb5c 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
#include <linux/memcontrol.h>
/* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
extern struct list_head shrinker_list;
extern struct srcu_struct shrinker_srcu;
@@ -167,7 +167,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
char buf[128];
int id;
- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);
/* debugfs isn't initialized yet, add debugfs entries later. */
if (!shrinker_debugfs_root)
@@ -210,7 +210,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
if (!new)
return -ENOMEM;
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
old = shrinker->name;
shrinker->name = new;
@@ -228,7 +228,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
shrinker->debugfs_entry = entry;
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
kfree_const(old);
@@ -240,7 +240,7 @@ struct dentry *shrinker_debugfs_remove(struct shrinker *shrinker)
{
struct dentry *entry = shrinker->debugfs_entry;
- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);
kfree_const(shrinker->name);
shrinker->name = NULL;
@@ -265,14 +265,14 @@ static int __init shrinker_debugfs_init(void)
shrinker_debugfs_root = dentry;
/* Create debugfs entries for shrinkers registered at boot */
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_for_each_entry(shrinker, &shrinker_list, list)
if (!shrinker->debugfs_entry) {
ret = shrinker_debugfs_add(shrinker);
if (ret)
break;
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8da56beba96..f94bfe540675 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
#include <linux/cpuset.h>
#include <linux/compaction.h>
#include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -202,7 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
}
LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);
DEFINE_SRCU(shrinker_srcu);
#ifdef CONFIG_MEMCG
@@ -224,7 +224,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
{
return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
&shrinker_srcu,
- lockdep_is_held(&shrinker_rwsem));
+ lockdep_is_held(&shrinker_mutex));
}
static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
@@ -297,7 +297,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
int nid, size, ret = 0;
int map_size, defer_size = 0;
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
map_size = shrinker_map_size(shrinker_nr_max);
defer_size = shrinker_defer_size(shrinker_nr_max);
size = map_size + defer_size;
@@ -312,7 +312,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
info->map = (void *)info->nr_deferred + defer_size;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}
@@ -337,7 +337,7 @@ static int expand_shrinker_info(int new_id)
if (!root_mem_cgroup)
goto out;
- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);
map_size = shrinker_map_size(new_nr_max);
defer_size = shrinker_defer_size(new_nr_max);
@@ -384,7 +384,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
if (mem_cgroup_disabled())
return -ENOSYS;
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;
@@ -398,7 +398,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
shrinker->id = id;
ret = 0;
unlock:
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return ret;
}
@@ -408,7 +408,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
BUG_ON(id < 0);
- lockdep_assert_held(&shrinker_rwsem);
+ lockdep_assert_held(&shrinker_mutex);
idr_remove(&shrinker_idr, id);
}
@@ -701,9 +701,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
shrinker->name = NULL;
#endif
if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
unregister_memcg_shrinker(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
return;
}
@@ -713,11 +713,11 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
void register_shrinker_prepared(struct shrinker *shrinker)
{
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_add_tail_rcu(&shrinker->list, &shrinker_list);
shrinker->flags |= SHRINKER_REGISTERED;
shrinker_debugfs_add(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
}
static int __register_shrinker(struct shrinker *shrinker)
@@ -767,13 +767,13 @@ void unregister_shrinker(struct shrinker *shrinker)
if (!(shrinker->flags & SHRINKER_REGISTERED))
return;
- down_write(&shrinker_rwsem);
+ mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
shrinker->flags &= ~SHRINKER_REGISTERED;
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_remove(shrinker);
- up_write(&shrinker_rwsem);
+ mutex_unlock(&shrinker_mutex);
synchronize_srcu(&shrinker_srcu);
--
2.20.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless
2023-02-20 9:16 ` [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
@ 2023-02-20 13:22 ` kernel test robot
0 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2023-02-20 13:22 UTC (permalink / raw)
To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
muchun.song, david, shy828301
Cc: llvm, oe-kbuild-all, tkhai, sultan, dave, penguin-kernel,
paulmck, linux-mm, linux-kernel, Qi Zheng
Hi Qi,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20230220]
[cannot apply to device-mapper-dm/for-next linus/master v6.2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-vmscan-make-global-slab-shrink-lockless/20230220-171954
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230220091637.64865-4-zhengqi.arch%40bytedance.com
patch subject: [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless
config: riscv-randconfig-r036-20230219 (https://download.01.org/0day-ci/archive/20230220/202302202134.OFjSh3rl-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install riscv cross compiling tool for clang build
# apt-get install binutils-riscv64-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/b5b7259339a7a5cfae5a120356750c5a9e151d12
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Qi-Zheng/mm-vmscan-make-global-slab-shrink-lockless/20230220-171954
git checkout b5b7259339a7a5cfae5a120356750c5a9e151d12
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302202134.OFjSh3rl-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/shrinker_debug.c:88:11: warning: variable 'ret' is used uninitialized whenever 'do' loop exits because its condition is false [-Wsometimes-uninitialized]
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/shrinker_debug.c:93:9: note: uninitialized use occurs here
return ret;
^~~
mm/shrinker_debug.c:88:11: note: remove the condition if it is always true
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1
>> mm/shrinker_debug.c:78:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
if (!memcg_aware) {
^~~~~~~~~~~~
mm/shrinker_debug.c:93:9: note: uninitialized use occurs here
return ret;
^~~
mm/shrinker_debug.c:78:3: note: remove the 'if' if its condition is always false
if (!memcg_aware) {
^~~~~~~~~~~~~~~~~~~
mm/shrinker_debug.c:53:9: note: initialize the variable 'ret' to silence this warning
int ret, nid, srcu_idx;
^
= 0
2 warnings generated.
vim +88 mm/shrinker_debug.c
5035ebc644aec9 Roman Gushchin 2022-05-31 45
5035ebc644aec9 Roman Gushchin 2022-05-31 46 static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
5035ebc644aec9 Roman Gushchin 2022-05-31 47 {
5035ebc644aec9 Roman Gushchin 2022-05-31 48 struct shrinker *shrinker = m->private;
5035ebc644aec9 Roman Gushchin 2022-05-31 49 unsigned long *count_per_node;
5035ebc644aec9 Roman Gushchin 2022-05-31 50 struct mem_cgroup *memcg;
5035ebc644aec9 Roman Gushchin 2022-05-31 51 unsigned long total;
5035ebc644aec9 Roman Gushchin 2022-05-31 52 bool memcg_aware;
b5b7259339a7a5 Qi Zheng 2023-02-20 53 int ret, nid, srcu_idx;
5035ebc644aec9 Roman Gushchin 2022-05-31 54
5035ebc644aec9 Roman Gushchin 2022-05-31 55 count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
5035ebc644aec9 Roman Gushchin 2022-05-31 56 if (!count_per_node)
5035ebc644aec9 Roman Gushchin 2022-05-31 57 return -ENOMEM;
5035ebc644aec9 Roman Gushchin 2022-05-31 58
b5b7259339a7a5 Qi Zheng 2023-02-20 59 srcu_idx = srcu_read_lock(&shrinker_srcu);
5035ebc644aec9 Roman Gushchin 2022-05-31 60
5035ebc644aec9 Roman Gushchin 2022-05-31 61 memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE;
5035ebc644aec9 Roman Gushchin 2022-05-31 62
5035ebc644aec9 Roman Gushchin 2022-05-31 63 memcg = mem_cgroup_iter(NULL, NULL, NULL);
5035ebc644aec9 Roman Gushchin 2022-05-31 64 do {
5035ebc644aec9 Roman Gushchin 2022-05-31 65 if (memcg && !mem_cgroup_online(memcg))
5035ebc644aec9 Roman Gushchin 2022-05-31 66 continue;
5035ebc644aec9 Roman Gushchin 2022-05-31 67
5035ebc644aec9 Roman Gushchin 2022-05-31 68 total = shrinker_count_objects(shrinker,
5035ebc644aec9 Roman Gushchin 2022-05-31 69 memcg_aware ? memcg : NULL,
5035ebc644aec9 Roman Gushchin 2022-05-31 70 count_per_node);
5035ebc644aec9 Roman Gushchin 2022-05-31 71 if (total) {
5035ebc644aec9 Roman Gushchin 2022-05-31 72 seq_printf(m, "%lu", mem_cgroup_ino(memcg));
5035ebc644aec9 Roman Gushchin 2022-05-31 73 for_each_node(nid)
5035ebc644aec9 Roman Gushchin 2022-05-31 74 seq_printf(m, " %lu", count_per_node[nid]);
5035ebc644aec9 Roman Gushchin 2022-05-31 75 seq_putc(m, '\n');
5035ebc644aec9 Roman Gushchin 2022-05-31 76 }
5035ebc644aec9 Roman Gushchin 2022-05-31 77
5035ebc644aec9 Roman Gushchin 2022-05-31 @78 if (!memcg_aware) {
5035ebc644aec9 Roman Gushchin 2022-05-31 79 mem_cgroup_iter_break(NULL, memcg);
5035ebc644aec9 Roman Gushchin 2022-05-31 80 break;
5035ebc644aec9 Roman Gushchin 2022-05-31 81 }
5035ebc644aec9 Roman Gushchin 2022-05-31 82
5035ebc644aec9 Roman Gushchin 2022-05-31 83 if (signal_pending(current)) {
5035ebc644aec9 Roman Gushchin 2022-05-31 84 mem_cgroup_iter_break(NULL, memcg);
5035ebc644aec9 Roman Gushchin 2022-05-31 85 ret = -EINTR;
5035ebc644aec9 Roman Gushchin 2022-05-31 86 break;
5035ebc644aec9 Roman Gushchin 2022-05-31 87 }
5035ebc644aec9 Roman Gushchin 2022-05-31 @88 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
5035ebc644aec9 Roman Gushchin 2022-05-31 89
b5b7259339a7a5 Qi Zheng 2023-02-20 90 srcu_read_unlock(&shrinker_srcu, srcu_idx);
5035ebc644aec9 Roman Gushchin 2022-05-31 91
5035ebc644aec9 Roman Gushchin 2022-05-31 92 kfree(count_per_node);
5035ebc644aec9 Roman Gushchin 2022-05-31 @93 return ret;
5035ebc644aec9 Roman Gushchin 2022-05-31 94 }
5035ebc644aec9 Roman Gushchin 2022-05-31 95 DEFINE_SHOW_ATTRIBUTE(shrinker_debugfs_count);
5035ebc644aec9 Roman Gushchin 2022-05-31 96
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-20 9:16 ` [PATCH 2/5] mm: vmscan: make memcg " Qi Zheng
@ 2023-02-21 21:28 ` Kirill Tkhai
2023-02-22 7:32 ` Qi Zheng
2023-02-21 21:43 ` Kirill Tkhai
1 sibling, 1 reply; 16+ messages in thread
From: Kirill Tkhai @ 2023-02-21 21:28 UTC (permalink / raw)
To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
muchun.song, david, shy828301
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel
On 20.02.2023 12:16, Qi Zheng wrote:
> Like global slab shrink, since commit 1cd0bd06093c
> ("rcu: Remove CONFIG_SRCU"), it's time to use SRCU
> to protect readers who previously held shrinker_rwsem.
>
> We can test with the following script:
>
> ```
> DIR="/root/shrinker/memcg/mnt"
>
> do_create()
> {
> mkdir /sys/fs/cgroup/memory/test
> echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> for i in `seq 0 $1`;
> do
> mkdir /sys/fs/cgroup/memory/test/$i;
> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
> mkdir -p $DIR/$i;
> done
> }
>
> do_mount()
> {
> for i in `seq $1 $2`;
> do
> mount -t tmpfs $i $DIR/$i;
> done
> }
>
> do_touch()
> {
> for i in `seq $1 $2`;
> do
> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
> dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
> done
> }
>
> do_create 2000
> do_mount 0 2000
> do_touch 0 1000
> ```
>
> Before applying:
>
> 46.60% [kernel] [k] down_read_trylock
> 18.70% [kernel] [k] up_read
> 15.44% [kernel] [k] shrink_slab
> 4.37% [kernel] [k] _find_next_bit
> 2.75% [kernel] [k] xa_load
> 2.07% [kernel] [k] idr_find
> 1.73% [kernel] [k] do_shrink_slab
> 1.42% [kernel] [k] shrink_lruvec
> 0.74% [kernel] [k] shrink_node
> 0.60% [kernel] [k] list_lru_count_one
>
> After applying:
>
> 19.53% [kernel] [k] _find_next_bit
> 14.63% [kernel] [k] do_shrink_slab
> 14.58% [kernel] [k] shrink_slab
> 11.83% [kernel] [k] shrink_lruvec
> 9.33% [kernel] [k] __blk_flush_plug
> 6.67% [kernel] [k] mem_cgroup_iter
> 3.73% [kernel] [k] list_lru_count_one
> 2.43% [kernel] [k] shrink_node
> 1.96% [kernel] [k] super_cache_count
> 1.78% [kernel] [k] __rcu_read_unlock
> 1.38% [kernel] [k] __srcu_read_lock
> 1.30% [kernel] [k] xas_descend
>
> We can see that the readers is no longer blocked.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/vmscan.c | 56 ++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 32 insertions(+), 24 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 95a3d6ddc6c1..dc47396ecd0e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
> #include <linux/khugepaged.h>
> #include <linux/rculist_nulls.h>
> #include <linux/random.h>
> +#include <linux/srcu.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
> static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> int nid)
> {
> - return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> - lockdep_is_held(&shrinker_rwsem));
> + return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
> + &shrinker_srcu,
> + lockdep_is_held(&shrinker_rwsem));
> +}
> +
> +static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
> + int nid)
> +{
> + return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
> + &shrinker_srcu);
> +}
> +
> +static void free_shrinker_info_rcu(struct rcu_head *head)
> +{
> + kvfree(container_of(head, struct shrinker_info, rcu));
> }
>
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> @@ -257,7 +271,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> defer_size - old_defer_size);
>
> rcu_assign_pointer(pn->shrinker_info, new);
> - kvfree_rcu(old, rcu);
> + call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
> }
>
> return 0;
> @@ -350,13 +364,14 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> {
> if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
> struct shrinker_info *info;
> + int srcu_idx;
>
> - rcu_read_lock();
> - info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> + srcu_idx = srcu_read_lock(&shrinker_srcu);
> + info = shrinker_info_srcu(memcg, nid);
> /* Pairs with smp mb in shrink_slab() */
> smp_mb__before_atomic();
> set_bit(shrinker_id, info->map);
> - rcu_read_unlock();
> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
> }
> }
>
> @@ -370,7 +385,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> return -ENOSYS;
>
> down_write(&shrinker_rwsem);
> - /* This may call shrinker, so it must use down_read_trylock() */
> id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> if (id < 0)
> goto unlock;
> @@ -404,7 +418,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> {
> struct shrinker_info *info;
>
> - info = shrinker_info_protected(memcg, nid);
> + info = shrinker_info_srcu(memcg, nid);
> return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
> }
>
> @@ -413,13 +427,13 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> {
> struct shrinker_info *info;
>
> - info = shrinker_info_protected(memcg, nid);
> + info = shrinker_info_srcu(memcg, nid);
> return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> }
>
> void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> {
> - int i, nid;
> + int i, nid, srcu_idx;
> long nr;
> struct mem_cgroup *parent;
> struct shrinker_info *child_info, *parent_info;
> @@ -429,16 +443,16 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> parent = root_mem_cgroup;
>
> /* Prevent from concurrent shrinker_info expand */
> - down_read(&shrinker_rwsem);
> + srcu_idx = srcu_read_lock(&shrinker_srcu);
Don't we still have to be protected against parallel expand_one_shrinker_info()?
It looks like parent->nodeinfo[nid]->shrinker_info pointer may be changed in expand*
right after we've dereferenced it here.
> for_each_node(nid) {
> - child_info = shrinker_info_protected(memcg, nid);
> - parent_info = shrinker_info_protected(parent, nid);
> + child_info = shrinker_info_srcu(memcg, nid);
> + parent_info = shrinker_info_srcu(parent, nid);
> for (i = 0; i < shrinker_nr_max; i++) {
> nr = atomic_long_read(&child_info->nr_deferred[i]);
> atomic_long_add(nr, &parent_info->nr_deferred[i]);
> }
> }
> - up_read(&shrinker_rwsem);
> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
> }
>
> static bool cgroup_reclaim(struct scan_control *sc)
> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> {
> struct shrinker_info *info;
> unsigned long ret, freed = 0;
> + int srcu_idx;
> int i;
>
> if (!mem_cgroup_online(memcg))
> return 0;
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - return 0;
> -
> - info = shrinker_info_protected(memcg, nid);
> + srcu_idx = srcu_read_lock(&shrinker_srcu);
> + info = shrinker_info_srcu(memcg, nid);
> if (unlikely(!info))
> goto unlock;
>
> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> set_shrinker_bit(memcg, nid, i);
> }
> freed += ret;
> -
> - if (rwsem_is_contended(&shrinker_rwsem)) {
> - freed = freed ? : 1;
> - break;
> - }
> }
> unlock:
> - up_read(&shrinker_rwsem);
> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
> return freed;
> }
> #else /* CONFIG_MEMCG */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-20 9:16 ` [PATCH 2/5] mm: vmscan: make memcg " Qi Zheng
2023-02-21 21:28 ` Kirill Tkhai
@ 2023-02-21 21:43 ` Kirill Tkhai
2023-02-22 8:16 ` Qi Zheng
1 sibling, 1 reply; 16+ messages in thread
From: Kirill Tkhai @ 2023-02-21 21:43 UTC (permalink / raw)
To: Qi Zheng, akpm, hannes, shakeelb, mhocko, roman.gushchin,
muchun.song, david, shy828301
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel
On 20.02.2023 12:16, Qi Zheng wrote:
> Like global slab shrink, since commit 1cd0bd06093c
> ("rcu: Remove CONFIG_SRCU"), it's time to use SRCU
> to protect readers who previously held shrinker_rwsem.
>
> We can test with the following script:
>
> ```
> DIR="/root/shrinker/memcg/mnt"
>
> do_create()
> {
> mkdir /sys/fs/cgroup/memory/test
> echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> for i in `seq 0 $1`;
> do
> mkdir /sys/fs/cgroup/memory/test/$i;
> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
> mkdir -p $DIR/$i;
> done
> }
>
> do_mount()
> {
> for i in `seq $1 $2`;
> do
> mount -t tmpfs $i $DIR/$i;
> done
> }
>
> do_touch()
> {
> for i in `seq $1 $2`;
> do
> echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
> dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
> done
> }
>
> do_create 2000
> do_mount 0 2000
> do_touch 0 1000
> ```
>
> Before applying:
>
> 46.60% [kernel] [k] down_read_trylock
> 18.70% [kernel] [k] up_read
> 15.44% [kernel] [k] shrink_slab
> 4.37% [kernel] [k] _find_next_bit
> 2.75% [kernel] [k] xa_load
> 2.07% [kernel] [k] idr_find
> 1.73% [kernel] [k] do_shrink_slab
> 1.42% [kernel] [k] shrink_lruvec
> 0.74% [kernel] [k] shrink_node
> 0.60% [kernel] [k] list_lru_count_one
>
> After applying:
>
> 19.53% [kernel] [k] _find_next_bit
> 14.63% [kernel] [k] do_shrink_slab
> 14.58% [kernel] [k] shrink_slab
> 11.83% [kernel] [k] shrink_lruvec
> 9.33% [kernel] [k] __blk_flush_plug
> 6.67% [kernel] [k] mem_cgroup_iter
> 3.73% [kernel] [k] list_lru_count_one
> 2.43% [kernel] [k] shrink_node
> 1.96% [kernel] [k] super_cache_count
> 1.78% [kernel] [k] __rcu_read_unlock
> 1.38% [kernel] [k] __srcu_read_lock
> 1.30% [kernel] [k] xas_descend
>
> We can see that the readers is no longer blocked.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/vmscan.c | 56 ++++++++++++++++++++++++++++++-----------------------
> 1 file changed, 32 insertions(+), 24 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 95a3d6ddc6c1..dc47396ecd0e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
> #include <linux/khugepaged.h>
> #include <linux/rculist_nulls.h>
> #include <linux/random.h>
> +#include <linux/srcu.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -221,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
> static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> int nid)
> {
> - return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
> - lockdep_is_held(&shrinker_rwsem));
> + return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
> + &shrinker_srcu,
> + lockdep_is_held(&shrinker_rwsem));
> +}
> +
> +static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
> + int nid)
> +{
> + return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
> + &shrinker_srcu);
> +}
> +
> +static void free_shrinker_info_rcu(struct rcu_head *head)
> +{
> + kvfree(container_of(head, struct shrinker_info, rcu));
> }
>
> static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> @@ -257,7 +271,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> defer_size - old_defer_size);
>
> rcu_assign_pointer(pn->shrinker_info, new);
> - kvfree_rcu(old, rcu);
> + call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
> }
>
> return 0;
> @@ -350,13 +364,14 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> {
> if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
> struct shrinker_info *info;
> + int srcu_idx;
>
> - rcu_read_lock();
> - info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
> + srcu_idx = srcu_read_lock(&shrinker_srcu);
> + info = shrinker_info_srcu(memcg, nid);
> /* Pairs with smp mb in shrink_slab() */
> smp_mb__before_atomic();
> set_bit(shrinker_id, info->map);
> - rcu_read_unlock();
> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
> }
> }
>
> @@ -370,7 +385,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> return -ENOSYS;
>
> down_write(&shrinker_rwsem);
> - /* This may call shrinker, so it must use down_read_trylock() */
> id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> if (id < 0)
> goto unlock;
> @@ -404,7 +418,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
> {
> struct shrinker_info *info;
>
> - info = shrinker_info_protected(memcg, nid);
> + info = shrinker_info_srcu(memcg, nid);
> return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
> }
>
> @@ -413,13 +427,13 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
> {
> struct shrinker_info *info;
>
> - info = shrinker_info_protected(memcg, nid);
> + info = shrinker_info_srcu(memcg, nid);
> return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
> }
>
> void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> {
> - int i, nid;
> + int i, nid, srcu_idx;
> long nr;
> struct mem_cgroup *parent;
> struct shrinker_info *child_info, *parent_info;
> @@ -429,16 +443,16 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
> parent = root_mem_cgroup;
>
> /* Prevent from concurrent shrinker_info expand */
> - down_read(&shrinker_rwsem);
> + srcu_idx = srcu_read_lock(&shrinker_srcu);
> for_each_node(nid) {
> - child_info = shrinker_info_protected(memcg, nid);
> - parent_info = shrinker_info_protected(parent, nid);
> + child_info = shrinker_info_srcu(memcg, nid);
> + parent_info = shrinker_info_srcu(parent, nid);
> for (i = 0; i < shrinker_nr_max; i++) {
> nr = atomic_long_read(&child_info->nr_deferred[i]);
> atomic_long_add(nr, &parent_info->nr_deferred[i]);
> }
> }
> - up_read(&shrinker_rwsem);
> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
> }
>
> static bool cgroup_reclaim(struct scan_control *sc)
> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> {
> struct shrinker_info *info;
> unsigned long ret, freed = 0;
> + int srcu_idx;
> int i;
>
> if (!mem_cgroup_online(memcg))
> return 0;
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - return 0;
> -
> - info = shrinker_info_protected(memcg, nid);
> + srcu_idx = srcu_read_lock(&shrinker_srcu);
> + info = shrinker_info_srcu(memcg, nid);
> if (unlikely(!info))
> goto unlock;
There is shrinker_nr_max dereference under this hunk. It's not in the patch:
for_each_set_bit(i, info->map, shrinker_nr_max) {
Since shrinker_nr_max may grow in parallel, this leads to access beyond allocated memory :(
It looks like we should save size of info->map as a new member of struct shrinker_info.
>
> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> set_shrinker_bit(memcg, nid, i);
> }
> freed += ret;
> -
> - if (rwsem_is_contended(&shrinker_rwsem)) {
> - freed = freed ? : 1;
> - break;
> - }
> }
> unlock:
> - up_read(&shrinker_rwsem);
> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
> return freed;
> }
> #else /* CONFIG_MEMCG */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-21 21:28 ` Kirill Tkhai
@ 2023-02-22 7:32 ` Qi Zheng
2023-02-22 19:58 ` Kirill Tkhai
0 siblings, 1 reply; 16+ messages in thread
From: Qi Zheng @ 2023-02-22 7:32 UTC (permalink / raw)
To: Kirill Tkhai
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi
On 2023/2/22 05:28, Kirill Tkhai wrote:
> On 20.02.2023 12:16, Qi Zheng wrote:
<...>
>>
>> void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>> {
>> - int i, nid;
>> + int i, nid, srcu_idx;
>> long nr;
>> struct mem_cgroup *parent;
>> struct shrinker_info *child_info, *parent_info;
>> @@ -429,16 +443,16 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>> parent = root_mem_cgroup;
>>
>> /* Prevent from concurrent shrinker_info expand */
>> - down_read(&shrinker_rwsem);
>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>
> Don't we still have to be protected against parallel expand_one_shrinker_info()?
>
> It looks like parent->nodeinfo[nid]->shrinker_info pointer may be changed in expand*
> right after we've dereferenced it here.
Hi Kirill,
Oh, indeed. We may wrongly reparent the child's nr_deferred to the old
parent's nr_deferred (it is about to be freed by call_srcu).
The reparent_shrinker_deferred() will only be called on the offline
path (not a hotspot path), so we may be able to use shrinker_mutex
introduced later for protection. What do you think?
Thanks,
Qi
>
>> for_each_node(nid) {
>> - child_info = shrinker_info_protected(memcg, nid);
>> - parent_info = shrinker_info_protected(parent, nid);
>> + child_info = shrinker_info_srcu(memcg, nid);
>> + parent_info = shrinker_info_srcu(parent, nid);
>> for (i = 0; i < shrinker_nr_max; i++) {
>> nr = atomic_long_read(&child_info->nr_deferred[i]);
>> atomic_long_add(nr, &parent_info->nr_deferred[i]);
>> }
>> }
>> - up_read(&shrinker_rwsem);
>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>> }
>>
>> static bool cgroup_reclaim(struct scan_control *sc)
>> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> {
>> struct shrinker_info *info;
>> unsigned long ret, freed = 0;
>> + int srcu_idx;
>> int i;
>>
>> if (!mem_cgroup_online(memcg))
>> return 0;
>>
>> - if (!down_read_trylock(&shrinker_rwsem))
>> - return 0;
>> -
>> - info = shrinker_info_protected(memcg, nid);
>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>> + info = shrinker_info_srcu(memcg, nid);
>> if (unlikely(!info))
>> goto unlock;
>>
>> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> set_shrinker_bit(memcg, nid, i);
>> }
>> freed += ret;
>> -
>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>> - freed = freed ? : 1;
>> - break;
>> - }
>> }
>> unlock:
>> - up_read(&shrinker_rwsem);
>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>> return freed;
>> }
>> #else /* CONFIG_MEMCG */
>
--
Thanks,
Qi
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-21 21:43 ` Kirill Tkhai
@ 2023-02-22 8:16 ` Qi Zheng
2023-02-22 8:21 ` Qi Zheng
0 siblings, 1 reply; 16+ messages in thread
From: Qi Zheng @ 2023-02-22 8:16 UTC (permalink / raw)
To: Kirill Tkhai
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi
Hi Kirill,
On 2023/2/22 05:43, Kirill Tkhai wrote:
> On 20.02.2023 12:16, Qi Zheng wrote:
>> Like global slab shrink, since commit 1cd0bd06093c<...>
>> static bool cgroup_reclaim(struct scan_control *sc)
>> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> {
>> struct shrinker_info *info;
>> unsigned long ret, freed = 0;
>> + int srcu_idx;
>> int i;
>>
>> if (!mem_cgroup_online(memcg))
>> return 0;
>>
>> - if (!down_read_trylock(&shrinker_rwsem))
>> - return 0;
>> -
>> - info = shrinker_info_protected(memcg, nid);
>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>> + info = shrinker_info_srcu(memcg, nid);
>> if (unlikely(!info))
>> goto unlock;
>
> There is shrinker_nr_max dereference under this hunk. It's not in the patch:
>
> for_each_set_bit(i, info->map, shrinker_nr_max) {
>
> Since shrinker_nr_max may grow in parallel, this leads to access beyond allocated memory :(
Oh, indeed.
>
> It looks like we should save size of info->map as a new member of struct shrinker_info.
Agree, then we only traverse info->map_size each time. Like below:
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..f1b5d2803007 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,7 @@ struct shrinker_info {
struct rcu_head rcu;
atomic_long_t *nr_deferred;
unsigned long *map;
+ int map_size;
};
struct lruvec_stats_percpu {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f94bfe540675..dd07eb107915 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -239,14 +239,20 @@ static void free_shrinker_info_rcu(struct rcu_head
*head)
kvfree(container_of(head, struct shrinker_info, rcu));
}
-static int expand_one_shrinker_info(struct mem_cgroup *memcg,
- int map_size, int defer_size,
- int old_map_size, int old_defer_size)
+static int expand_one_shrinker_info(struct mem_cgroup *memcg, int
new_nr_max,
+ int old_nr_max)
{
+ int map_size, defer_size, old_map_size, old_defer_size;
struct shrinker_info *new, *old;
struct mem_cgroup_per_node *pn;
int nid;
- int size = map_size + defer_size;
+ int size;
+
+ map_size = shrinker_map_size(new_nr_max);
+ defer_size = shrinker_defer_size(new_nr_max);
+ old_map_size = shrinker_map_size(shrinker_nr_max);
+ old_defer_size = shrinker_defer_size(shrinker_nr_max);
+ size = map_size + defer_size;
for_each_node(nid) {
pn = memcg->nodeinfo[nid];
@@ -261,6 +267,7 @@ static int expand_one_shrinker_info(struct
mem_cgroup *memcg,
new->nr_deferred = (atomic_long_t *)(new + 1);
new->map = (void *)new->nr_deferred + defer_size;
+ new->map_size = new_nr_max;
/* map: set all old bits, clear all new bits */
memset(new->map, (int)0xff, old_map_size);
@@ -310,6 +317,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
}
info->nr_deferred = (atomic_long_t *)(info + 1);
info->map = (void *)info->nr_deferred + defer_size;
+ info->map_size = shrinker_nr_max;
rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info,
info);
}
mutex_unlock(&shrinker_mutex);
@@ -327,8 +335,6 @@ static int expand_shrinker_info(int new_id)
{
int ret = 0;
int new_nr_max = new_id + 1;
- int map_size, defer_size = 0;
- int old_map_size, old_defer_size = 0;
struct mem_cgroup *memcg;
if (!need_expand(new_nr_max))
@@ -339,15 +345,9 @@ static int expand_shrinker_info(int new_id)
lockdep_assert_held(&shrinker_mutex);
- map_size = shrinker_map_size(new_nr_max);
- defer_size = shrinker_defer_size(new_nr_max);
- old_map_size = shrinker_map_size(shrinker_nr_max);
- old_defer_size = shrinker_defer_size(shrinker_nr_max);
-
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- ret = expand_one_shrinker_info(memcg, map_size, defer_size,
- old_map_size,
old_defer_size);
+ ret = expand_one_shrinker_info(memcg, new_nr_max,
shrinker_nr_max);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
goto out;
@@ -912,7 +912,7 @@ static unsigned long shrink_slab_memcg(gfp_t
gfp_mask, int nid,
if (unlikely(!info))
goto unlock;
- for_each_set_bit(i, info->map, shrinker_nr_max) {
+ for_each_set_bit(i, info->map, info->map_size) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
I will send the v2.
Thanks,
Qi
>
>>
>> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> set_shrinker_bit(memcg, nid, i);
>> }
>> freed += ret;
>> -
>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>> - freed = freed ? : 1;
>> - break;
>> - }
>> }
>> unlock:
>> - up_read(&shrinker_rwsem);
>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>> return freed;
>> }
>> #else /* CONFIG_MEMCG */
>
--
Thanks,
Qi
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-22 8:16 ` Qi Zheng
@ 2023-02-22 8:21 ` Qi Zheng
2023-02-22 20:05 ` Kirill Tkhai
0 siblings, 1 reply; 16+ messages in thread
From: Qi Zheng @ 2023-02-22 8:21 UTC (permalink / raw)
To: Kirill Tkhai
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi
On 2023/2/22 16:16, Qi Zheng wrote:
> Hi Kirill,
>
> On 2023/2/22 05:43, Kirill Tkhai wrote:
>> On 20.02.2023 12:16, Qi Zheng wrote:
>>> Like global slab shrink, since commit 1cd0bd06093c<...>
>>> static bool cgroup_reclaim(struct scan_control *sc)
>>> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t
>>> gfp_mask, int nid,
>>> {
>>> struct shrinker_info *info;
>>> unsigned long ret, freed = 0;
>>> + int srcu_idx;
>>> int i;
>>> if (!mem_cgroup_online(memcg))
>>> return 0;
>>> - if (!down_read_trylock(&shrinker_rwsem))
>>> - return 0;
>>> -
>>> - info = shrinker_info_protected(memcg, nid);
>>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>>> + info = shrinker_info_srcu(memcg, nid);
>>> if (unlikely(!info))
>>> goto unlock;
>>
>> There is shrinker_nr_max dereference under this hunk. It's not in the
>> patch:
>>
>> for_each_set_bit(i, info->map, shrinker_nr_max) {
>>
>> Since shrinker_nr_max may grow in parallel, this leads to access
>> beyond allocated memory :(
>
> Oh, indeed.
>
>>
>> It looks like we should save size of info->map as a new member of
>> struct shrinker_info.
>
> Agree, then we only traverse info->map_size each time. Like below:
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b6eda2ab205d..f1b5d2803007 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -97,6 +97,7 @@ struct shrinker_info {
> struct rcu_head rcu;
> atomic_long_t *nr_deferred;
> unsigned long *map;
> + int map_size;
> };
>
> struct lruvec_stats_percpu {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f94bfe540675..dd07eb107915 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -239,14 +239,20 @@ static void free_shrinker_info_rcu(struct rcu_head
> *head)
> kvfree(container_of(head, struct shrinker_info, rcu));
> }
>
> -static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> - int map_size, int defer_size,
> - int old_map_size, int old_defer_size)
> +static int expand_one_shrinker_info(struct mem_cgroup *memcg, int
> new_nr_max,
> + int old_nr_max)
> {
> + int map_size, defer_size, old_map_size, old_defer_size;
> struct shrinker_info *new, *old;
> struct mem_cgroup_per_node *pn;
> int nid;
> - int size = map_size + defer_size;
> + int size;
> +
> + map_size = shrinker_map_size(new_nr_max);
> + defer_size = shrinker_defer_size(new_nr_max);
> + old_map_size = shrinker_map_size(shrinker_nr_max);
> + old_defer_size = shrinker_defer_size(shrinker_nr_max);
Perhaps these should still be calculated outside the loop as before.
> + size = map_size + defer_size;
>
> for_each_node(nid) {
> pn = memcg->nodeinfo[nid];
> @@ -261,6 +267,7 @@ static int expand_one_shrinker_info(struct
> mem_cgroup *memcg,
>
> new->nr_deferred = (atomic_long_t *)(new + 1);
> new->map = (void *)new->nr_deferred + defer_size;
> + new->map_size = new_nr_max;
>
> /* map: set all old bits, clear all new bits */
> memset(new->map, (int)0xff, old_map_size);
> @@ -310,6 +317,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> }
> info->nr_deferred = (atomic_long_t *)(info + 1);
> info->map = (void *)info->nr_deferred + defer_size;
> + info->map_size = shrinker_nr_max;
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info,
> info);
> }
> mutex_unlock(&shrinker_mutex);
> @@ -327,8 +335,6 @@ static int expand_shrinker_info(int new_id)
> {
> int ret = 0;
> int new_nr_max = new_id + 1;
> - int map_size, defer_size = 0;
> - int old_map_size, old_defer_size = 0;
> struct mem_cgroup *memcg;
>
> if (!need_expand(new_nr_max))
> @@ -339,15 +345,9 @@ static int expand_shrinker_info(int new_id)
>
> lockdep_assert_held(&shrinker_mutex);
>
> - map_size = shrinker_map_size(new_nr_max);
> - defer_size = shrinker_defer_size(new_nr_max);
> - old_map_size = shrinker_map_size(shrinker_nr_max);
> - old_defer_size = shrinker_defer_size(shrinker_nr_max);
> -
> memcg = mem_cgroup_iter(NULL, NULL, NULL);
> do {
> - ret = expand_one_shrinker_info(memcg, map_size, defer_size,
> - old_map_size,
> old_defer_size);
> + ret = expand_one_shrinker_info(memcg, new_nr_max,
> shrinker_nr_max);
> if (ret) {
> mem_cgroup_iter_break(NULL, memcg);
> goto out;
> @@ -912,7 +912,7 @@ static unsigned long shrink_slab_memcg(gfp_t
> gfp_mask, int nid,
> if (unlikely(!info))
> goto unlock;
>
> - for_each_set_bit(i, info->map, shrinker_nr_max) {
> + for_each_set_bit(i, info->map, info->map_size) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,
>
> I will send the v2.
>
> Thanks,
> Qi
>
>>
>>> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t
>>> gfp_mask, int nid,
>>> set_shrinker_bit(memcg, nid, i);
>>> }
>>> freed += ret;
>>> -
>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>> - freed = freed ? : 1;
>>> - break;
>>> - }
>>> }
>>> unlock:
>>> - up_read(&shrinker_rwsem);
>>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>> return freed;
>>> }
>>> #else /* CONFIG_MEMCG */
>>
>
--
Thanks,
Qi
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-22 7:32 ` Qi Zheng
@ 2023-02-22 19:58 ` Kirill Tkhai
2023-02-23 4:36 ` Qi Zheng
0 siblings, 1 reply; 16+ messages in thread
From: Kirill Tkhai @ 2023-02-22 19:58 UTC (permalink / raw)
To: Qi Zheng
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi
On 22.02.2023 10:32, Qi Zheng wrote:
>
>
> On 2023/2/22 05:28, Kirill Tkhai wrote:
>> On 20.02.2023 12:16, Qi Zheng wrote:
> <...>
>>> void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>> {
>>> - int i, nid;
>>> + int i, nid, srcu_idx;
>>> long nr;
>>> struct mem_cgroup *parent;
>>> struct shrinker_info *child_info, *parent_info;
>>> @@ -429,16 +443,16 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>> parent = root_mem_cgroup;
>>> /* Prevent from concurrent shrinker_info expand */
>>> - down_read(&shrinker_rwsem);
>>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>>
>> Don't we still have to be protected against parallel expand_one_shrinker_info()?
>>
>> It looks like parent->nodeinfo[nid]->shrinker_info pointer may be changed in expand*
>> right after we've dereferenced it here.
>
> Hi Kirill,
>
> Oh, indeed. We may wrongly reparent the child's nr_deferred to the old
> parent's nr_deferred (it is about to be freed by call_srcu).
>
> The reparent_shrinker_deferred() will only be called on the offline
> path (not a hotspot path), so we may be able to use shrinker_mutex
> introduced later for protection. What do you think?
It looks good for me. One more thing I'd analyzed is whether we want to have
is two reparent_shrinker_deferred() are executing in parallel.
Possible, we should leave rwsem there as it was used before..
>>
>>> for_each_node(nid) {
>>> - child_info = shrinker_info_protected(memcg, nid);
>>> - parent_info = shrinker_info_protected(parent, nid);
>>> + child_info = shrinker_info_srcu(memcg, nid);
>>> + parent_info = shrinker_info_srcu(parent, nid);
>>> for (i = 0; i < shrinker_nr_max; i++) {
>>> nr = atomic_long_read(&child_info->nr_deferred[i]);
>>> atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>> }
>>> }
>>> - up_read(&shrinker_rwsem);
>>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>> }
>>> static bool cgroup_reclaim(struct scan_control *sc)
>>> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>> {
>>> struct shrinker_info *info;
>>> unsigned long ret, freed = 0;
>>> + int srcu_idx;
>>> int i;
>>> if (!mem_cgroup_online(memcg))
>>> return 0;
>>> - if (!down_read_trylock(&shrinker_rwsem))
>>> - return 0;
>>> -
>>> - info = shrinker_info_protected(memcg, nid);
>>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>>> + info = shrinker_info_srcu(memcg, nid);
>>> if (unlikely(!info))
>>> goto unlock;
>>> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>> set_shrinker_bit(memcg, nid, i);
>>> }
>>> freed += ret;
>>> -
>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>> - freed = freed ? : 1;
>>> - break;
>>> - }
>>> }
>>> unlock:
>>> - up_read(&shrinker_rwsem);
>>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>> return freed;
>>> }
>>> #else /* CONFIG_MEMCG */
>>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-22 8:21 ` Qi Zheng
@ 2023-02-22 20:05 ` Kirill Tkhai
2023-02-23 4:37 ` Qi Zheng
0 siblings, 1 reply; 16+ messages in thread
From: Kirill Tkhai @ 2023-02-22 20:05 UTC (permalink / raw)
To: Qi Zheng
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi
On 22.02.2023 11:21, Qi Zheng wrote:
>
>
> On 2023/2/22 16:16, Qi Zheng wrote:
>> Hi Kirill,
>>
>> On 2023/2/22 05:43, Kirill Tkhai wrote:
>>> On 20.02.2023 12:16, Qi Zheng wrote:
>>>> Like global slab shrink, since commit 1cd0bd06093c<...>
>>>> static bool cgroup_reclaim(struct scan_control *sc)
>>>> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>> {
>>>> struct shrinker_info *info;
>>>> unsigned long ret, freed = 0;
>>>> + int srcu_idx;
>>>> int i;
>>>> if (!mem_cgroup_online(memcg))
>>>> return 0;
>>>> - if (!down_read_trylock(&shrinker_rwsem))
>>>> - return 0;
>>>> -
>>>> - info = shrinker_info_protected(memcg, nid);
>>>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>> + info = shrinker_info_srcu(memcg, nid);
>>>> if (unlikely(!info))
>>>> goto unlock;
>>>
>>> There is shrinker_nr_max dereference under this hunk. It's not in the patch:
>>>
>>> for_each_set_bit(i, info->map, shrinker_nr_max) {
>>>
>>> Since shrinker_nr_max may grow in parallel, this leads to access beyond allocated memory :(
>>
>> Oh, indeed.
>>
>>>
>>> It looks like we should save size of info->map as a new member of struct shrinker_info.
>>
>> Agree, then we only traverse info->map_size each time. Like below:
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index b6eda2ab205d..f1b5d2803007 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -97,6 +97,7 @@ struct shrinker_info {
>> struct rcu_head rcu;
>> atomic_long_t *nr_deferred;
>> unsigned long *map;
>> + int map_size;
Sure, like this. The only thing (after rethinking) I want to say is that despite "size" was
may suggestion, now it makes me think that name "size" is about size in bytes.
Possible, something like map_nr_max would be better here.
>> };
>>
>> struct lruvec_stats_percpu {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f94bfe540675..dd07eb107915 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -239,14 +239,20 @@ static void free_shrinker_info_rcu(struct rcu_head *head)
>> kvfree(container_of(head, struct shrinker_info, rcu));
>> }
>>
>> -static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>> - int map_size, int defer_size,
>> - int old_map_size, int old_defer_size)
>> +static int expand_one_shrinker_info(struct mem_cgroup *memcg, int new_nr_max,
>> + int old_nr_max)
>> {
>> + int map_size, defer_size, old_map_size, old_defer_size;
>> struct shrinker_info *new, *old;
>> struct mem_cgroup_per_node *pn;
>> int nid;
>> - int size = map_size + defer_size;
>> + int size;
>> +
>> + map_size = shrinker_map_size(new_nr_max);
>> + defer_size = shrinker_defer_size(new_nr_max);
>> + old_map_size = shrinker_map_size(shrinker_nr_max);
>> + old_defer_size = shrinker_defer_size(shrinker_nr_max);
>
> Perhaps these should still be calculated outside the loop as before.
Yeah, for me it's also better.
>> + size = map_size + defer_size;
>>
>> for_each_node(nid) {
>> pn = memcg->nodeinfo[nid];
>> @@ -261,6 +267,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>
>> new->nr_deferred = (atomic_long_t *)(new + 1);
>> new->map = (void *)new->nr_deferred + defer_size;
>> + new->map_size = new_nr_max;
>>
>> /* map: set all old bits, clear all new bits */
>> memset(new->map, (int)0xff, old_map_size);
>> @@ -310,6 +317,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>> }
>> info->nr_deferred = (atomic_long_t *)(info + 1);
>> info->map = (void *)info->nr_deferred + defer_size;
>> + info->map_size = shrinker_nr_max;
>> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>> }
>> mutex_unlock(&shrinker_mutex);
>> @@ -327,8 +335,6 @@ static int expand_shrinker_info(int new_id)
>> {
>> int ret = 0;
>> int new_nr_max = new_id + 1;
>> - int map_size, defer_size = 0;
>> - int old_map_size, old_defer_size = 0;
>> struct mem_cgroup *memcg;
>>
>> if (!need_expand(new_nr_max))
>> @@ -339,15 +345,9 @@ static int expand_shrinker_info(int new_id)
>>
>> lockdep_assert_held(&shrinker_mutex);
>>
>> - map_size = shrinker_map_size(new_nr_max);
>> - defer_size = shrinker_defer_size(new_nr_max);
>> - old_map_size = shrinker_map_size(shrinker_nr_max);
>> - old_defer_size = shrinker_defer_size(shrinker_nr_max);
>> -
>> memcg = mem_cgroup_iter(NULL, NULL, NULL);
>> do {
>> - ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>> - old_map_size, old_defer_size);
>> + ret = expand_one_shrinker_info(memcg, new_nr_max, shrinker_nr_max);
>> if (ret) {
>> mem_cgroup_iter_break(NULL, memcg);
>> goto out;
>> @@ -912,7 +912,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>> if (unlikely(!info))
>> goto unlock;
>>
>> - for_each_set_bit(i, info->map, shrinker_nr_max) {
>> + for_each_set_bit(i, info->map, info->map_size) {
>> struct shrink_control sc = {
>> .gfp_mask = gfp_mask,
>> .nid = nid,
>>
>> I will send the v2.
>>
>> Thanks,
>> Qi
>>
>>>
>>>> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>> set_shrinker_bit(memcg, nid, i);
>>>> }
>>>> freed += ret;
>>>> -
>>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> - freed = freed ? : 1;
>>>> - break;
>>>> - }
>>>> }
>>>> unlock:
>>>> - up_read(&shrinker_rwsem);
>>>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>> return freed;
>>>> }
>>>> #else /* CONFIG_MEMCG */
>>>
>>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-22 19:58 ` Kirill Tkhai
@ 2023-02-23 4:36 ` Qi Zheng
0 siblings, 0 replies; 16+ messages in thread
From: Qi Zheng @ 2023-02-23 4:36 UTC (permalink / raw)
To: Kirill Tkhai
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi
On 2023/2/23 03:58, Kirill Tkhai wrote:
> On 22.02.2023 10:32, Qi Zheng wrote:
>>
>>
>> On 2023/2/22 05:28, Kirill Tkhai wrote:
>>> On 20.02.2023 12:16, Qi Zheng wrote:
>> <...>
>>>> void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>>> {
>>>> - int i, nid;
>>>> + int i, nid, srcu_idx;
>>>> long nr;
>>>> struct mem_cgroup *parent;
>>>> struct shrinker_info *child_info, *parent_info;
>>>> @@ -429,16 +443,16 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
>>>> parent = root_mem_cgroup;
>>>> /* Prevent from concurrent shrinker_info expand */
>>>> - down_read(&shrinker_rwsem);
>>>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>
>>> Don't we still have to be protected against parallel expand_one_shrinker_info()?
>>>
>>> It looks like parent->nodeinfo[nid]->shrinker_info pointer may be changed in expand*
>>> right after we've dereferenced it here.
>>
>> Hi Kirill,
>>
>> Oh, indeed. We may wrongly reparent the child's nr_deferred to the old
>> parent's nr_deferred (it is about to be freed by call_srcu).
>>
>> The reparent_shrinker_deferred() will only be called on the offline
>> path (not a hotspot path), so we may be able to use shrinker_mutex
>> introduced later for protection. What do you think?
>
> It looks good for me. One more thing I'd analyzed is whether we want to have
> is two reparent_shrinker_deferred() are executing in parallel.
I see that mem_cgroup_css_offline() is already protected by
cgroup_mutex, so maybe shrinker_mutex is enough here. :)
>
> Possible, we should leave rwsem there as it was used before..
>
>>>
>>>> for_each_node(nid) {
>>>> - child_info = shrinker_info_protected(memcg, nid);
>>>> - parent_info = shrinker_info_protected(parent, nid);
>>>> + child_info = shrinker_info_srcu(memcg, nid);
>>>> + parent_info = shrinker_info_srcu(parent, nid);
>>>> for (i = 0; i < shrinker_nr_max; i++) {
>>>> nr = atomic_long_read(&child_info->nr_deferred[i]);
>>>> atomic_long_add(nr, &parent_info->nr_deferred[i]);
>>>> }
>>>> }
>>>> - up_read(&shrinker_rwsem);
>>>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>> }
>>>> static bool cgroup_reclaim(struct scan_control *sc)
>>>> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>> {
>>>> struct shrinker_info *info;
>>>> unsigned long ret, freed = 0;
>>>> + int srcu_idx;
>>>> int i;
>>>> if (!mem_cgroup_online(memcg))
>>>> return 0;
>>>> - if (!down_read_trylock(&shrinker_rwsem))
>>>> - return 0;
>>>> -
>>>> - info = shrinker_info_protected(memcg, nid);
>>>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>> + info = shrinker_info_srcu(memcg, nid);
>>>> if (unlikely(!info))
>>>> goto unlock;
>>>> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>> set_shrinker_bit(memcg, nid, i);
>>>> }
>>>> freed += ret;
>>>> -
>>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> - freed = freed ? : 1;
>>>> - break;
>>>> - }
>>>> }
>>>> unlock:
>>>> - up_read(&shrinker_rwsem);
>>>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>> return freed;
>>>> }
>>>> #else /* CONFIG_MEMCG */
>>>
>>
>
--
Thanks,
Qi
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: make memcg slab shrink lockless
2023-02-22 20:05 ` Kirill Tkhai
@ 2023-02-23 4:37 ` Qi Zheng
0 siblings, 0 replies; 16+ messages in thread
From: Qi Zheng @ 2023-02-23 4:37 UTC (permalink / raw)
To: Kirill Tkhai
Cc: sultan, dave, penguin-kernel, paulmck, linux-mm, linux-kernel,
Andrew Morton, Johannes Weiner, Shakeel Butt, Michal Hocko,
Roman Gushchin, Muchun Song, David Hildenbrand, Yang Shi
On 2023/2/23 04:05, Kirill Tkhai wrote:
> On 22.02.2023 11:21, Qi Zheng wrote:
>>
>>
>> On 2023/2/22 16:16, Qi Zheng wrote:
>>> Hi Kirill,
>>>
>>> On 2023/2/22 05:43, Kirill Tkhai wrote:
>>>> On 20.02.2023 12:16, Qi Zheng wrote:
>>>>> Like global slab shrink, since commit 1cd0bd06093c<...>
>>>>> static bool cgroup_reclaim(struct scan_control *sc)
>>>>> @@ -891,15 +905,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>> {
>>>>> struct shrinker_info *info;
>>>>> unsigned long ret, freed = 0;
>>>>> + int srcu_idx;
>>>>> int i;
>>>>> if (!mem_cgroup_online(memcg))
>>>>> return 0;
>>>>> - if (!down_read_trylock(&shrinker_rwsem))
>>>>> - return 0;
>>>>> -
>>>>> - info = shrinker_info_protected(memcg, nid);
>>>>> + srcu_idx = srcu_read_lock(&shrinker_srcu);
>>>>> + info = shrinker_info_srcu(memcg, nid);
>>>>> if (unlikely(!info))
>>>>> goto unlock;
>>>>
>>>> There is shrinker_nr_max dereference under this hunk. It's not in the patch:
>>>>
>>>> for_each_set_bit(i, info->map, shrinker_nr_max) {
>>>>
>>>> Since shrinker_nr_max may grow in parallel, this leads to access beyond allocated memory :(
>>>
>>> Oh, indeed.
>>>
>>>>
>>>> It looks like we should save size of info->map as a new member of struct shrinker_info.
>>>
>>> Agree, then we only traverse info->map_size each time. Like below:
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index b6eda2ab205d..f1b5d2803007 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -97,6 +97,7 @@ struct shrinker_info {
>>> struct rcu_head rcu;
>>> atomic_long_t *nr_deferred;
>>> unsigned long *map;
>>> + int map_size;
>
> Sure, like this. The only thing (after rethinking) I want to say is that despite "size" was
> may suggestion, now it makes me think that name "size" is about size in bytes.
>
> Possible, something like map_nr_max would be better here.
Agree. Will change to it.
>
>>> };
>>>
>>> struct lruvec_stats_percpu {
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index f94bfe540675..dd07eb107915 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -239,14 +239,20 @@ static void free_shrinker_info_rcu(struct rcu_head *head)
>>> kvfree(container_of(head, struct shrinker_info, rcu));
>>> }
>>>
>>> -static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>> - int map_size, int defer_size,
>>> - int old_map_size, int old_defer_size)
>>> +static int expand_one_shrinker_info(struct mem_cgroup *memcg, int new_nr_max,
>>> + int old_nr_max)
>>> {
>>> + int map_size, defer_size, old_map_size, old_defer_size;
>>> struct shrinker_info *new, *old;
>>> struct mem_cgroup_per_node *pn;
>>> int nid;
>>> - int size = map_size + defer_size;
>>> + int size;
>>> +
>>> + map_size = shrinker_map_size(new_nr_max);
>>> + defer_size = shrinker_defer_size(new_nr_max);
>>> + old_map_size = shrinker_map_size(shrinker_nr_max);
>>> + old_defer_size = shrinker_defer_size(shrinker_nr_max);
>>
>> Perhaps these should still be calculated outside the loop as before.
>
> Yeah, for me it's also better.
>
>>> + size = map_size + defer_size;
>>>
>>> for_each_node(nid) {
>>> pn = memcg->nodeinfo[nid];
>>> @@ -261,6 +267,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
>>>
>>> new->nr_deferred = (atomic_long_t *)(new + 1);
>>> new->map = (void *)new->nr_deferred + defer_size;
>>> + new->map_size = new_nr_max;
>>>
>>> /* map: set all old bits, clear all new bits */
>>> memset(new->map, (int)0xff, old_map_size);
>>> @@ -310,6 +317,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
>>> }
>>> info->nr_deferred = (atomic_long_t *)(info + 1);
>>> info->map = (void *)info->nr_deferred + defer_size;
>>> + info->map_size = shrinker_nr_max;
>>> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
>>> }
>>> mutex_unlock(&shrinker_mutex);
>>> @@ -327,8 +335,6 @@ static int expand_shrinker_info(int new_id)
>>> {
>>> int ret = 0;
>>> int new_nr_max = new_id + 1;
>>> - int map_size, defer_size = 0;
>>> - int old_map_size, old_defer_size = 0;
>>> struct mem_cgroup *memcg;
>>>
>>> if (!need_expand(new_nr_max))
>>> @@ -339,15 +345,9 @@ static int expand_shrinker_info(int new_id)
>>>
>>> lockdep_assert_held(&shrinker_mutex);
>>>
>>> - map_size = shrinker_map_size(new_nr_max);
>>> - defer_size = shrinker_defer_size(new_nr_max);
>>> - old_map_size = shrinker_map_size(shrinker_nr_max);
>>> - old_defer_size = shrinker_defer_size(shrinker_nr_max);
>>> -
>>> memcg = mem_cgroup_iter(NULL, NULL, NULL);
>>> do {
>>> - ret = expand_one_shrinker_info(memcg, map_size, defer_size,
>>> - old_map_size, old_defer_size);
>>> + ret = expand_one_shrinker_info(memcg, new_nr_max, shrinker_nr_max);
>>> if (ret) {
>>> mem_cgroup_iter_break(NULL, memcg);
>>> goto out;
>>> @@ -912,7 +912,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>> if (unlikely(!info))
>>> goto unlock;
>>>
>>> - for_each_set_bit(i, info->map, shrinker_nr_max) {
>>> + for_each_set_bit(i, info->map, info->map_size) {
>>> struct shrink_control sc = {
>>> .gfp_mask = gfp_mask,
>>> .nid = nid,
>>>
>>> I will send the v2.
>>>
>>> Thanks,
>>> Qi
>>>
>>>>
>>>>> @@ -949,14 +962,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>>>>> set_shrinker_bit(memcg, nid, i);
>>>>> }
>>>>> freed += ret;
>>>>> -
>>>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> - freed = freed ? : 1;
>>>>> - break;
>>>>> - }
>>>>> }
>>>>> unlock:
>>>>> - up_read(&shrinker_rwsem);
>>>>> + srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>>> return freed;
>>>>> }
>>>>> #else /* CONFIG_MEMCG */
>>>>
>>>
>>
>
--
Thanks,
Qi
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2023-02-23 4:38 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-20 9:16 [PATCH 0/5] make slab shrink lockless Qi Zheng
2023-02-20 9:16 ` [PATCH 1/5] mm: vmscan: make global " Qi Zheng
2023-02-20 9:16 ` [PATCH 2/5] mm: vmscan: make memcg " Qi Zheng
2023-02-21 21:28 ` Kirill Tkhai
2023-02-22 7:32 ` Qi Zheng
2023-02-22 19:58 ` Kirill Tkhai
2023-02-23 4:36 ` Qi Zheng
2023-02-21 21:43 ` Kirill Tkhai
2023-02-22 8:16 ` Qi Zheng
2023-02-22 8:21 ` Qi Zheng
2023-02-22 20:05 ` Kirill Tkhai
2023-02-23 4:37 ` Qi Zheng
2023-02-20 9:16 ` [PATCH 3/5] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
2023-02-20 13:22 ` kernel test robot
2023-02-20 9:16 ` [PATCH 4/5] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
2023-02-20 9:16 ` [PATCH 5/5] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).