linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink
@ 2023-06-22  8:53 Qi Zheng
  2023-06-22  8:53 ` [PATCH 01/29] mm: shrinker: add shrinker::private_data field Qi Zheng
                   ` (29 more replies)
  0 siblings, 30 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

Hi all,

1. Background
=============

We used to implement the lockless slab shrink with SRCU [1], but then the
kernel test robot reported a -88.8% regression in the
stress-ng.ramfs.ops_per_sec test case [2], so we reverted it [3].

This patch series aims to re-implement the lockless slab shrink using the
refcount+RCU method proposed by Dave Chinner [4].

[1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
[4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/

2. Implementation
=================

Currently, shrinker instances can be divided into the following three types:

a) a global shrinker instance statically defined in the kernel, such as
   workingset_shadow_shrinker.

b) a global shrinker instance statically defined in a kernel module, such as
   mmu_shrinker in x86.

c) a shrinker instance embedded in another structure.

For *case a*, the memory of the shrinker instance is never freed. For *case b*,
the memory of the shrinker instance is freed after the module is unloaded, but
free_module() will call synchronize_rcu() to wait for RCU read-side critical
sections to exit before that memory is freed. For *case c*, we need to
dynamically allocate these shrinker instances, so that the memory of a shrinker
instance can be freed independently by calling kfree_rcu(). Then we can use
rcu_read_{lock,unlock}() to ensure that the shrinker instance stays valid while
it is being used.

The shrinker::refcount mechanism ensures that a shrinker instance will not be
run again after unregistration. So the structure that records the pointer to
the shrinker instance can be safely freed without waiting for the RCU read-side
critical section to end.

In this way, while implementing the lockless slab shrink, we no longer need to
block in unregister_shrinker() waiting for RCU read-side critical sections to
exit.
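
The core of the scheme can be sketched as follows. This is only a minimal
illustration of the refcount+RCU idea described above; the helper names
shrinker_try_get()/shrinker_put() and the function wrapper are assumptions
made for this sketch, not the exact API introduced by the series.

```
static unsigned long shrink_slab_lockless(gfp_t gfp_mask, int nid,
					  struct mem_cgroup *memcg,
					  int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	/* No shrinker_rwsem: the shrinker list is only protected by RCU. */
	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		/* Skip shrinkers that are already being unregistered. */
		if (!shrinker_try_get(shrinker))
			continue;

		/*
		 * Drop the RCU read lock across the callback, which may
		 * sleep; the reference we hold keeps the shrinker alive.
		 */
		rcu_read_unlock();

		freed += do_shrink_slab(&sc, shrinker, priority);

		/*
		 * Re-acquire the RCU read lock before dropping the
		 * reference, so that the list node stays valid and the
		 * traversal can continue safely. The final put wakes up
		 * the waiter in unregister_shrinker().
		 */
		rcu_read_lock();
		shrinker_put(shrinker);
	}
	rcu_read_unlock();

	return freed;
}
```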

PATCH 1 ~ 2: infrastructure for dynamically allocating shrinker instances
PATCH 3 ~ 21: dynamically allocate the shrinker instances in case c
PATCH 22: introduce pool_shrink_rwsem to implement private synchronize_shrinkers()
PATCH 23 ~ 28: implement the lockless slab shrink
PATCH 29: move shrinker-related code into a separate file

3. Testing
==========

3.1 slab shrink stress test
---------------------------

We can reproduce the down_read_trylock() hotspot through the following script:

```

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run the test and touch commands. Then we can use
the following perf command to view the hotspots:

perf top -U -F 999 [-g]

1) Before applying this patchset:

  35.34%  [kernel]             [k] down_read_trylock
  18.44%  [kernel]             [k] shrink_slab
  15.98%  [kernel]             [k] pv_native_safe_halt
  15.08%  [kernel]             [k] up_read
   5.33%  [kernel]             [k] idr_find
   2.71%  [kernel]             [k] _find_next_bit
   2.21%  [kernel]             [k] shrink_node
   1.29%  [kernel]             [k] shrink_lruvec
   0.66%  [kernel]             [k] do_shrink_slab
   0.33%  [kernel]             [k] list_lru_count_one
   0.33%  [kernel]             [k] __radix_tree_lookup
   0.25%  [kernel]             [k] mem_cgroup_iter

-   82.19%    19.49%  [kernel]                  [k] shrink_slab
   - 62.00% shrink_slab
        36.37% down_read_trylock
        15.52% up_read
        5.48% idr_find
        3.38% _find_next_bit
      + 0.98% do_shrink_slab

2) After applying this patchset:

  46.83%  [kernel]           [k] shrink_slab
  20.52%  [kernel]           [k] pv_native_safe_halt
   8.85%  [kernel]           [k] do_shrink_slab
   7.71%  [kernel]           [k] _find_next_bit
   1.72%  [kernel]           [k] xas_descend
   1.70%  [kernel]           [k] shrink_node
   1.44%  [kernel]           [k] shrink_lruvec
   1.43%  [kernel]           [k] mem_cgroup_iter
   1.28%  [kernel]           [k] xas_load
   0.89%  [kernel]           [k] super_cache_count
   0.84%  [kernel]           [k] xas_start
   0.66%  [kernel]           [k] list_lru_count_one

-   65.50%    40.44%  [kernel]                  [k] shrink_slab
   - 22.96% shrink_slab
        13.11% _find_next_bit
      - 9.91% do_shrink_slab
         - 1.59% super_cache_count
              0.92% list_lru_count_one

We can see that the top perf hotspot is now shrink_slab itself, which is what
we expect.

3.2 registration and unregistration stress test
-----------------------------------------------

Run the command below to test:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
 ramfs:
          1 System Management Interrupt
 for a 60.03s run time:
    5762.40s available CPU time
       7.71s user time   (  0.13%)
     226.93s system time (  3.94%)
     234.64s total time  (  4.07%)
 load average: 8.54 3.06 2.11
 passed: 9: ramfs (9)
 failed: 0
 skipped: 0
 successful run completed in 60.03s (1 min, 0.03 secs)

2) After applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
 ramfs:
          4 System Management Interrupts
 for a 60.12s run time:
    5771.95s available CPU time
       7.44s user time   (  0.13%)
     230.22s system time (  3.99%)
     237.66s total time  (  4.12%)
 load average: 8.18 2.43 0.84
 passed: 9: ramfs (9)
 failed: 0
 skipped: 0
 successful run completed in 60.12s (1 min, 0.12 secs)

We can see that the bogo ops/s has hardly changed.

This series is based on next-20230613.

Comments and suggestions are welcome.

Thanks,
Qi.

Qi Zheng (29):
  mm: shrinker: add shrinker::private_data field
  mm: vmscan: introduce some helpers for dynamically allocating shrinker
  drm/i915: dynamically allocate the i915_gem_mm shrinker
  drm/msm: dynamically allocate the drm-msm_gem shrinker
  drm/panfrost: dynamically allocate the drm-panfrost shrinker
  dm: dynamically allocate the dm-bufio shrinker
  dm zoned: dynamically allocate the dm-zoned-meta shrinker
  md/raid5: dynamically allocate the md-raid5 shrinker
  bcache: dynamically allocate the md-bcache shrinker
  vmw_balloon: dynamically allocate the vmw-balloon shrinker
  virtio_balloon: dynamically allocate the virtio-balloon shrinker
  mbcache: dynamically allocate the mbcache shrinker
  ext4: dynamically allocate the ext4-es shrinker
  jbd2,ext4: dynamically allocate the jbd2-journal shrinker
  NFSD: dynamically allocate the nfsd-client shrinker
  NFSD: dynamically allocate the nfsd-reply shrinker
  xfs: dynamically allocate the xfs-buf shrinker
  xfs: dynamically allocate the xfs-inodegc shrinker
  xfs: dynamically allocate the xfs-qm shrinker
  zsmalloc: dynamically allocate the mm-zspool shrinker
  fs: super: dynamically allocate the s_shrink
  drm/ttm: introduce pool_shrink_rwsem
  mm: shrinker: add refcount and completion_wait fields
  mm: vmscan: make global slab shrink lockless
  mm: vmscan: make memcg slab shrink lockless
  mm: shrinker: make count and scan in shrinker debugfs lockless
  mm: vmscan: hold write lock to reparent shrinker nr_deferred
  mm: shrinkers: convert shrinker_rwsem to mutex
  mm: shrinker: move shrinker-related code into a separate file

 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  27 +-
 drivers/gpu/drm/i915/i915_drv.h               |   3 +-
 drivers/gpu/drm/msm/msm_drv.h                 |   2 +-
 drivers/gpu/drm/msm/msm_gem_shrinker.c        |  25 +-
 drivers/gpu/drm/panfrost/panfrost_device.h    |   2 +-
 .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |  24 +-
 drivers/gpu/drm/ttm/ttm_pool.c                |  15 +
 drivers/md/bcache/bcache.h                    |   2 +-
 drivers/md/bcache/btree.c                     |  23 +-
 drivers/md/bcache/sysfs.c                     |   2 +-
 drivers/md/dm-bufio.c                         |  23 +-
 drivers/md/dm-cache-metadata.c                |   2 +-
 drivers/md/dm-thin-metadata.c                 |   2 +-
 drivers/md/dm-zoned-metadata.c                |  25 +-
 drivers/md/raid5.c                            |  28 +-
 drivers/md/raid5.h                            |   2 +-
 drivers/misc/vmw_balloon.c                    |  16 +-
 drivers/virtio/virtio_balloon.c               |  26 +-
 fs/btrfs/super.c                              |   2 +-
 fs/ext4/ext4.h                                |   2 +-
 fs/ext4/extents_status.c                      |  21 +-
 fs/jbd2/journal.c                             |  32 +-
 fs/kernfs/mount.c                             |   2 +-
 fs/mbcache.c                                  |  39 +-
 fs/nfsd/netns.h                               |   4 +-
 fs/nfsd/nfs4state.c                           |  20 +-
 fs/nfsd/nfscache.c                            |  33 +-
 fs/proc/root.c                                |   2 +-
 fs/super.c                                    |  40 +-
 fs/xfs/xfs_buf.c                              |  25 +-
 fs/xfs/xfs_buf.h                              |   2 +-
 fs/xfs/xfs_icache.c                           |  27 +-
 fs/xfs/xfs_mount.c                            |   4 +-
 fs/xfs/xfs_mount.h                            |   2 +-
 fs/xfs/xfs_qm.c                               |  24 +-
 fs/xfs/xfs_qm.h                               |   2 +-
 include/linux/fs.h                            |   2 +-
 include/linux/jbd2.h                          |   2 +-
 include/linux/shrinker.h                      |  35 +-
 mm/Makefile                                   |   4 +-
 mm/shrinker.c                                 | 750 ++++++++++++++++++
 mm/shrinker_debug.c                           |  26 +-
 mm/vmscan.c                                   | 702 ----------------
 mm/zsmalloc.c                                 |  28 +-
 44 files changed, 1128 insertions(+), 953 deletions(-)
 create mode 100644 mm/shrinker.c

-- 
2.30.2



* [PATCH 01/29] mm: shrinker: add shrinker::private_data field
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22 14:47   ` Vlastimil Babka
  2023-06-22  8:53 ` [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker Qi Zheng
                   ` (28 subsequent siblings)
  29 siblings, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

To prepare for dynamically allocating shrinker instances that are
currently embedded in other structures, add a private_data field to
struct shrinker, so that shrinker::private_data can be used to record
and retrieve a pointer to the structure that embeds the shrinker.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/shrinker.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 224293b2dd06..43e6fcabbf51 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -70,6 +70,8 @@ struct shrinker {
 	int seeks;	/* seeks to recreate an obj */
 	unsigned flags;
 
+	void *private_data;
+
 	/* These are for internal use */
 	struct list_head list;
 #ifdef CONFIG_MEMCG
-- 
2.30.2
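
As an illustration of how the new field is meant to be used (struct
foo_device and its members below are hypothetical, chosen only for this
example): once the shrinker is referenced through a pointer instead of
being embedded, the callbacks read the owning structure back from
shrinker::private_data instead of using container_of().

```
struct foo_device {
	struct shrinker *shrinker;	/* a pointer, no longer embedded */
	unsigned long nr_cached;
};

static unsigned long foo_shrinker_count(struct shrinker *shrink,
					struct shrink_control *sc)
{
	/* Was: container_of(shrink, struct foo_device, shrinker) */
	struct foo_device *foo = shrink->private_data;

	return foo->nr_cached;
}
```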



* [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
  2023-06-22  8:53 ` [PATCH 01/29] mm: shrinker: add shrinker::private_data field Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-23  6:12   ` Dave Chinner
  2023-06-22  8:53 ` [PATCH 03/29] drm/i915: dynamically allocate the i915_gem_mm shrinker Qi Zheng
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

Introduce some helpers for dynamically allocating shrinker instances.
Their uses are as follows:

1. shrinker_alloc_and_init()

Used to allocate and initialize a shrinker instance; the priv_data
parameter is used to pass a pointer to the structure in which the
shrinker instance was previously embedded.

2. shrinker_free()

Used to free the shrinker instance when shrinker registration fails.

3. unregister_and_free_shrinker()

Used to unregister and free the shrinker instance; the kfree() here
will be changed to kfree_rcu() later in the series.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/shrinker.h | 12 ++++++++++++
 mm/vmscan.c              | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 43e6fcabbf51..8e9ba6fa3fcc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -107,6 +107,18 @@ extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
 extern void synchronize_shrinkers(void);
 
+typedef unsigned long (*count_objects_cb)(struct shrinker *s,
+					  struct shrink_control *sc);
+typedef unsigned long (*scan_objects_cb)(struct shrinker *s,
+					 struct shrink_control *sc);
+
+struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
+					 scan_objects_cb scan, long batch,
+					 int seeks, unsigned flags,
+					 void *priv_data);
+void shrinker_free(struct shrinker *shrinker);
+void unregister_and_free_shrinker(struct shrinker *shrinker);
+
 #ifdef CONFIG_SHRINKER_DEBUG
 extern int shrinker_debugfs_add(struct shrinker *shrinker);
 extern struct dentry *shrinker_debugfs_detach(struct shrinker *shrinker,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 45d17c7cc555..64ff598fbad9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -809,6 +809,41 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
+struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
+					 scan_objects_cb scan, long batch,
+					 int seeks, unsigned flags,
+					 void *priv_data)
+{
+	struct shrinker *shrinker;
+
+	shrinker = kzalloc(sizeof(struct shrinker), GFP_KERNEL);
+	if (!shrinker)
+		return NULL;
+
+	shrinker->count_objects = count;
+	shrinker->scan_objects = scan;
+	shrinker->batch = batch;
+	shrinker->seeks = seeks;
+	shrinker->flags = flags;
+	shrinker->private_data = priv_data;
+
+	return shrinker;
+}
+EXPORT_SYMBOL(shrinker_alloc_and_init);
+
+void shrinker_free(struct shrinker *shrinker)
+{
+	kfree(shrinker);
+}
+EXPORT_SYMBOL(shrinker_free);
+
+void unregister_and_free_shrinker(struct shrinker *shrinker)
+{
+	unregister_shrinker(shrinker);
+	kfree(shrinker);
+}
+EXPORT_SYMBOL(unregister_and_free_shrinker);
+
 /**
  * synchronize_shrinkers - Wait for all running shrinkers to complete.
  *
-- 
2.30.2
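
The conversion pattern that the following patches apply with these helpers
looks roughly like the sketch below (struct foo_device, foo_count and
foo_scan are hypothetical names used only for illustration):

```
static int foo_register_shrinker(struct foo_device *foo)
{
	foo->shrinker = shrinker_alloc_and_init(foo_count, foo_scan,
						0, DEFAULT_SEEKS, 0, foo);
	if (!foo->shrinker)
		return -ENOMEM;

	if (register_shrinker(foo->shrinker, "foo-shrinker")) {
		/* Never registered, so a plain free is enough here. */
		shrinker_free(foo->shrinker);
		return -ENOMEM;
	}

	return 0;
}

static void foo_unregister_shrinker(struct foo_device *foo)
{
	/* Unregister and free; the free becomes kfree_rcu() later. */
	unregister_and_free_shrinker(foo->shrinker);
}
```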



* [PATCH 03/29] drm/i915: dynamically allocate the i915_gem_mm shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
  2023-06-22  8:53 ` [PATCH 01/29] mm: shrinker: add shrinker::private_data field Qi Zheng
  2023-06-22  8:53 ` [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 04/29] drm/msm: dynamically allocate the drm-msm_gem shrinker Qi Zheng
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the i915_gem_mm shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct drm_i915_private no longer needs to wait for RCU
read-side critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c | 27 ++++++++++----------
 drivers/gpu/drm/i915/i915_drv.h              |  3 ++-
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
index 214763942aa2..4dcdace26a08 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c
@@ -284,8 +284,7 @@ unsigned long i915_gem_shrink_all(struct drm_i915_private *i915)
 static unsigned long
 i915_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 {
-	struct drm_i915_private *i915 =
-		container_of(shrinker, struct drm_i915_private, mm.shrinker);
+	struct drm_i915_private *i915 = shrinker->private_data;
 	unsigned long num_objects;
 	unsigned long count;
 
@@ -302,8 +301,8 @@ i915_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 	if (num_objects) {
 		unsigned long avg = 2 * count / num_objects;
 
-		i915->mm.shrinker.batch =
-			max((i915->mm.shrinker.batch + avg) >> 1,
+		i915->mm.shrinker->batch =
+			max((i915->mm.shrinker->batch + avg) >> 1,
 			    128ul /* default SHRINK_BATCH */);
 	}
 
@@ -313,8 +312,7 @@ i915_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 static unsigned long
 i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 {
-	struct drm_i915_private *i915 =
-		container_of(shrinker, struct drm_i915_private, mm.shrinker);
+	struct drm_i915_private *i915 = shrinker->private_data;
 	unsigned long freed;
 
 	sc->nr_scanned = 0;
@@ -422,12 +420,15 @@ i915_gem_shrinker_vmap(struct notifier_block *nb, unsigned long event, void *ptr
 
 void i915_gem_driver_register__shrinker(struct drm_i915_private *i915)
 {
-	i915->mm.shrinker.scan_objects = i915_gem_shrinker_scan;
-	i915->mm.shrinker.count_objects = i915_gem_shrinker_count;
-	i915->mm.shrinker.seeks = DEFAULT_SEEKS;
-	i915->mm.shrinker.batch = 4096;
-	drm_WARN_ON(&i915->drm, register_shrinker(&i915->mm.shrinker,
-						  "drm-i915_gem"));
+	i915->mm.shrinker = shrinker_alloc_and_init(i915_gem_shrinker_count,
+						    i915_gem_shrinker_scan,
+						    4096, DEFAULT_SEEKS, 0,
+						    i915);
+	if (i915->mm.shrinker &&
+	    register_shrinker(i915->mm.shrinker, "drm-i915_gem")) {
+		shrinker_free(i915->mm.shrinker);
+		drm_WARN_ON(&i915->drm, 1);
+	}
 
 	i915->mm.oom_notifier.notifier_call = i915_gem_shrinker_oom;
 	drm_WARN_ON(&i915->drm, register_oom_notifier(&i915->mm.oom_notifier));
@@ -443,7 +444,7 @@ void i915_gem_driver_unregister__shrinker(struct drm_i915_private *i915)
 		    unregister_vmap_purge_notifier(&i915->mm.vmap_notifier));
 	drm_WARN_ON(&i915->drm,
 		    unregister_oom_notifier(&i915->mm.oom_notifier));
-	unregister_shrinker(&i915->mm.shrinker);
+	unregister_and_free_shrinker(i915->mm.shrinker);
 }
 
 void i915_gem_shrinker_taints_mutex(struct drm_i915_private *i915,
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b4cf6f0f636d..06b04428596d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -163,7 +163,8 @@ struct i915_gem_mm {
 
 	struct notifier_block oom_notifier;
 	struct notifier_block vmap_notifier;
-	struct shrinker shrinker;
+
+	struct shrinker *shrinker;
 
 #ifdef CONFIG_MMU_NOTIFIER
 	/**
-- 
2.30.2



* [PATCH 04/29] drm/msm: dynamically allocate the drm-msm_gem shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (2 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 03/29] drm/i915: dynamically allocate the i915_gem_mm shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker Qi Zheng
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the drm-msm_gem shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct msm_drm_private no longer needs to wait for RCU
read-side critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/gpu/drm/msm/msm_drv.h          |  2 +-
 drivers/gpu/drm/msm/msm_gem_shrinker.c | 25 ++++++++++++++-----------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_drv.h b/drivers/gpu/drm/msm/msm_drv.h
index e13a8cbd61c9..4f3ba55058cd 100644
--- a/drivers/gpu/drm/msm/msm_drv.h
+++ b/drivers/gpu/drm/msm/msm_drv.h
@@ -217,7 +217,7 @@ struct msm_drm_private {
 	} vram;
 
 	struct notifier_block vmap_notifier;
-	struct shrinker shrinker;
+	struct shrinker *shrinker;
 
 	struct drm_atomic_state *pm_state;
 
diff --git a/drivers/gpu/drm/msm/msm_gem_shrinker.c b/drivers/gpu/drm/msm/msm_gem_shrinker.c
index f38296ad8743..db7582ae1f19 100644
--- a/drivers/gpu/drm/msm/msm_gem_shrinker.c
+++ b/drivers/gpu/drm/msm/msm_gem_shrinker.c
@@ -34,8 +34,7 @@ static bool can_block(struct shrink_control *sc)
 static unsigned long
 msm_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 {
-	struct msm_drm_private *priv =
-		container_of(shrinker, struct msm_drm_private, shrinker);
+	struct msm_drm_private *priv = shrinker->private_data;
 	unsigned count = priv->lru.dontneed.count;
 
 	if (can_swap())
@@ -100,8 +99,7 @@ active_evict(struct drm_gem_object *obj)
 static unsigned long
 msm_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 {
-	struct msm_drm_private *priv =
-		container_of(shrinker, struct msm_drm_private, shrinker);
+	struct msm_drm_private *priv = shrinker->private_data;
 	struct {
 		struct drm_gem_lru *lru;
 		bool (*shrink)(struct drm_gem_object *obj);
@@ -151,7 +149,7 @@ msm_gem_shrinker_shrink(struct drm_device *dev, unsigned long nr_to_scan)
 	int ret;
 
 	fs_reclaim_acquire(GFP_KERNEL);
-	ret = msm_gem_shrinker_scan(&priv->shrinker, &sc);
+	ret = msm_gem_shrinker_scan(priv->shrinker, &sc);
 	fs_reclaim_release(GFP_KERNEL);
 
 	return ret;
@@ -213,10 +211,15 @@ msm_gem_shrinker_vmap(struct notifier_block *nb, unsigned long event, void *ptr)
 void msm_gem_shrinker_init(struct drm_device *dev)
 {
 	struct msm_drm_private *priv = dev->dev_private;
-	priv->shrinker.count_objects = msm_gem_shrinker_count;
-	priv->shrinker.scan_objects = msm_gem_shrinker_scan;
-	priv->shrinker.seeks = DEFAULT_SEEKS;
-	WARN_ON(register_shrinker(&priv->shrinker, "drm-msm_gem"));
+
+	priv->shrinker = shrinker_alloc_and_init(msm_gem_shrinker_count,
+						 msm_gem_shrinker_scan, 0,
+						 DEFAULT_SEEKS, 0, priv);
+	if (priv->shrinker &&
+	    register_shrinker(priv->shrinker, "drm-msm_gem")) {
+		shrinker_free(priv->shrinker);
+		WARN_ON(1);
+	}
 
 	priv->vmap_notifier.notifier_call = msm_gem_shrinker_vmap;
 	WARN_ON(register_vmap_purge_notifier(&priv->vmap_notifier));
@@ -232,8 +235,8 @@ void msm_gem_shrinker_cleanup(struct drm_device *dev)
 {
 	struct msm_drm_private *priv = dev->dev_private;
 
-	if (priv->shrinker.nr_deferred) {
+	if (priv->shrinker->nr_deferred) {
 		WARN_ON(unregister_vmap_purge_notifier(&priv->vmap_notifier));
-		unregister_shrinker(&priv->shrinker);
+		unregister_and_free_shrinker(priv->shrinker);
 	}
 }
-- 
2.30.2



* [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (3 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 04/29] drm/msm: dynamically allocate the drm-msm_gem shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-23 13:33   ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 06/29] dm: dynamically allocate the dm-bufio shrinker Qi Zheng
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the drm-panfrost shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct panfrost_device no longer needs to wait for RCU
read-side critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.h    |  2 +-
 .../gpu/drm/panfrost/panfrost_gem_shrinker.c  | 24 ++++++++++---------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index b0126b9fbadc..e667e5689353 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -118,7 +118,7 @@ struct panfrost_device {
 
 	struct mutex shrinker_lock;
 	struct list_head shrinker_list;
-	struct shrinker shrinker;
+	struct shrinker *shrinker;
 
 	struct panfrost_devfreq pfdevfreq;
 };
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
index bf0170782f25..2a5513eb9e1f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
@@ -18,8 +18,7 @@
 static unsigned long
 panfrost_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 {
-	struct panfrost_device *pfdev =
-		container_of(shrinker, struct panfrost_device, shrinker);
+	struct panfrost_device *pfdev = shrinker->private_data;
 	struct drm_gem_shmem_object *shmem;
 	unsigned long count = 0;
 
@@ -65,8 +64,7 @@ static bool panfrost_gem_purge(struct drm_gem_object *obj)
 static unsigned long
 panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 {
-	struct panfrost_device *pfdev =
-		container_of(shrinker, struct panfrost_device, shrinker);
+	struct panfrost_device *pfdev = shrinker->private_data;
 	struct drm_gem_shmem_object *shmem, *tmp;
 	unsigned long freed = 0;
 
@@ -100,10 +98,15 @@ panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 void panfrost_gem_shrinker_init(struct drm_device *dev)
 {
 	struct panfrost_device *pfdev = dev->dev_private;
-	pfdev->shrinker.count_objects = panfrost_gem_shrinker_count;
-	pfdev->shrinker.scan_objects = panfrost_gem_shrinker_scan;
-	pfdev->shrinker.seeks = DEFAULT_SEEKS;
-	WARN_ON(register_shrinker(&pfdev->shrinker, "drm-panfrost"));
+
+	pfdev->shrinker = shrinker_alloc_and_init(panfrost_gem_shrinker_count,
+						  panfrost_gem_shrinker_scan, 0,
+						  DEFAULT_SEEKS, 0, pfdev);
+	if (pfdev->shrinker &&
+	    register_shrinker(pfdev->shrinker, "drm-panfrost")) {
+		shrinker_free(pfdev->shrinker);
+		WARN_ON(1);
+	}
 }
 
 /**
@@ -116,7 +119,6 @@ void panfrost_gem_shrinker_cleanup(struct drm_device *dev)
 {
 	struct panfrost_device *pfdev = dev->dev_private;
 
-	if (pfdev->shrinker.nr_deferred) {
-		unregister_shrinker(&pfdev->shrinker);
-	}
+	if (pfdev->shrinker->nr_deferred)
+		unregister_and_free_shrinker(pfdev->shrinker);
 }
-- 
2.30.2



* [PATCH 06/29] dm: dynamically allocate the dm-bufio shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (4 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 07/29] dm zoned: dynamically allocate the dm-zoned-meta shrinker Qi Zheng
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the dm-bufio shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct dm_bufio_client no longer needs to wait for RCU
read-side critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/md/dm-bufio.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index eea977662e81..9472470d456d 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -963,7 +963,7 @@ struct dm_bufio_client {
 
 	sector_t start;
 
-	struct shrinker shrinker;
+	struct shrinker *shrinker;
 	struct work_struct shrink_work;
 	atomic_long_t need_shrink;
 
@@ -2385,7 +2385,7 @@ static unsigned long dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink
 {
 	struct dm_bufio_client *c;
 
-	c = container_of(shrink, struct dm_bufio_client, shrinker);
+	c = shrink->private_data;
 	atomic_long_add(sc->nr_to_scan, &c->need_shrink);
 	queue_work(dm_bufio_wq, &c->shrink_work);
 
@@ -2394,7 +2394,7 @@ static unsigned long dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink
 
 static unsigned long dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
-	struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
+	struct dm_bufio_client *c = shrink->private_data;
 	unsigned long count = cache_total(&c->cache);
 	unsigned long retain_target = get_retain_buffers(c);
 	unsigned long queued_for_cleanup = atomic_long_read(&c->need_shrink);
@@ -2507,14 +2507,15 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 	INIT_WORK(&c->shrink_work, shrink_work);
 	atomic_long_set(&c->need_shrink, 0);
 
-	c->shrinker.count_objects = dm_bufio_shrink_count;
-	c->shrinker.scan_objects = dm_bufio_shrink_scan;
-	c->shrinker.seeks = 1;
-	c->shrinker.batch = 0;
-	r = register_shrinker(&c->shrinker, "dm-bufio:(%u:%u)",
+	c->shrinker = shrinker_alloc_and_init(dm_bufio_shrink_count,
+					      dm_bufio_shrink_scan, 0, 1, 0, c);
+	if (!c->shrinker)
+		goto bad;
+
+	r = register_shrinker(c->shrinker, "dm-bufio:(%u:%u)",
 			      MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
 	if (r)
-		goto bad;
+		goto bad_shrinker;
 
 	mutex_lock(&dm_bufio_clients_lock);
 	dm_bufio_client_count++;
@@ -2524,6 +2525,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 
 	return c;
 
+bad_shrinker:
+	shrinker_free(c->shrinker);
 bad:
 	while (!list_empty(&c->reserved_buffers)) {
 		struct dm_buffer *b = list_to_buffer(c->reserved_buffers.next);
@@ -2554,7 +2557,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
 
 	drop_buffers(c);
 
-	unregister_shrinker(&c->shrinker);
+	unregister_and_free_shrinker(c->shrinker);
 	flush_work(&c->shrink_work);
 
 	mutex_lock(&dm_bufio_clients_lock);
-- 
2.30.2



* [PATCH 07/29] dm zoned: dynamically allocate the dm-zoned-meta shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (5 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 06/29] dm: dynamically allocate the dm-bufio shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 08/29] md/raid5: dynamically allocate the md-raid5 shrinker Qi Zheng
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the dm-zoned-meta shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct dmz_metadata no longer needs to wait for RCU
read-side critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/md/dm-zoned-metadata.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
index 9d3cca8e3dc9..41b10ffb968a 100644
--- a/drivers/md/dm-zoned-metadata.c
+++ b/drivers/md/dm-zoned-metadata.c
@@ -187,7 +187,7 @@ struct dmz_metadata {
 	struct rb_root		mblk_rbtree;
 	struct list_head	mblk_lru_list;
 	struct list_head	mblk_dirty_list;
-	struct shrinker		mblk_shrinker;
+	struct shrinker		*mblk_shrinker;
 
 	/* Zone allocation management */
 	struct mutex		map_lock;
@@ -615,7 +615,7 @@ static unsigned long dmz_shrink_mblock_cache(struct dmz_metadata *zmd,
 static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink,
 					       struct shrink_control *sc)
 {
-	struct dmz_metadata *zmd = container_of(shrink, struct dmz_metadata, mblk_shrinker);
+	struct dmz_metadata *zmd = shrink->private_data;
 
 	return atomic_read(&zmd->nr_mblks);
 }
@@ -626,7 +626,7 @@ static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink,
 static unsigned long dmz_mblock_shrinker_scan(struct shrinker *shrink,
 					      struct shrink_control *sc)
 {
-	struct dmz_metadata *zmd = container_of(shrink, struct dmz_metadata, mblk_shrinker);
+	struct dmz_metadata *zmd = shrink->private_data;
 	unsigned long count;
 
 	spin_lock(&zmd->mblk_lock);
@@ -2936,17 +2936,22 @@ int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev,
 	 */
 	zmd->min_nr_mblks = 2 + zmd->nr_map_blocks + zmd->zone_nr_bitmap_blocks * 16;
 	zmd->max_nr_mblks = zmd->min_nr_mblks + 512;
-	zmd->mblk_shrinker.count_objects = dmz_mblock_shrinker_count;
-	zmd->mblk_shrinker.scan_objects = dmz_mblock_shrinker_scan;
-	zmd->mblk_shrinker.seeks = DEFAULT_SEEKS;
+
+	zmd->mblk_shrinker = shrinker_alloc_and_init(dmz_mblock_shrinker_count,
+						     dmz_mblock_shrinker_scan,
+						     0, DEFAULT_SEEKS, 0, zmd);
+	if (!zmd->mblk_shrinker) {
+		dmz_zmd_err(zmd, "allocate metadata cache shrinker failed");
+		goto err;
+	}
 
 	/* Metadata cache shrinker */
-	ret = register_shrinker(&zmd->mblk_shrinker, "dm-zoned-meta:(%u:%u)",
+	ret = register_shrinker(zmd->mblk_shrinker, "dm-zoned-meta:(%u:%u)",
 				MAJOR(dev->bdev->bd_dev),
 				MINOR(dev->bdev->bd_dev));
 	if (ret) {
 		dmz_zmd_err(zmd, "Register metadata cache shrinker failed");
-		goto err;
+		goto err_shrinker;
 	}
 
 	dmz_zmd_info(zmd, "DM-Zoned metadata version %d", zmd->sb_version);
@@ -2982,6 +2987,8 @@ int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev,
 	*metadata = zmd;
 
 	return 0;
+err_shrinker:
+	shrinker_free(zmd->mblk_shrinker);
 err:
 	dmz_cleanup_metadata(zmd);
 	kfree(zmd);
@@ -2995,7 +3002,7 @@ int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev,
  */
 void dmz_dtr_metadata(struct dmz_metadata *zmd)
 {
-	unregister_shrinker(&zmd->mblk_shrinker);
+	unregister_and_free_shrinker(zmd->mblk_shrinker);
 	dmz_cleanup_metadata(zmd);
 	kfree(zmd);
 }
-- 
2.30.2



* [PATCH 08/29] md/raid5: dynamically allocate the md-raid5 shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (6 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 07/29] dm zoned: dynamically allocate the dm-zoned-meta shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 09/29] bcache: dynamically allocate the md-bcache shrinker Qi Zheng
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the md-raid5 shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct r5conf no longer needs to wait for RCU read-side
critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/md/raid5.c | 28 +++++++++++++++++-----------
 drivers/md/raid5.h |  2 +-
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f4eea1bbbeaf..4866cad1ad62 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7391,7 +7391,7 @@ static void free_conf(struct r5conf *conf)
 
 	log_exit(conf);
 
-	unregister_shrinker(&conf->shrinker);
+	unregister_and_free_shrinker(conf->shrinker);
 	free_thread_groups(conf);
 	shrink_stripes(conf);
 	raid5_free_percpu(conf);
@@ -7439,7 +7439,7 @@ static int raid5_alloc_percpu(struct r5conf *conf)
 static unsigned long raid5_cache_scan(struct shrinker *shrink,
 				      struct shrink_control *sc)
 {
-	struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
+	struct r5conf *conf = shrink->private_data;
 	unsigned long ret = SHRINK_STOP;
 
 	if (mutex_trylock(&conf->cache_size_mutex)) {
@@ -7460,7 +7460,7 @@ static unsigned long raid5_cache_scan(struct shrinker *shrink,
 static unsigned long raid5_cache_count(struct shrinker *shrink,
 				       struct shrink_control *sc)
 {
-	struct r5conf *conf = container_of(shrink, struct r5conf, shrinker);
+	struct r5conf *conf = shrink->private_data;
 
 	if (conf->max_nr_stripes < conf->min_nr_stripes)
 		/* unlikely, but not impossible */
@@ -7695,16 +7695,21 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	 * it reduces the queue depth and so can hurt throughput.
 	 * So set it rather large, scaled by number of devices.
 	 */
-	conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4;
-	conf->shrinker.scan_objects = raid5_cache_scan;
-	conf->shrinker.count_objects = raid5_cache_count;
-	conf->shrinker.batch = 128;
-	conf->shrinker.flags = 0;
-	ret = register_shrinker(&conf->shrinker, "md-raid5:%s", mdname(mddev));
+	conf->shrinker = shrinker_alloc_and_init(raid5_cache_count,
+						 raid5_cache_scan, 128,
+						 DEFAULT_SEEKS * conf->raid_disks * 4,
+						 0, conf);
+	if (!conf->shrinker) {
+		pr_warn("md/raid:%s: couldn't allocate shrinker.\n",
+			mdname(mddev));
+		goto abort;
+	}
+
+	ret = register_shrinker(conf->shrinker, "md-raid5:%s", mdname(mddev));
 	if (ret) {
 		pr_warn("md/raid:%s: couldn't register shrinker.\n",
 			mdname(mddev));
-		goto abort;
+		goto abort_shrinker;
 	}
 
 	sprintf(pers_name, "raid%d", mddev->new_level);
@@ -7717,7 +7722,8 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	}
 
 	return conf;
-
+abort_shrinker:
+	shrinker_free(conf->shrinker);
  abort:
 	if (conf)
 		free_conf(conf);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6a92fafb0748..806f84681599 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -670,7 +670,7 @@ struct r5conf {
 	wait_queue_head_t	wait_for_stripe;
 	wait_queue_head_t	wait_for_overlap;
 	unsigned long		cache_state;
-	struct shrinker		shrinker;
+	struct shrinker		*shrinker;
 	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
 	struct disk_info	*disks;
-- 
2.30.2



* [PATCH 09/29] bcache: dynamically allocate the md-bcache shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (7 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 08/29] md/raid5: dynamically allocate the md-raid5 shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 10/29] vmw_balloon: dynamically allocate the vmw-balloon shrinker Qi Zheng
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the md-bcache shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct cache_set no longer needs to wait for RCU read-side
critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/btree.c  | 23 ++++++++++++++---------
 drivers/md/bcache/sysfs.c  |  2 +-
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 700dc5588d5f..53c73b372e7a 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -541,7 +541,7 @@ struct cache_set {
 	struct bio_set		bio_split;
 
 	/* For the btree cache */
-	struct shrinker		shrink;
+	struct shrinker		*shrink;
 
 	/* For the btree cache and anything allocation related */
 	struct mutex		bucket_lock;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 569f48958bde..1131ae91f62a 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -667,7 +667,7 @@ static int mca_reap(struct btree *b, unsigned int min_order, bool flush)
 static unsigned long bch_mca_scan(struct shrinker *shrink,
 				  struct shrink_control *sc)
 {
-	struct cache_set *c = container_of(shrink, struct cache_set, shrink);
+	struct cache_set *c = shrink->private_data;
 	struct btree *b, *t;
 	unsigned long i, nr = sc->nr_to_scan;
 	unsigned long freed = 0;
@@ -734,7 +734,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
 static unsigned long bch_mca_count(struct shrinker *shrink,
 				   struct shrink_control *sc)
 {
-	struct cache_set *c = container_of(shrink, struct cache_set, shrink);
+	struct cache_set *c = shrink->private_data;
 
 	if (c->shrinker_disabled)
 		return 0;
@@ -752,8 +752,8 @@ void bch_btree_cache_free(struct cache_set *c)
 
 	closure_init_stack(&cl);
 
-	if (c->shrink.list.next)
-		unregister_shrinker(&c->shrink);
+	if (c->shrink->list.next)
+		unregister_and_free_shrinker(c->shrink);
 
 	mutex_lock(&c->bucket_lock);
 
@@ -828,14 +828,19 @@ int bch_btree_cache_alloc(struct cache_set *c)
 		c->verify_data = NULL;
 #endif
 
-	c->shrink.count_objects = bch_mca_count;
-	c->shrink.scan_objects = bch_mca_scan;
-	c->shrink.seeks = 4;
-	c->shrink.batch = c->btree_pages * 2;
+	c->shrink = shrinker_alloc_and_init(bch_mca_count, bch_mca_scan,
+					    c->btree_pages * 2, 4, 0, c);
+	if (!c->shrink) {
+		pr_warn("bcache: %s: could not allocate shrinker\n",
+				__func__);
+		return -ENOMEM;
+	}
 
-	if (register_shrinker(&c->shrink, "md-bcache:%pU", c->set_uuid))
+	if (register_shrinker(c->shrink, "md-bcache:%pU", c->set_uuid)) {
 		pr_warn("bcache: %s: could not register shrinker\n",
 				__func__);
+		shrinker_free(c->shrink);
+	}
 
 	return 0;
 }
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index c6f677059214..771577581f52 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -866,7 +866,7 @@ STORE(__bch_cache_set)
 
 		sc.gfp_mask = GFP_KERNEL;
 		sc.nr_to_scan = strtoul_or_return(buf);
-		c->shrink.scan_objects(&c->shrink, &sc);
+		c->shrink->scan_objects(c->shrink, &sc);
 	}
 
 	sysfs_strtoul_clamp(congested_read_threshold_us,
-- 
2.30.2



* [PATCH 10/29] vmw_balloon: dynamically allocate the vmw-balloon shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (8 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 09/29] bcache: dynamically allocate the md-bcache shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 11/29] virtio_balloon: dynamically allocate the virtio-balloon shrinker Qi Zheng
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the vmw-balloon shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct vmballoon no longer needs to wait for RCU read-side
critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/misc/vmw_balloon.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 9ce9b9e0e9b6..2f86f666b476 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -380,7 +380,7 @@ struct vmballoon {
 	/**
 	 * @shrinker: shrinker interface that is used to avoid over-inflation.
 	 */
-	struct shrinker shrinker;
+	struct shrinker *shrinker;
 
 	/**
 	 * @shrinker_registered: whether the shrinker was registered.
@@ -1569,7 +1569,7 @@ static unsigned long vmballoon_shrinker_count(struct shrinker *shrinker,
 static void vmballoon_unregister_shrinker(struct vmballoon *b)
 {
 	if (b->shrinker_registered)
-		unregister_shrinker(&b->shrinker);
+		unregister_and_free_shrinker(b->shrinker);
 	b->shrinker_registered = false;
 }
 
@@ -1581,14 +1581,18 @@ static int vmballoon_register_shrinker(struct vmballoon *b)
 	if (!vmwballoon_shrinker_enable)
 		return 0;
 
-	b->shrinker.scan_objects = vmballoon_shrinker_scan;
-	b->shrinker.count_objects = vmballoon_shrinker_count;
-	b->shrinker.seeks = DEFAULT_SEEKS;
+	b->shrinker = shrinker_alloc_and_init(vmballoon_shrinker_count,
+					      vmballoon_shrinker_scan,
+					      0, DEFAULT_SEEKS, 0, b);
+	if (!b->shrinker)
+		return -ENOMEM;
 
-	r = register_shrinker(&b->shrinker, "vmw-balloon");
+	r = register_shrinker(b->shrinker, "vmw-balloon");
 
 	if (r == 0)
 		b->shrinker_registered = true;
+	else
+		shrinker_free(b->shrinker);
 
 	return r;
 }
-- 
2.30.2



* [PATCH 11/29] virtio_balloon: dynamically allocate the virtio-balloon shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (9 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 10/29] vmw_balloon: dynamically allocate the vmw-balloon shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 12/29] mbcache: dynamically allocate the mbcache shrinker Qi Zheng
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the virtio-balloon shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct virtio_balloon no longer needs to wait for RCU
read-side critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/virtio/virtio_balloon.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5b15936a5214..fa051bff8d90 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -111,7 +111,7 @@ struct virtio_balloon {
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 
 	/* Shrinker to return free pages - VIRTIO_BALLOON_F_FREE_PAGE_HINT */
-	struct shrinker shrinker;
+	struct shrinker *shrinker;
 
 	/* OOM notifier to deflate on OOM - VIRTIO_BALLOON_F_DEFLATE_ON_OOM */
 	struct notifier_block oom_nb;
@@ -816,8 +816,7 @@ static unsigned long shrink_free_pages(struct virtio_balloon *vb,
 static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
 						  struct shrink_control *sc)
 {
-	struct virtio_balloon *vb = container_of(shrinker,
-					struct virtio_balloon, shrinker);
+	struct virtio_balloon *vb = shrinker->private_data;
 
 	return shrink_free_pages(vb, sc->nr_to_scan);
 }
@@ -825,8 +824,7 @@ static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker,
 static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker,
 						   struct shrink_control *sc)
 {
-	struct virtio_balloon *vb = container_of(shrinker,
-					struct virtio_balloon, shrinker);
+	struct virtio_balloon *vb = shrinker->private_data;
 
 	return vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES;
 }
@@ -847,16 +845,24 @@ static int virtio_balloon_oom_notify(struct notifier_block *nb,
 
 static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb)
 {
-	unregister_shrinker(&vb->shrinker);
+	unregister_and_free_shrinker(vb->shrinker);
 }
 
 static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
 {
-	vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
-	vb->shrinker.count_objects = virtio_balloon_shrinker_count;
-	vb->shrinker.seeks = DEFAULT_SEEKS;
+	int ret;
+
+	vb->shrinker = shrinker_alloc_and_init(virtio_balloon_shrinker_count,
+					       virtio_balloon_shrinker_scan,
+					       0, DEFAULT_SEEKS, 0, vb);
+	if (!vb->shrinker)
+		return -ENOMEM;
+
+	ret = register_shrinker(vb->shrinker, "virtio-balloon");
+	if (ret)
+		shrinker_free(vb->shrinker);
 
-	return register_shrinker(&vb->shrinker, "virtio-balloon");
+	return ret;
 }
 
 static int virtballoon_probe(struct virtio_device *vdev)
-- 
2.30.2



* [PATCH 12/29] mbcache: dynamically allocate the mbcache shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (10 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 11/29] virtio_balloon: dynamically allocate the virtio-balloon shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 13/29] ext4: dynamically allocate the ext4-es shrinker Qi Zheng
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need
to dynamically allocate the mbcache shrinker, so that it
can be freed asynchronously using kfree_rcu(). Then releasing
the struct mb_cache no longer needs to wait for RCU read-side
critical sections.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/mbcache.c | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 2a4b8b549e93..fec393e55a66 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -37,7 +37,7 @@ struct mb_cache {
 	struct list_head	c_list;
 	/* Number of entries in cache */
 	unsigned long		c_entry_count;
-	struct shrinker		c_shrink;
+	struct shrinker		*c_shrink;
 	/* Work for shrinking when the cache has too many entries */
 	struct work_struct	c_shrink_work;
 };
@@ -293,8 +293,7 @@ EXPORT_SYMBOL(mb_cache_entry_touch);
 static unsigned long mb_cache_count(struct shrinker *shrink,
 				    struct shrink_control *sc)
 {
-	struct mb_cache *cache = container_of(shrink, struct mb_cache,
-					      c_shrink);
+	struct mb_cache *cache = shrink->private_data;
 
 	return cache->c_entry_count;
 }
@@ -333,8 +332,8 @@ static unsigned long mb_cache_shrink(struct mb_cache *cache,
 static unsigned long mb_cache_scan(struct shrinker *shrink,
 				   struct shrink_control *sc)
 {
-	struct mb_cache *cache = container_of(shrink, struct mb_cache,
-					      c_shrink);
+	struct mb_cache *cache = shrink->private_data;
+
 	return mb_cache_shrink(cache, sc->nr_to_scan);
 }
 
@@ -370,26 +369,30 @@ struct mb_cache *mb_cache_create(int bucket_bits)
 	cache->c_hash = kmalloc_array(bucket_count,
 				      sizeof(struct hlist_bl_head),
 				      GFP_KERNEL);
-	if (!cache->c_hash) {
-		kfree(cache);
-		goto err_out;
-	}
+	if (!cache->c_hash)
+		goto err_c_hash;
+
 	for (i = 0; i < bucket_count; i++)
 		INIT_HLIST_BL_HEAD(&cache->c_hash[i]);
 
-	cache->c_shrink.count_objects = mb_cache_count;
-	cache->c_shrink.scan_objects = mb_cache_scan;
-	cache->c_shrink.seeks = DEFAULT_SEEKS;
-	if (register_shrinker(&cache->c_shrink, "mbcache-shrinker")) {
-		kfree(cache->c_hash);
-		kfree(cache);
-		goto err_out;
-	}
+	cache->c_shrink = shrinker_alloc_and_init(mb_cache_count, mb_cache_scan,
+						  0, DEFAULT_SEEKS, 0, cache);
+	if (!cache->c_shrink)
+		goto err_shrinker;
+
+	if (register_shrinker(cache->c_shrink, "mbcache-shrinker"))
+		goto err_register;
 
 	INIT_WORK(&cache->c_shrink_work, mb_cache_shrink_worker);
 
 	return cache;
 
+err_register:
+	shrinker_free(cache->c_shrink);
+err_shrinker:
+	kfree(cache->c_hash);
+err_c_hash:
+	kfree(cache);
 err_out:
 	return NULL;
 }
@@ -406,7 +409,7 @@ void mb_cache_destroy(struct mb_cache *cache)
 {
 	struct mb_cache_entry *entry, *next;
 
-	unregister_shrinker(&cache->c_shrink);
+	unregister_and_free_shrinker(cache->c_shrink);
 
 	/*
 	 * We don't bother with any locking. Cache must not be used at this
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 13/29] ext4: dynamically allocate the ext4-es shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (11 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 12/29] mbcache: dynamically allocate the mbcache shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 14/29] jbd2,ext4: dynamically allocate the jbd2-journal shrinker Qi Zheng
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the ext4-es shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
ext4_sb_info does not have to wait for the RCU read-side critical
section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/ext4/ext4.h           |  2 +-
 fs/ext4/extents_status.c | 21 ++++++++++++---------
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0a2d55faa095..1bd150d454f5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1651,7 +1651,7 @@ struct ext4_sb_info {
 	__u32 s_csum_seed;
 
 	/* Reclaim extents from extent status tree */
-	struct shrinker s_es_shrinker;
+	struct shrinker *s_es_shrinker;
 	struct list_head s_es_list;	/* List of inodes with reclaimable extents */
 	long s_es_nr_inode;
 	struct ext4_es_stats s_es_stats;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 9b5b8951afb4..fea82339f4b4 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1596,7 +1596,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
 	unsigned long nr;
 	struct ext4_sb_info *sbi;
 
-	sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
+	sbi = shrink->private_data;
 	nr = percpu_counter_read_positive(&sbi->s_es_stats.es_stats_shk_cnt);
 	trace_ext4_es_shrink_count(sbi->s_sb, sc->nr_to_scan, nr);
 	return nr;
@@ -1605,8 +1605,7 @@ static unsigned long ext4_es_count(struct shrinker *shrink,
 static unsigned long ext4_es_scan(struct shrinker *shrink,
 				  struct shrink_control *sc)
 {
-	struct ext4_sb_info *sbi = container_of(shrink,
-					struct ext4_sb_info, s_es_shrinker);
+	struct ext4_sb_info *sbi = shrink->private_data;
 	int nr_to_scan = sc->nr_to_scan;
 	int ret, nr_shrunk;
 
@@ -1690,15 +1689,19 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi)
 	if (err)
 		goto err3;
 
-	sbi->s_es_shrinker.scan_objects = ext4_es_scan;
-	sbi->s_es_shrinker.count_objects = ext4_es_count;
-	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
-	err = register_shrinker(&sbi->s_es_shrinker, "ext4-es:%s",
+	sbi->s_es_shrinker = shrinker_alloc_and_init(ext4_es_count, ext4_es_scan,
+						     0, DEFAULT_SEEKS, 0, sbi);
+	if (!sbi->s_es_shrinker)
+		goto err4;
+
+	err = register_shrinker(sbi->s_es_shrinker, "ext4-es:%s",
 				sbi->s_sb->s_id);
 	if (err)
-		goto err4;
+		goto err5;
 
 	return 0;
+err5:
+	shrinker_free(sbi->s_es_shrinker);
 err4:
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
 err3:
@@ -1716,7 +1719,7 @@ void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_cache_misses);
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_all_cnt);
 	percpu_counter_destroy(&sbi->s_es_stats.es_stats_shk_cnt);
-	unregister_shrinker(&sbi->s_es_shrinker);
+	unregister_and_free_shrinker(sbi->s_es_shrinker);
 }
 
 /*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 14/29] jbd2,ext4: dynamically allocate the jbd2-journal shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (12 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 13/29] ext4: dynamically allocate the ext4-es shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker Qi Zheng
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the jbd2-journal shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
journal_s does not have to wait for the RCU read-side critical section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/jbd2/journal.c    | 32 +++++++++++++++++++-------------
 include/linux/jbd2.h |  2 +-
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index eee3c0ae349a..92a2f4360b5f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1301,7 +1301,7 @@ static int jbd2_min_tag_size(void)
 static unsigned long jbd2_journal_shrink_scan(struct shrinker *shrink,
 					      struct shrink_control *sc)
 {
-	journal_t *journal = container_of(shrink, journal_t, j_shrinker);
+	journal_t *journal = shrink->private_data;
 	unsigned long nr_to_scan = sc->nr_to_scan;
 	unsigned long nr_shrunk;
 	unsigned long count;
@@ -1327,7 +1327,7 @@ static unsigned long jbd2_journal_shrink_scan(struct shrinker *shrink,
 static unsigned long jbd2_journal_shrink_count(struct shrinker *shrink,
 					       struct shrink_control *sc)
 {
-	journal_t *journal = container_of(shrink, journal_t, j_shrinker);
+	journal_t *journal = shrink->private_data;
 	unsigned long count;
 
 	count = percpu_counter_read_positive(&journal->j_checkpoint_jh_count);
@@ -1415,21 +1415,27 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	journal->j_superblock = (journal_superblock_t *)bh->b_data;
 
 	journal->j_shrink_transaction = NULL;
-	journal->j_shrinker.scan_objects = jbd2_journal_shrink_scan;
-	journal->j_shrinker.count_objects = jbd2_journal_shrink_count;
-	journal->j_shrinker.seeks = DEFAULT_SEEKS;
-	journal->j_shrinker.batch = journal->j_max_transaction_buffers;
 
 	if (percpu_counter_init(&journal->j_checkpoint_jh_count, 0, GFP_KERNEL))
 		goto err_cleanup;
 
-	if (register_shrinker(&journal->j_shrinker, "jbd2-journal:(%u:%u)",
-			      MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev))) {
-		percpu_counter_destroy(&journal->j_checkpoint_jh_count);
-		goto err_cleanup;
-	}
+	journal->j_shrinker = shrinker_alloc_and_init(jbd2_journal_shrink_count,
+						      jbd2_journal_shrink_scan,
+						      journal->j_max_transaction_buffers,
+						      DEFAULT_SEEKS, 0, journal);
+	if (!journal->j_shrinker)
+		goto err_shrinker;
+
+	if (register_shrinker(journal->j_shrinker, "jbd2-journal:(%u:%u)",
+			      MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev)))
+		goto err_register;
+
 	return journal;
 
+err_register:
+	shrinker_free(journal->j_shrinker);
+err_shrinker:
+	percpu_counter_destroy(&journal->j_checkpoint_jh_count);
 err_cleanup:
 	brelse(journal->j_sb_buffer);
 	kfree(journal->j_wbuf);
@@ -2190,9 +2196,9 @@ int jbd2_journal_destroy(journal_t *journal)
 		brelse(journal->j_sb_buffer);
 	}
 
-	if (journal->j_shrinker.flags & SHRINKER_REGISTERED) {
+	if (journal->j_shrinker->flags & SHRINKER_REGISTERED) {
 		percpu_counter_destroy(&journal->j_checkpoint_jh_count);
-		unregister_shrinker(&journal->j_shrinker);
+		unregister_and_free_shrinker(journal->j_shrinker);
 	}
 	if (journal->j_proc_entry)
 		jbd2_stats_proc_exit(journal);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 44c298aa58d4..beb4c4586320 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -891,7 +891,7 @@ struct journal_s
 	 * Journal head shrinker, reclaim buffer's journal head which
 	 * has been written back.
 	 */
-	struct shrinker		j_shrinker;
+	struct shrinker		*j_shrinker;
 
 	/**
 	 * @j_checkpoint_jh_count:
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (13 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 14/29] jbd2,ext4: dynamically allocate the jbd2-journal shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-23 21:49   ` Chuck Lever
  2023-06-22  8:53 ` [PATCH 16/29] NFSD: dynamically allocate the nfsd-reply shrinker Qi Zheng
                   ` (14 subsequent siblings)
  29 siblings, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the nfsd-client shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
nfsd_net does not have to wait for the RCU read-side critical section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/nfsd/netns.h     |  2 +-
 fs/nfsd/nfs4state.c | 20 ++++++++++++--------
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index ec49b200b797..f669444d5336 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -195,7 +195,7 @@ struct nfsd_net {
 	int			nfs4_max_clients;
 
 	atomic_t		nfsd_courtesy_clients;
-	struct shrinker		nfsd_client_shrinker;
+	struct shrinker		*nfsd_client_shrinker;
 	struct work_struct	nfsd_shrinker_work;
 };
 
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 6e61fa3acaf1..a06184270548 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4388,8 +4388,7 @@ static unsigned long
 nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
 {
 	int count;
-	struct nfsd_net *nn = container_of(shrink,
-			struct nfsd_net, nfsd_client_shrinker);
+	struct nfsd_net *nn = shrink->private_data;
 
 	count = atomic_read(&nn->nfsd_courtesy_clients);
 	if (!count)
@@ -8094,14 +8093,19 @@ static int nfs4_state_create_net(struct net *net)
 	INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
 	get_net(net);
 
-	nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
-	nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
-	nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
-
-	if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
+	nn->nfsd_client_shrinker = shrinker_alloc_and_init(nfsd4_state_shrinker_count,
+							   nfsd4_state_shrinker_scan,
+							   0, DEFAULT_SEEKS, 0,
+							   nn);
+	if (!nn->nfsd_client_shrinker)
 		goto err_shrinker;
+
+	if (register_shrinker(nn->nfsd_client_shrinker, "nfsd-client"))
+		goto err_register;
 	return 0;
 
+err_register:
+	shrinker_free(nn->nfsd_client_shrinker);
 err_shrinker:
 	put_net(net);
 	kfree(nn->sessionid_hashtbl);
@@ -8197,7 +8201,7 @@ nfs4_state_shutdown_net(struct net *net)
 	struct list_head *pos, *next, reaplist;
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
 
-	unregister_shrinker(&nn->nfsd_client_shrinker);
+	unregister_and_free_shrinker(nn->nfsd_client_shrinker);
 	cancel_work(&nn->nfsd_shrinker_work);
 	cancel_delayed_work_sync(&nn->laundromat_work);
 	locks_end_grace(&nn->nfsd4_manager);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 16/29] NFSD: dynamically allocate the nfsd-reply shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (14 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 17/29] xfs: dynamically allocate the xfs-buf shrinker Qi Zheng
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the nfsd-reply shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
nfsd_net does not have to wait for the RCU read-side critical section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/nfsd/netns.h    |  2 +-
 fs/nfsd/nfscache.c | 33 +++++++++++++++++++--------------
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index f669444d5336..ab303a8b77d5 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -177,7 +177,7 @@ struct nfsd_net {
 	/* size of cache when we saw the longest hash chain */
 	unsigned int             longest_chain_cachesize;
 
-	struct shrinker		nfsd_reply_cache_shrinker;
+	struct shrinker		*nfsd_reply_cache_shrinker;
 
 	/* tracking server-to-server copy mounts */
 	spinlock_t              nfsd_ssc_lock;
diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
index 041faa13b852..ec33de8e418b 100644
--- a/fs/nfsd/nfscache.c
+++ b/fs/nfsd/nfscache.c
@@ -173,19 +173,23 @@ int nfsd_reply_cache_init(struct nfsd_net *nn)
 	if (status)
 		goto out_nomem;
 
-	nn->nfsd_reply_cache_shrinker.scan_objects = nfsd_reply_cache_scan;
-	nn->nfsd_reply_cache_shrinker.count_objects = nfsd_reply_cache_count;
-	nn->nfsd_reply_cache_shrinker.seeks = 1;
-	status = register_shrinker(&nn->nfsd_reply_cache_shrinker,
-				   "nfsd-reply:%s", nn->nfsd_name);
-	if (status)
-		goto out_stats_destroy;
-
 	nn->drc_hashtbl = kvzalloc(array_size(hashsize,
 				sizeof(*nn->drc_hashtbl)), GFP_KERNEL);
 	if (!nn->drc_hashtbl)
+		goto out_stats_destroy;
+
+	nn->nfsd_reply_cache_shrinker =
+		shrinker_alloc_and_init(nfsd_reply_cache_count,
+					nfsd_reply_cache_scan,
+					0, 1, 0, nn);
+	if (!nn->nfsd_reply_cache_shrinker)
 		goto out_shrinker;
 
+	status = register_shrinker(nn->nfsd_reply_cache_shrinker,
+				   "nfsd-reply:%s", nn->nfsd_name);
+	if (status)
+		goto out_register;
+
 	for (i = 0; i < hashsize; i++) {
 		INIT_LIST_HEAD(&nn->drc_hashtbl[i].lru_head);
 		spin_lock_init(&nn->drc_hashtbl[i].cache_lock);
@@ -193,8 +197,11 @@ int nfsd_reply_cache_init(struct nfsd_net *nn)
 	nn->drc_hashsize = hashsize;
 
 	return 0;
+
+out_register:
+	shrinker_free(nn->nfsd_reply_cache_shrinker);
 out_shrinker:
-	unregister_shrinker(&nn->nfsd_reply_cache_shrinker);
+	kvfree(nn->drc_hashtbl);
 out_stats_destroy:
 	nfsd_reply_cache_stats_destroy(nn);
 out_nomem:
@@ -207,7 +214,7 @@ void nfsd_reply_cache_shutdown(struct nfsd_net *nn)
 	struct svc_cacherep	*rp;
 	unsigned int i;
 
-	unregister_shrinker(&nn->nfsd_reply_cache_shrinker);
+	unregister_and_free_shrinker(nn->nfsd_reply_cache_shrinker);
 
 	for (i = 0; i < nn->drc_hashsize; i++) {
 		struct list_head *head = &nn->drc_hashtbl[i].lru_head;
@@ -297,8 +304,7 @@ prune_cache_entries(struct nfsd_net *nn)
 static unsigned long
 nfsd_reply_cache_count(struct shrinker *shrink, struct shrink_control *sc)
 {
-	struct nfsd_net *nn = container_of(shrink,
-				struct nfsd_net, nfsd_reply_cache_shrinker);
+	struct nfsd_net *nn = shrink->private_data;
 
 	return atomic_read(&nn->num_drc_entries);
 }
@@ -306,8 +312,7 @@ nfsd_reply_cache_count(struct shrinker *shrink, struct shrink_control *sc)
 static unsigned long
 nfsd_reply_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	struct nfsd_net *nn = container_of(shrink,
-				struct nfsd_net, nfsd_reply_cache_shrinker);
+	struct nfsd_net *nn = shrink->private_data;
 
 	return prune_cache_entries(nn);
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 17/29] xfs: dynamically allocate the xfs-buf shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (15 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 16/29] NFSD: dynamically allocate the nfsd-reply shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 18/29] xfs: dynamically allocate the xfs-inodegc shrinker Qi Zheng
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the xfs-buf shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
xfs_buftarg does not have to wait for the RCU read-side critical
section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/xfs/xfs_buf.c | 25 ++++++++++++++-----------
 fs/xfs/xfs_buf.h |  2 +-
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 15d1e5a7c2d3..6657d285d26f 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1906,8 +1906,7 @@ xfs_buftarg_shrink_scan(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_buftarg	*btp = container_of(shrink,
-					struct xfs_buftarg, bt_shrinker);
+	struct xfs_buftarg	*btp = shrink->private_data;
 	LIST_HEAD(dispose);
 	unsigned long		freed;
 
@@ -1929,8 +1928,7 @@ xfs_buftarg_shrink_count(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_buftarg	*btp = container_of(shrink,
-					struct xfs_buftarg, bt_shrinker);
+	struct xfs_buftarg	*btp = shrink->private_data;
 	return list_lru_shrink_count(&btp->bt_lru, sc);
 }
 
@@ -1938,7 +1936,7 @@ void
 xfs_free_buftarg(
 	struct xfs_buftarg	*btp)
 {
-	unregister_shrinker(&btp->bt_shrinker);
+	unregister_and_free_shrinker(btp->bt_shrinker);
 	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
 	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
@@ -2021,15 +2019,20 @@ xfs_alloc_buftarg(
 	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
 		goto error_lru;
 
-	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
-	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
-	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
-	btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE;
-	if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s",
-			      mp->m_super->s_id))
+	btp->bt_shrinker = shrinker_alloc_and_init(xfs_buftarg_shrink_count,
+						   xfs_buftarg_shrink_scan,
+						   0, DEFAULT_SEEKS,
+						   SHRINKER_NUMA_AWARE, btp);
+	if (!btp->bt_shrinker)
 		goto error_pcpu;
+
+	if (register_shrinker(btp->bt_shrinker, "xfs-buf:%s",
+			      mp->m_super->s_id))
+		goto error_shrinker;
 	return btp;
 
+error_shrinker:
+	shrinker_free(btp->bt_shrinker);
 error_pcpu:
 	percpu_counter_destroy(&btp->bt_io_count);
 error_lru:
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 549c60942208..4e6969a675f7 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -102,7 +102,7 @@ typedef struct xfs_buftarg {
 	size_t			bt_logical_sectormask;
 
 	/* LRU control structures */
-	struct shrinker		bt_shrinker;
+	struct shrinker		*bt_shrinker;
 	struct list_lru		bt_lru;
 
 	struct percpu_counter	bt_io_count;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 18/29] xfs: dynamically allocate the xfs-inodegc shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (16 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 17/29] xfs: dynamically allocate the xfs-buf shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 19/29] xfs: dynamically allocate the xfs-qm shrinker Qi Zheng
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the xfs-inodegc shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
xfs_mount does not have to wait for the RCU read-side critical section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/xfs/xfs_icache.c | 27 ++++++++++++++++-----------
 fs/xfs/xfs_mount.c  |  4 ++--
 fs/xfs/xfs_mount.h  |  2 +-
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 453890942d9f..1ef0c9fa57de 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -2225,8 +2225,7 @@ xfs_inodegc_shrinker_count(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_mount	*mp = container_of(shrink, struct xfs_mount,
-						   m_inodegc_shrinker);
+	struct xfs_mount	*mp = shrink->private_data;
 	struct xfs_inodegc	*gc;
 	int			cpu;
 
@@ -2247,8 +2246,7 @@ xfs_inodegc_shrinker_scan(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_mount	*mp = container_of(shrink, struct xfs_mount,
-						   m_inodegc_shrinker);
+	struct xfs_mount	*mp = shrink->private_data;
 	struct xfs_inodegc	*gc;
 	int			cpu;
 	bool			no_items = true;
@@ -2284,13 +2282,20 @@ int
 xfs_inodegc_register_shrinker(
 	struct xfs_mount	*mp)
 {
-	struct shrinker		*shrink = &mp->m_inodegc_shrinker;
+	int ret;
 
-	shrink->count_objects = xfs_inodegc_shrinker_count;
-	shrink->scan_objects = xfs_inodegc_shrinker_scan;
-	shrink->seeks = 0;
-	shrink->flags = SHRINKER_NONSLAB;
-	shrink->batch = XFS_INODEGC_SHRINKER_BATCH;
+	mp->m_inodegc_shrinker =
+		shrinker_alloc_and_init(xfs_inodegc_shrinker_count,
+					xfs_inodegc_shrinker_scan,
+					XFS_INODEGC_SHRINKER_BATCH,
+					0, SHRINKER_NONSLAB, mp);
+	if (!mp->m_inodegc_shrinker)
+		return -ENOMEM;
+
+	ret = register_shrinker(mp->m_inodegc_shrinker, "xfs-inodegc:%s",
+				mp->m_super->s_id);
+	if (ret)
+		shrinker_free(mp->m_inodegc_shrinker);
 
-	return register_shrinker(shrink, "xfs-inodegc:%s", mp->m_super->s_id);
+	return ret;
 }
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fb87ffb48f7f..67286859ad34 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1018,7 +1018,7 @@ xfs_mountfs(
  out_log_dealloc:
 	xfs_log_mount_cancel(mp);
  out_inodegc_shrinker:
-	unregister_shrinker(&mp->m_inodegc_shrinker);
+	unregister_and_free_shrinker(mp->m_inodegc_shrinker);
  out_fail_wait:
 	if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
 		xfs_buftarg_drain(mp->m_logdev_targp);
@@ -1100,7 +1100,7 @@ xfs_unmountfs(
 #if defined(DEBUG)
 	xfs_errortag_clearall(mp);
 #endif
-	unregister_shrinker(&mp->m_inodegc_shrinker);
+	unregister_and_free_shrinker(mp->m_inodegc_shrinker);
 	xfs_free_perag(mp);
 
 	xfs_errortag_del(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e2866e7fa60c..562c294ca08e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -217,7 +217,7 @@ typedef struct xfs_mount {
 	atomic_t		m_agirotor;	/* last ag dir inode alloced */
 
 	/* Memory shrinker to throttle and reprioritize inodegc */
-	struct shrinker		m_inodegc_shrinker;
+	struct shrinker		*m_inodegc_shrinker;
 	/*
 	 * Workqueue item so that we can coalesce multiple inode flush attempts
 	 * into a single flush.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 19/29] xfs: dynamically allocate the xfs-qm shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (17 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 18/29] xfs: dynamically allocate the xfs-inodegc shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 20/29] zsmalloc: dynamically allocate the mm-zspool shrinker Qi Zheng
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the xfs-qm shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
xfs_quotainfo does not have to wait for the RCU read-side critical
section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/xfs/xfs_qm.c | 24 +++++++++++++-----------
 fs/xfs/xfs_qm.h |  2 +-
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 6abcc34fafd8..b15320d469cc 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -504,8 +504,7 @@ xfs_qm_shrink_scan(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_quotainfo	*qi = container_of(shrink,
-					struct xfs_quotainfo, qi_shrinker);
+	struct xfs_quotainfo	*qi = shrink->private_data;
 	struct xfs_qm_isolate	isol;
 	unsigned long		freed;
 	int			error;
@@ -539,8 +538,7 @@ xfs_qm_shrink_count(
 	struct shrinker		*shrink,
 	struct shrink_control	*sc)
 {
-	struct xfs_quotainfo	*qi = container_of(shrink,
-					struct xfs_quotainfo, qi_shrinker);
+	struct xfs_quotainfo	*qi = shrink->private_data;
 
 	return list_lru_shrink_count(&qi->qi_lru, sc);
 }
@@ -680,18 +678,22 @@ xfs_qm_init_quotainfo(
 	if (XFS_IS_PQUOTA_ON(mp))
 		xfs_qm_set_defquota(mp, XFS_DQTYPE_PROJ, qinf);
 
-	qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
-	qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
-	qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
-	qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE;
+	qinf->qi_shrinker = shrinker_alloc_and_init(xfs_qm_shrink_count,
+						    xfs_qm_shrink_scan,
+						    0, DEFAULT_SEEKS,
+						    SHRINKER_NUMA_AWARE, qinf);
+	if (!qinf->qi_shrinker)
+		goto out_free_inos;
 
-	error = register_shrinker(&qinf->qi_shrinker, "xfs-qm:%s",
+	error = register_shrinker(qinf->qi_shrinker, "xfs-qm:%s",
 				  mp->m_super->s_id);
 	if (error)
-		goto out_free_inos;
+		goto out_shrinker;
 
 	return 0;
 
+out_shrinker:
+	shrinker_free(qinf->qi_shrinker);
 out_free_inos:
 	mutex_destroy(&qinf->qi_quotaofflock);
 	mutex_destroy(&qinf->qi_tree_lock);
@@ -718,7 +720,7 @@ xfs_qm_destroy_quotainfo(
 	qi = mp->m_quotainfo;
 	ASSERT(qi != NULL);
 
-	unregister_shrinker(&qi->qi_shrinker);
+	unregister_and_free_shrinker(qi->qi_shrinker);
 	list_lru_destroy(&qi->qi_lru);
 	xfs_qm_destroy_quotainos(qi);
 	mutex_destroy(&qi->qi_tree_lock);
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 9683f0457d19..d5c9fc4ba591 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -63,7 +63,7 @@ struct xfs_quotainfo {
 	struct xfs_def_quota	qi_usr_default;
 	struct xfs_def_quota	qi_grp_default;
 	struct xfs_def_quota	qi_prj_default;
-	struct shrinker		qi_shrinker;
+	struct shrinker		*qi_shrinker;
 
 	/* Minimum and maximum quota expiration timestamp values. */
 	time64_t		qi_expiry_min;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 20/29] zsmalloc: dynamically allocate the mm-zspool shrinker
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (18 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 19/29] xfs: dynamically allocate the xfs-qm shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 21/29] fs: super: dynamically allocate the s_shrink Qi Zheng
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the mm-zspool shrinker so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
zs_pool does not have to wait for the RCU read-side critical section.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/zsmalloc.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 3f057970504e..c03b34ae637e 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -229,7 +229,7 @@ struct zs_pool {
 	struct zs_pool_stats stats;
 
 	/* Compact classes */
-	struct shrinker shrinker;
+	struct shrinker *shrinker;
 
 #ifdef CONFIG_ZSMALLOC_STAT
 	struct dentry *stat_dentry;
@@ -2107,8 +2107,7 @@ static unsigned long zs_shrinker_scan(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
 	unsigned long pages_freed;
-	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
-			shrinker);
+	struct zs_pool *pool = shrinker->private_data;
 
 	/*
 	 * Compact classes and calculate compaction delta.
@@ -2126,8 +2125,7 @@ static unsigned long zs_shrinker_count(struct shrinker *shrinker,
 	int i;
 	struct size_class *class;
 	unsigned long pages_to_free = 0;
-	struct zs_pool *pool = container_of(shrinker, struct zs_pool,
-			shrinker);
+	struct zs_pool *pool = shrinker->private_data;
 
 	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
 		class = pool->size_class[i];
@@ -2142,18 +2140,24 @@ static unsigned long zs_shrinker_count(struct shrinker *shrinker,
 
 static void zs_unregister_shrinker(struct zs_pool *pool)
 {
-	unregister_shrinker(&pool->shrinker);
+	unregister_and_free_shrinker(pool->shrinker);
 }
 
 static int zs_register_shrinker(struct zs_pool *pool)
 {
-	pool->shrinker.scan_objects = zs_shrinker_scan;
-	pool->shrinker.count_objects = zs_shrinker_count;
-	pool->shrinker.batch = 0;
-	pool->shrinker.seeks = DEFAULT_SEEKS;
+	int ret;
+
+	pool->shrinker = shrinker_alloc_and_init(zs_shrinker_count,
+						 zs_shrinker_scan, 0,
+						 DEFAULT_SEEKS, 0, pool);
+	if (!pool->shrinker)
+		return -ENOMEM;
 
-	return register_shrinker(&pool->shrinker, "mm-zspool:%s",
-				 pool->name);
+	ret = register_shrinker(pool->shrinker, "mm-zspool:%s", pool->name);
+	if (ret)
+		shrinker_free(pool->shrinker);
+
+	return ret;
 }
 
 static int calculate_zspage_chain_size(int class_size)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 21/29] fs: super: dynamically allocate the s_shrink
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (19 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 20/29] zsmalloc: dynamically allocate the mm-zspool shrinker Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 22/29] drm/ttm: introduce pool_shrink_rwsem Qi Zheng
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

In preparation for implementing lockless slab shrink, we need to
dynamically allocate the s_shrink so that it can be freed
asynchronously via kfree_rcu(). This way, releasing the struct
super_block does not have to wait for the RCU read-side critical
section.
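
Unlike the earlier conversions in this series, the sb shrinker is only
preallocated here (the per-sb list_lrus are initialized against it
before the superblock is published) and registered later from
sget()/sget_fc(). A condensed, illustrative-only view of the setup
side, folded into a hypothetical helper (the real code stays inline in
alloc_super(), as the hunks below show):

```
/* Illustrative sketch; sb_shrinker_setup() does not exist in the patch. */
static int sb_shrinker_setup(struct super_block *s,
			     struct file_system_type *type)
{
	s->s_shrink = shrinker_alloc_and_init(super_cache_count,
					      super_cache_scan, 1024,
					      DEFAULT_SEEKS,
					      SHRINKER_NUMA_AWARE |
					      SHRINKER_MEMCG_AWARE, s);
	if (!s->s_shrink)
		return -ENOMEM;

	/* reserve shrinker state (e.g. the memcg shrinker id) without
	 * making the shrinker visible to reclaim yet */
	return prealloc_shrinker(s->s_shrink, "sb-%s", type->name);
}
```

Teardown then has two cases, both visible in the hunks below: a
superblock that never got registered frees the preallocated state via
free_prealloced_shrinker() plus shrinker_free() (guarded by a NULL
check, since the allocation itself may have failed), while a live
superblock is torn down with unregister_and_free_shrinker() from
deactivate_locked_super().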

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/btrfs/super.c   |  2 +-
 fs/kernfs/mount.c  |  2 +-
 fs/proc/root.c     |  2 +-
 fs/super.c         | 38 ++++++++++++++++++++++----------------
 include/linux/fs.h |  2 +-
 5 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f1dd172d8d5b..fad4ded26c80 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1513,7 +1513,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type,
 			error = -EBUSY;
 	} else {
 		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
-		shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s", fs_type->name,
+		shrinker_debugfs_rename(s->s_shrink, "sb-%s:%s", fs_type->name,
 					s->s_id);
 		btrfs_sb(s)->bdev_holder = fs_type;
 		error = btrfs_fill_super(s, fs_devices, data);
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d49606accb07..2657ff1181f1 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -256,7 +256,7 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
 	sb->s_time_gran = 1;
 
 	/* sysfs dentries and inodes don't require IO to create */
-	sb->s_shrink.seeks = 0;
+	sb->s_shrink->seeks = 0;
 
 	/* get root inode, initialize and unlock it */
 	down_read(&kf_root->kernfs_rwsem);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index a86e65a608da..22b78b28b477 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -188,7 +188,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
 
 	/* procfs dentries and inodes don't require IO to create */
-	s->s_shrink.seeks = 0;
+	s->s_shrink->seeks = 0;
 
 	pde_get(&proc_root);
 	root_inode = proc_get_inode(s, &proc_root);
diff --git a/fs/super.c b/fs/super.c
index 2e83c8cd435b..791342bb8ac9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -67,7 +67,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	long	dentries;
 	long	inodes;
 
-	sb = container_of(shrink, struct super_block, s_shrink);
+	sb = shrink->private_data;
 
 	/*
 	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
@@ -120,7 +120,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	struct super_block *sb;
 	long	total_objects = 0;
 
-	sb = container_of(shrink, struct super_block, s_shrink);
+	sb = shrink->private_data;
 
 	/*
 	 * We don't call trylock_super() here as it is a scalability bottleneck,
@@ -182,7 +182,10 @@ static void destroy_unused_super(struct super_block *s)
 	security_sb_free(s);
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
-	free_prealloced_shrinker(&s->s_shrink);
+	if (s->s_shrink) {
+		free_prealloced_shrinker(s->s_shrink);
+		shrinker_free(s->s_shrink);
+	}
 	/* no delays needed */
 	destroy_super_work(&s->destroy_work);
 }
@@ -259,16 +262,19 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	s->s_time_min = TIME64_MIN;
 	s->s_time_max = TIME64_MAX;
 
-	s->s_shrink.seeks = DEFAULT_SEEKS;
-	s->s_shrink.scan_objects = super_cache_scan;
-	s->s_shrink.count_objects = super_cache_count;
-	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
-	if (prealloc_shrinker(&s->s_shrink, "sb-%s", type->name))
+	s->s_shrink = shrinker_alloc_and_init(super_cache_count,
+					      super_cache_scan, 1024,
+					      DEFAULT_SEEKS,
+					      SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
+					      s);
+	if (!s->s_shrink)
+		goto fail;
+
+	if (prealloc_shrinker(s->s_shrink, "sb-%s", type->name))
 		goto fail;
-	if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
+	if (list_lru_init_memcg(&s->s_dentry_lru, s->s_shrink))
 		goto fail;
-	if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
+	if (list_lru_init_memcg(&s->s_inode_lru, s->s_shrink))
 		goto fail;
 	return s;
 
@@ -326,7 +332,7 @@ void deactivate_locked_super(struct super_block *s)
 {
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
-		unregister_shrinker(&s->s_shrink);
+		unregister_and_free_shrinker(s->s_shrink);
 		fs->kill_sb(s);
 
 		/*
@@ -599,7 +605,7 @@ struct super_block *sget_fc(struct fs_context *fc,
 	hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(s->s_type);
-	register_shrinker_prepared(&s->s_shrink);
+	register_shrinker_prepared(s->s_shrink);
 	return s;
 
 share_extant_sb:
@@ -678,7 +684,7 @@ struct super_block *sget(struct file_system_type *type,
 	hlist_add_head(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(type);
-	register_shrinker_prepared(&s->s_shrink);
+	register_shrinker_prepared(s->s_shrink);
 	return s;
 }
 EXPORT_SYMBOL(sget);
@@ -1308,7 +1314,7 @@ int get_tree_bdev(struct fs_context *fc,
 		down_write(&s->s_umount);
 	} else {
 		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
-		shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s",
+		shrinker_debugfs_rename(s->s_shrink, "sb-%s:%s",
 					fc->fs_type->name, s->s_id);
 		sb_set_blocksize(s, block_size(bdev));
 		error = fill_super(s, fc);
@@ -1381,7 +1387,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
 		down_write(&s->s_umount);
 	} else {
 		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
-		shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s",
+		shrinker_debugfs_rename(s->s_shrink, "sb-%s:%s",
 					fs_type->name, s->s_id);
 		sb_set_blocksize(s, block_size(bdev));
 		error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 53e0b5e98046..dd6f8ce28385 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1228,7 +1228,7 @@ struct super_block {
 
 	const struct dentry_operations *s_d_op; /* default d_op for dentries */
 
-	struct shrinker s_shrink;	/* per-sb shrinker handle */
+	struct shrinker *s_shrink;	/* per-sb shrinker handle */
 
 	/* Number of inodes with nlink == 0 but still referenced */
 	atomic_long_t s_remove_count;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 22/29] drm/ttm: introduce pool_shrink_rwsem
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (20 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 21/29] fs: super: dynamically allocate the s_shrink Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 23/29] mm: shrinker: add refcount and completion_wait fields Qi Zheng
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

Currently, synchronize_shrinkers() is only used by the TTM pool. It
only requires that no shrinkers run in parallel.

Once we use the RCU+refcount method to implement the lockless slab
shrink, we can no longer use shrinker_rwsem or synchronize_rcu() to
guarantee that all shrinker invocations have seen an update before
memory is freed.

So introduce a new pool_shrink_rwsem to implement a private
synchronize_shrinkers() that serves the same purpose.
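
As a usage sketch of how the new barrier pairs up (the read side is
the ttm_pool_shrink() hunk below; the write-side caller is expected to
be the existing pool teardown path, and the function names here are
illustrative only):

```
/* Read side: each shrink pass runs with the rwsem held for read,
 * so any number of passes may run in parallel. */
static unsigned int pool_shrink_one(void)
{
	unsigned int num_pages = 0;

	down_read(&pool_shrink_rwsem);
	/* ... pick a ttm_pool_type and free one batch of pages ... */
	up_read(&pool_shrink_rwsem);

	return num_pages;
}

/* Write side: taking and releasing the rwsem for write acts as a
 * barrier -- every shrink pass that started earlier has finished. */
static void pool_teardown(struct ttm_pool *pool)
{
	/* ... unlink the pool's types from shrinker_list ... */

	synchronize_shrinkers();

	/* no shrinker can still be touching this pool's pages */
}
```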

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 15 +++++++++++++++
 include/linux/shrinker.h       |  1 -
 mm/vmscan.c                    | 15 ---------------
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index cddb9151d20f..713b1c0a70e1 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -74,6 +74,7 @@ static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
 static spinlock_t shrinker_lock;
 static struct list_head shrinker_list;
 static struct shrinker mm_shrinker;
+static DECLARE_RWSEM(pool_shrink_rwsem);
 
 /* Allocate pages of size 1 << order with the given gfp_flags */
 static struct page *ttm_pool_alloc_page(struct ttm_pool *pool, gfp_t gfp_flags,
@@ -317,6 +318,7 @@ static unsigned int ttm_pool_shrink(void)
 	unsigned int num_pages;
 	struct page *p;
 
+	down_read(&pool_shrink_rwsem);
 	spin_lock(&shrinker_lock);
 	pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
 	list_move_tail(&pt->shrinker_list, &shrinker_list);
@@ -329,6 +331,7 @@ static unsigned int ttm_pool_shrink(void)
 	} else {
 		num_pages = 0;
 	}
+	up_read(&pool_shrink_rwsem);
 
 	return num_pages;
 }
@@ -572,6 +575,18 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 }
 EXPORT_SYMBOL(ttm_pool_init);
 
+/**
+ * synchronize_shrinkers - Wait for all running shrinkers to complete.
+ *
+ * This is useful to guarantee that all shrinker invocations have seen an
+ * update, before freeing memory, similar to rcu.
+ */
+static void synchronize_shrinkers(void)
+{
+	down_write(&pool_shrink_rwsem);
+	up_write(&pool_shrink_rwsem);
+}
+
 /**
  * ttm_pool_fini - Cleanup a pool
  *
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 8e9ba6fa3fcc..4094e4c44e80 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -105,7 +105,6 @@ extern int __printf(2, 3) register_shrinker(struct shrinker *shrinker,
 					    const char *fmt, ...);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
-extern void synchronize_shrinkers(void);
 
 typedef unsigned long (*count_objects_cb)(struct shrinker *s,
 					  struct shrink_control *sc);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64ff598fbad9..3a8d50ad6ff6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -844,21 +844,6 @@ void unregister_and_free_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_and_free_shrinker);
 
-/**
- * synchronize_shrinkers - Wait for all running shrinkers to complete.
- *
- * This is equivalent to calling unregister_shrink() and register_shrinker(),
- * but atomically and with less overhead. This is useful to guarantee that all
- * shrinker invocations have seen an update, before freeing memory, similar to
- * rcu.
- */
-void synchronize_shrinkers(void)
-{
-	down_write(&shrinker_rwsem);
-	up_write(&shrinker_rwsem);
-}
-EXPORT_SYMBOL(synchronize_shrinkers);
-
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 23/29] mm: shrinker: add refcount and completion_wait fields
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (21 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 22/29] drm/ttm: introduce pool_shrink_rwsem Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 24/29] mm: vmscan: make global slab shrink lockless Qi Zheng
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

This commit introduces refcount and completion_wait fields to struct
shrinker to manage the life cycle of the shrinker instance.

This is just preparation work for implementing the lockless slab
shrink; there are no functional changes.
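
A distilled view of the intended lifecycle, assuming the read-side
shrinker_try_get() that the next patch introduces (the hunks below are
the authoritative version; the *_lifecycle_* helpers here are purely
illustrative):

```
/* Register: the shrinker starts with one reference owned by the
 * registration itself (see register_shrinker_prepared() below). */
static void shrinker_lifecycle_start(struct shrinker *shrinker)
{
	refcount_set(&shrinker->refcount, 1);
	init_completion(&shrinker->completion_wait);
}

/* Reader: only run a shrinker whose refcount is still non-zero,
 * i.e. one that has not started unregistering. */
static bool shrinker_lifecycle_get(struct shrinker *shrinker)
{
	return refcount_inc_not_zero(&shrinker->refcount);
}

/* Unregister: drop the registration reference, then wait for the
 * last reader's shrinker_put() to fire the completion. After this,
 * no shrink_slab() caller can still be running this shrinker. */
static void shrinker_lifecycle_stop(struct shrinker *shrinker)
{
	shrinker_put(shrinker);
	wait_for_completion(&shrinker->completion_wait);
}
```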

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/shrinker.h | 11 +++++++++++
 mm/vmscan.c              |  5 +++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 4094e4c44e80..7bfeb2f25246 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,6 +4,8 @@
 
 #include <linux/atomic.h>
 #include <linux/types.h>
+#include <linux/refcount.h>
+#include <linux/completion.h>
 
 /*
  * This struct is used to pass information from page reclaim to the shrinkers.
@@ -70,6 +72,9 @@ struct shrinker {
 	int seeks;	/* seeks to recreate an obj */
 	unsigned flags;
 
+	refcount_t refcount;
+	struct completion completion_wait;
+
 	void *private_data;
 
 	/* These are for internal use */
@@ -118,6 +123,12 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
 void shrinker_free(struct shrinker *shrinker);
 void unregister_and_free_shrinker(struct shrinker *shrinker);
 
+static inline void shrinker_put(struct shrinker *shrinker)
+{
+	if (refcount_dec_and_test(&shrinker->refcount))
+		complete(&shrinker->completion_wait);
+}
+
 #ifdef CONFIG_SHRINKER_DEBUG
 extern int shrinker_debugfs_add(struct shrinker *shrinker);
 extern struct dentry *shrinker_debugfs_detach(struct shrinker *shrinker,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8d50ad6ff6..6f9c4750effa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -740,6 +740,8 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 void register_shrinker_prepared(struct shrinker *shrinker)
 {
 	down_write(&shrinker_rwsem);
+	refcount_set(&shrinker->refcount, 1);
+	init_completion(&shrinker->completion_wait);
 	list_add_tail(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
@@ -794,6 +796,9 @@ void unregister_shrinker(struct shrinker *shrinker)
 	if (!(shrinker->flags & SHRINKER_REGISTERED))
 		return;
 
+	shrinker_put(shrinker);
+	wait_for_completion(&shrinker->completion_wait);
+
 	down_write(&shrinker_rwsem);
 	list_del(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (22 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 23/29] mm: shrinker: add refcount and completion_wait fields Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22 15:12   ` Vlastimil Babka
  2023-06-22  8:53 ` [PATCH 25/29] mm: vmscan: make memcg " Qi Zheng
                   ` (5 subsequent siblings)
  29 siblings, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

The shrinker_rwsem is a global read-write lock in the shrinker
subsystem that protects most operations, such as slab shrinking and
the registration and unregistration of shrinkers. This can easily
cause problems in the following cases.

1) When memory pressure is high and many filesystems are mounted or
   unmounted at the same time, slab shrinking is affected
   (down_read_trylock() fails).

   For example, the real workload mentioned by Kirill Tkhai:

   ```
   One of the real workloads from my experience is start
   of an overcommitted node containing many starting
   containers after node crash (or many resuming containers
   after reboot for kernel update). In these cases memory
   pressure is huge, and the node goes round in long reclaim.
   ```

2) If a shrinker is blocked (such as the case mentioned in [1]) and
   a writer comes in (e.g. mounting a filesystem), the writer will be
   blocked and will in turn block all subsequent shrinker-related
   operations.

Even without any contention while shrinking slab, there may still be
a problem. If the shrinker list is long and each shrinker reclaims
little memory, down_read_trylock() ends up being called at high
frequency. Because of the poor multicore scalability of atomic
operations, this can lead to a significant drop in IPC (instructions
per cycle).

We used to implement the lockless slab shrink with SRCU [1], but then
the kernel test robot reported a -88.8% regression in the
stress-ng.ramfs.ops_per_sec test case [2], so we reverted it [3].

This commit uses the refcount+RCU method [4] proposed by Dave Chinner
to re-implement the lockless global slab shrink. The memcg slab shrink
is handled in a subsequent patch.

Currently, the shrinker instances can be divided into
the following three types:

a) global shrinker instances statically defined in the kernel, such
as workingset_shadow_shrinker.

b) global shrinker instances statically defined in kernel modules,
such as mmu_shrinker in x86.

c) shrinker instances embedded in other structures.

For case a, the memory of the shrinker instance is never freed. For
case b, the memory of the shrinker instance is freed after the module
is unloaded, but free_module() calls synchronize_rcu() to wait for RCU
read-side critical sections to exit. For case c, the memory of the
shrinker instance is dynamically freed by calling kfree_rcu(). So in
all three cases we can use rcu_read_{lock,unlock}() to ensure that the
shrinker instance is valid.

The shrinker::refcount mechanism ensures that the shrinker instance
will not be run again after unregistration. So the structure that
records the pointer to the shrinker instance can be safely freed
without waiting for the RCU read-side critical section.

In this way, we implement the lockless slab shrink without having to
block in unregister_shrinker() waiting for the RCU read-side critical
section to finish.
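
Distilled from the shrink_slab() hunk below (shrink_control setup and
SHRINK_EMPTY handling omitted, and the wrapper function name is
illustrative), with comments on why dropping the RCU lock around the
callback is safe:

```
static unsigned long shrink_slab_lockless(struct shrink_control *sc,
					  int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		/* skip shrinkers that are already unregistering */
		if (!shrinker_try_get(shrinker))
			continue;

		/*
		 * The reference keeps the shrinker alive and, because
		 * unregister_shrinker() waits for it to be dropped,
		 * also keeps it on shrinker_list. So the RCU lock can
		 * be released while the (possibly sleeping) callback
		 * runs.
		 */
		rcu_read_unlock();

		freed += do_shrink_slab(sc, shrinker, priority);

		/*
		 * Re-enter the RCU section before dropping the
		 * reference: even if the shrinker is unregistered
		 * right after the put, RCU keeps the list node valid
		 * until we advance to the next entry.
		 */
		rcu_read_lock();
		shrinker_put(shrinker);
	}
	rcu_read_unlock();

	return freed;
}
```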

The following are the test results:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
 ramfs:
          1 System Management Interrupt
 for a 60.03s run time:
    5762.40s available CPU time
       7.71s user time   (  0.13%)
     226.93s system time (  3.94%)
     234.64s total time  (  4.07%)
 load average: 8.54 3.06 2.11
 passed: 9: ramfs (9)
 failed: 0
 skipped: 0
 successful run completed in 60.03s (1 min, 0.03 secs)

2) After applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
 ramfs:
          4 System Management Interrupts
 for a 60.12s run time:
    5771.95s available CPU time
       7.44s user time   (  0.13%)
     230.22s system time (  3.99%)
     237.66s total time  (  4.12%)
 load average: 8.18 2.43 0.84
 passed: 9: ramfs (9)
 failed: 0
 skipped: 0
 successful run completed in 60.12s (1 min, 0.12 secs)

We can see that the ops/s has hardly changed.

[1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
[4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/shrinker.h |  6 ++++++
 mm/vmscan.c              | 33 ++++++++++++++-------------------
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 7bfeb2f25246..b0c6c2df9db8 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -74,6 +74,7 @@ struct shrinker {
 
 	refcount_t refcount;
 	struct completion completion_wait;
+	struct rcu_head rcu;
 
 	void *private_data;
 
@@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
 void shrinker_free(struct shrinker *shrinker);
 void unregister_and_free_shrinker(struct shrinker *shrinker);
 
+static inline bool shrinker_try_get(struct shrinker *shrinker)
+{
+	return refcount_inc_not_zero(&shrinker->refcount);
+}
+
 static inline void shrinker_put(struct shrinker *shrinker)
 {
 	if (refcount_dec_and_test(&shrinker->refcount))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f9c4750effa..767569698946 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/rculist.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
 	down_write(&shrinker_rwsem);
 	refcount_set(&shrinker->refcount, 1);
 	init_completion(&shrinker->completion_wait);
-	list_add_tail(&shrinker->list, &shrinker_list);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
 	up_write(&shrinker_rwsem);
@@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
 	wait_for_completion(&shrinker->completion_wait);
 
 	down_write(&shrinker_rwsem);
-	list_del(&shrinker->list);
+	list_del_rcu(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
@@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
 void unregister_and_free_shrinker(struct shrinker *shrinker)
 {
 	unregister_shrinker(shrinker);
-	kfree(shrinker);
+	kfree_rcu(shrinker, rcu);
 }
 EXPORT_SYMBOL(unregister_and_free_shrinker);
 
@@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
 		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		goto out;
-
-	list_for_each_entry(shrinker, &shrinker_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
 			.memcg = memcg,
 		};
 
+		if (!shrinker_try_get(shrinker))
+			continue;
+		rcu_read_unlock();
+
 		ret = do_shrink_slab(&sc, shrinker, priority);
 		if (ret == SHRINK_EMPTY)
 			ret = 0;
 		freed += ret;
-		/*
-		 * Bail out if someone want to register a new shrinker to
-		 * prevent the registration from being stalled for long periods
-		 * by parallel ongoing shrinking.
-		 */
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
-	}
 
-	up_read(&shrinker_rwsem);
-out:
+		rcu_read_lock();
+		shrinker_put(shrinker);
+	}
+	rcu_read_unlock();
 	cond_resched();
 	return freed;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 25/29] mm: vmscan: make memcg slab shrink lockless
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (23 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 24/29] mm: vmscan: make global slab shrink lockless Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 26/29] mm: shrinker: make count and scan in shrinker debugfs lockless Qi Zheng
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

Like the global slab shrink, this commit also uses the refcount+RCU
method to make the memcg slab shrink lockless.
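
As a rough illustration of the lifetime rule that the refcount+RCU
scheme relies on, here is a minimal userspace model (a sketch only:
C11 atomics and a pthread condvar stand in for refcount_t, RCU and
the completion, and all of the model_* names are made up). A reader
may use a shrinker only after a successful try_get, and
unregistration drops the initial reference and then waits until the
last reader has dropped theirs, after which freeing is safe:

```
/*
 * Toy userspace model of the shrinker refcount lifetime rule.
 * Illustration only: not kernel code, just the shape of
 * shrinker_try_get()/shrinker_put()/unregister_shrinker().
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct shrinker_model {
	atomic_int refcount;		/* starts at 1, like refcount_set(..., 1) */
	pthread_mutex_t lock;
	pthread_cond_t released;	/* stands in for completion_wait */
};

/* like refcount_inc_not_zero(): never pin a shrinker that is dying */
static bool model_try_get(struct shrinker_model *s)
{
	int old = atomic_load(&s->refcount);

	while (old != 0)
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	return false;
}

static void model_put(struct shrinker_model *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1) {
		/* last reference gone: wake up the unregister path */
		pthread_mutex_lock(&s->lock);
		pthread_cond_signal(&s->released);
		pthread_mutex_unlock(&s->lock);
	}
}

/* unregister: drop the initial reference, then wait for all readers */
static void model_unregister(struct shrinker_model *s)
{
	model_put(s);
	pthread_mutex_lock(&s->lock);
	while (atomic_load(&s->refcount) != 0)
		pthread_cond_wait(&s->released, &s->lock);
	pthread_mutex_unlock(&s->lock);
	/* from here on model_try_get() can never succeed: freeing is safe */
}

int main(void)
{
	struct shrinker_model s = {
		.refcount = 1,
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.released = PTHREAD_COND_INITIALIZER,
	};

	if (model_try_get(&s)) {	/* a reader pins the shrinker ... */
		puts("reader: pinned the shrinker, shrinking");
		model_put(&s);		/* ... and drops it when done */
	}

	model_unregister(&s);
	puts("unregister: all references gone, safe to free");

	if (!model_try_get(&s))		/* a late reader just skips it */
		puts("late reader: try_get failed, shrinker is gone");
	return 0;
}
```

Something like `cc -pthread model.c && ./a.out` is enough to run it
(the file name is arbitrary); it only mirrors the lifetime rule, not
the RCU-protected list walk itself.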

We can reproduce the down_read_trylock() hotspot through the
following script:

```

DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run the test and touch commands.
Then we can use the following perf command to view the hotspots:

perf top -U -F 999 [-g]

1) Before applying this patchset:

  35.34%  [kernel]             [k] down_read_trylock
  18.44%  [kernel]             [k] shrink_slab
  15.98%  [kernel]             [k] pv_native_safe_halt
  15.08%  [kernel]             [k] up_read
   5.33%  [kernel]             [k] idr_find
   2.71%  [kernel]             [k] _find_next_bit
   2.21%  [kernel]             [k] shrink_node
   1.29%  [kernel]             [k] shrink_lruvec
   0.66%  [kernel]             [k] do_shrink_slab
   0.33%  [kernel]             [k] list_lru_count_one
   0.33%  [kernel]             [k] __radix_tree_lookup
   0.25%  [kernel]             [k] mem_cgroup_iter

-   82.19%    19.49%  [kernel]                  [k] shrink_slab
   - 62.00% shrink_slab
        36.37% down_read_trylock
        15.52% up_read
        5.48% idr_find
        3.38% _find_next_bit
      + 0.98% do_shrink_slab

2) After applying this patchset:

  46.83%  [kernel]           [k] shrink_slab
  20.52%  [kernel]           [k] pv_native_safe_halt
   8.85%  [kernel]           [k] do_shrink_slab
   7.71%  [kernel]           [k] _find_next_bit
   1.72%  [kernel]           [k] xas_descend
   1.70%  [kernel]           [k] shrink_node
   1.44%  [kernel]           [k] shrink_lruvec
   1.43%  [kernel]           [k] mem_cgroup_iter
   1.28%  [kernel]           [k] xas_load
   0.89%  [kernel]           [k] super_cache_count
   0.84%  [kernel]           [k] xas_start
   0.66%  [kernel]           [k] list_lru_count_one

-   65.50%    40.44%  [kernel]                  [k] shrink_slab
   - 22.96% shrink_slab
        13.11% _find_next_bit
      - 9.91% do_shrink_slab
         - 1.59% super_cache_count
              0.92% list_lru_count_one

We can see that the down_read_trylock() hotspot is gone and that
shrink_slab is now the top perf hotspot, which is what we expect.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 58 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 767569698946..357a1f2ad690 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,6 +213,12 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 					 lockdep_is_held(&shrinker_rwsem));
 }
 
+static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
+					       int nid)
+{
+	return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+}
+
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 				    int map_size, int defer_size,
 				    int old_map_size, int old_defer_size,
@@ -339,7 +345,7 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 		struct shrinker_info *info;
 
 		rcu_read_lock();
-		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+		info = shrinker_info_rcu(memcg, nid);
 		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
 			/* Pairs with smp mb in shrink_slab() */
 			smp_mb__before_atomic();
@@ -359,7 +365,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -392,18 +397,28 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
 				   struct mem_cgroup *memcg)
 {
 	struct shrinker_info *info;
+	long nr_deferred;
 
-	info = shrinker_info_protected(memcg, nid);
-	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+	rcu_read_lock();
+	info = shrinker_info_rcu(memcg, nid);
+	nr_deferred = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+	rcu_read_unlock();
+
+	return nr_deferred;
 }
 
 static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
 				  struct mem_cgroup *memcg)
 {
 	struct shrinker_info *info;
+	long nr_deferred;
+
+	rcu_read_lock();
+	info = shrinker_info_rcu(memcg, nid);
+	nr_deferred = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+	rcu_read_unlock();
 
-	info = shrinker_info_protected(memcg, nid);
-	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+	return nr_deferred;
 }
 
 void reparent_shrinker_deferred(struct mem_cgroup *memcg)
@@ -955,19 +970,18 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
-	int i;
+	int i = 0;
 
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+again:
+	rcu_read_lock();
+	info = shrinker_info_rcu(memcg, nid);
 	if (unlikely(!info))
 		goto unlock;
 
-	for_each_set_bit(i, info->map, info->map_nr_max) {
+	for_each_set_bit_from(i, info->map, info->map_nr_max) {
 		struct shrink_control sc = {
 			.gfp_mask = gfp_mask,
 			.nid = nid,
@@ -982,6 +996,10 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			continue;
 		}
 
+		if (!shrinker_try_get(shrinker))
+			continue;
+		rcu_read_unlock();
+
 		/* Call non-slab shrinkers even though kmem is disabled */
 		if (!memcg_kmem_online() &&
 		    !(shrinker->flags & SHRINKER_NONSLAB))
@@ -1014,13 +1032,19 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		}
 		freed += ret;
 
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
+		shrinker_put(shrinker);
+
+		/*
+		 * We have already exited the read-side of rcu critical section
+		 * before calling do_shrink_slab(), the shrinker_info may be
+		 * released in expand_one_shrinker_info(), so restart the
+		 * iteration.
+		 */
+		i++;
+		goto again;
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	rcu_read_unlock();
 	return freed;
 }
 #else /* CONFIG_MEMCG */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 26/29] mm: shrinker: make count and scan in shrinker debugfs lockless
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (24 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 25/29] mm: vmscan: make memcg " Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 27/29] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

Like the global and memcg slab shrink, also make the count and scan
operations in the shrinker debugfs interface lockless.

Since debugfs_remove_recursive() waits for debugfs_file_put() to
return, there is no need to call rcu_read_lock() before calling
shrinker_try_get().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/shrinker_debug.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index 3ab53fad8876..c18fa9b6b7f0 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -55,8 +55,8 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
 	if (!count_per_node)
 		return -ENOMEM;
 
-	ret = down_read_killable(&shrinker_rwsem);
-	if (ret) {
+	ret = shrinker_try_get(shrinker);
+	if (!ret) {
 		kfree(count_per_node);
 		return ret;
 	}
@@ -92,7 +92,7 @@ static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 
 	rcu_read_unlock();
-	up_read(&shrinker_rwsem);
+	shrinker_put(shrinker);
 
 	kfree(count_per_node);
 	return ret;
@@ -146,8 +146,8 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 		return -EINVAL;
 	}
 
-	ret = down_read_killable(&shrinker_rwsem);
-	if (ret) {
+	ret = shrinker_try_get(shrinker);
+	if (!ret) {
 		mem_cgroup_put(memcg);
 		return ret;
 	}
@@ -159,7 +159,7 @@ static ssize_t shrinker_debugfs_scan_write(struct file *file,
 
 	shrinker->scan_objects(shrinker, &sc);
 
-	up_read(&shrinker_rwsem);
+	shrinker_put(shrinker);
 	mem_cgroup_put(memcg);
 
 	return size;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 27/29] mm: vmscan: hold write lock to reparent shrinker nr_deferred
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (25 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 26/29] mm: shrinker: make count and scan in shrinker debugfs lockless Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 28/29] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

For now, reparent_shrinker_deferred() is the only remaining holder
of the read lock of shrinker_rwsem. And since it already holds the
global cgroup_mutex, it will not be called in parallel with itself.

Therefore, in order to convert shrinker_rwsem to shrinker_mutex
later, switch it to take the write lock of shrinker_rwsem while
reparenting.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 357a1f2ad690..0711b63e68d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -433,7 +433,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/* Prevent from concurrent shrinker_info expand */
-	down_read(&shrinker_rwsem);
+	down_write(&shrinker_rwsem);
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
@@ -442,7 +442,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
 	}
-	up_read(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 28/29] mm: shrinkers: convert shrinker_rwsem to mutex
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (26 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 27/29] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22  8:53 ` [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file Qi Zheng
  2023-06-22  9:02 ` [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

Now there are no readers of shrinker_rwsem left, so we
can simply replace it with a mutex.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 drivers/md/dm-cache-metadata.c |  2 +-
 drivers/md/dm-thin-metadata.c  |  2 +-
 fs/super.c                     |  2 +-
 mm/shrinker_debug.c            | 14 +++++++-------
 mm/vmscan.c                    | 34 +++++++++++++++++-----------------
 5 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-cache-metadata.c b/drivers/md/dm-cache-metadata.c
index acffed750e3e..9e0c69958587 100644
--- a/drivers/md/dm-cache-metadata.c
+++ b/drivers/md/dm-cache-metadata.c
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs cmd root_lock).
-	 * - must take shrinker_rwsem without holding cmd->root_lock
+	 * - must take shrinker_mutex without holding cmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 CACHE_MAX_CONCURRENT_LOCKS);
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index fd464fb024c3..9f5cb52c5763 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs pmd root_lock).
-	 * - must take shrinker_rwsem without holding pmd->root_lock
+	 * - must take shrinker_mutex without holding pmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 THIN_MAX_CONCURRENT_LOCKS);
diff --git a/fs/super.c b/fs/super.c
index 791342bb8ac9..471800ff793a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -54,7 +54,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
  * If that happens we could trigger unregistering the shrinker from within the
- * shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
+ * shrinker path and that leads to deadlock on the shrinker_mutex. Hence we
  * take a passive reference to the superblock to avoid this from occurring.
  */
 static unsigned long super_cache_scan(struct shrinker *shrink,
diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
index c18fa9b6b7f0..7ad903f84463 100644
--- a/mm/shrinker_debug.c
+++ b/mm/shrinker_debug.c
@@ -7,7 +7,7 @@
 #include <linux/memcontrol.h>
 
 /* defined in vmscan.c */
-extern struct rw_semaphore shrinker_rwsem;
+extern struct mutex shrinker_mutex;
 extern struct list_head shrinker_list;
 
 static DEFINE_IDA(shrinker_debugfs_ida);
@@ -177,7 +177,7 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
 	char buf[128];
 	int id;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	/* debugfs isn't initialized yet, add debugfs entries later. */
 	if (!shrinker_debugfs_root)
@@ -220,7 +220,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
 	if (!new)
 		return -ENOMEM;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 
 	old = shrinker->name;
 	shrinker->name = new;
@@ -238,7 +238,7 @@ int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
 			shrinker->debugfs_entry = entry;
 	}
 
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	kfree_const(old);
 
@@ -251,7 +251,7 @@ struct dentry *shrinker_debugfs_detach(struct shrinker *shrinker,
 {
 	struct dentry *entry = shrinker->debugfs_entry;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	kfree_const(shrinker->name);
 	shrinker->name = NULL;
@@ -280,14 +280,14 @@ static int __init shrinker_debugfs_init(void)
 	shrinker_debugfs_root = dentry;
 
 	/* Create debugfs entries for shrinkers registered at boot */
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_for_each_entry(shrinker, &shrinker_list, list)
 		if (!shrinker->debugfs_entry) {
 			ret = shrinker_debugfs_add(shrinker);
 			if (ret)
 				break;
 		}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	return ret;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0711b63e68d9..bcdd97caa403 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,7 @@
 #include <linux/cpuset.h>
 #include <linux/compaction.h>
 #include <linux/notifier.h>
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -190,7 +190,7 @@ struct scan_control {
 int vm_swappiness = 60;
 
 LIST_HEAD(shrinker_list);
-DECLARE_RWSEM(shrinker_rwsem);
+DEFINE_MUTEX(shrinker_mutex);
 
 #ifdef CONFIG_MEMCG
 static int shrinker_nr_max;
@@ -210,7 +210,7 @@ static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 						     int nid)
 {
 	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
-					 lockdep_is_held(&shrinker_rwsem));
+					 lockdep_is_held(&shrinker_mutex));
 }
 
 static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
@@ -283,7 +283,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 	int nid, size, ret = 0;
 	int map_size, defer_size = 0;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	map_size = shrinker_map_size(shrinker_nr_max);
 	defer_size = shrinker_defer_size(shrinker_nr_max);
 	size = map_size + defer_size;
@@ -299,7 +299,7 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
 		info->map_nr_max = shrinker_nr_max;
 		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
 	}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	return ret;
 }
@@ -315,7 +315,7 @@ static int expand_shrinker_info(int new_id)
 	if (!root_mem_cgroup)
 		goto out;
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	map_size = shrinker_map_size(new_nr_max);
 	defer_size = shrinker_defer_size(new_nr_max);
@@ -364,7 +364,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	if (mem_cgroup_disabled())
 		return -ENOSYS;
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -378,7 +378,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	shrinker->id = id;
 	ret = 0;
 unlock:
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 	return ret;
 }
 
@@ -388,7 +388,7 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
 
 	BUG_ON(id < 0);
 
-	lockdep_assert_held(&shrinker_rwsem);
+	lockdep_assert_held(&shrinker_mutex);
 
 	idr_remove(&shrinker_idr, id);
 }
@@ -433,7 +433,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/* Prevent from concurrent shrinker_info expand */
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	for_each_node(nid) {
 		child_info = shrinker_info_protected(memcg, nid);
 		parent_info = shrinker_info_protected(parent, nid);
@@ -442,7 +442,7 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
 			atomic_long_add(nr, &parent_info->nr_deferred[i]);
 		}
 	}
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 }
 
 static bool cgroup_reclaim(struct scan_control *sc)
@@ -743,9 +743,9 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 	shrinker->name = NULL;
 #endif
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		down_write(&shrinker_rwsem);
+		mutex_lock(&shrinker_mutex);
 		unregister_memcg_shrinker(shrinker);
-		up_write(&shrinker_rwsem);
+		mutex_unlock(&shrinker_mutex);
 		return;
 	}
 
@@ -755,13 +755,13 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
 
 void register_shrinker_prepared(struct shrinker *shrinker)
 {
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	refcount_set(&shrinker->refcount, 1);
 	init_completion(&shrinker->completion_wait);
 	list_add_tail_rcu(&shrinker->list, &shrinker_list);
 	shrinker->flags |= SHRINKER_REGISTERED;
 	shrinker_debugfs_add(shrinker);
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 }
 
 static int __register_shrinker(struct shrinker *shrinker)
@@ -815,13 +815,13 @@ void unregister_shrinker(struct shrinker *shrinker)
 	shrinker_put(shrinker);
 	wait_for_completion(&shrinker->completion_wait);
 
-	down_write(&shrinker_rwsem);
+	mutex_lock(&shrinker_mutex);
 	list_del_rcu(&shrinker->list);
 	shrinker->flags &= ~SHRINKER_REGISTERED;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
 	debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
-	up_write(&shrinker_rwsem);
+	mutex_unlock(&shrinker_mutex);
 
 	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (27 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 28/29] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
@ 2023-06-22  8:53 ` Qi Zheng
  2023-06-22 14:53   ` Vlastimil Babka
  2023-06-23  5:25   ` Sergey Senozhatsky
  2023-06-22  9:02 ` [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
  29 siblings, 2 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  8:53 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	Qi Zheng

The mm/vmscan.c file is too large, so move the shrinker-related
code out of it into a separate file. No functional changes.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/shrinker.h |   3 +
 mm/Makefile              |   4 +-
 mm/shrinker.c            | 750 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c              | 746 --------------------------------------
 4 files changed, 755 insertions(+), 748 deletions(-)
 create mode 100644 mm/shrinker.c

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index b0c6c2df9db8..b837b4c0864a 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -104,6 +104,9 @@ struct shrinker {
  */
 #define SHRINKER_NONSLAB	(1 << 3)
 
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
+			  int priority);
+
 extern int __printf(2, 3) prealloc_shrinker(struct shrinker *shrinker,
 					    const char *fmt, ...);
 extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/Makefile b/mm/Makefile
index 678530a07326..891899186608 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -48,8 +48,8 @@ endif
 
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   maccess.o page-writeback.o folio-compat.o \
-			   readahead.o swap.o truncate.o vmscan.o shmem.o \
-			   util.o mmzone.o vmstat.o backing-dev.o \
+			   readahead.o swap.o truncate.o vmscan.o shrinker.o \
+			   shmem.o util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o show_mem.o\
 			   interval_tree.o list_lru.o workingset.o \
diff --git a/mm/shrinker.c b/mm/shrinker.c
new file mode 100644
index 000000000000..f88966f3d6be
--- /dev/null
+++ b/mm/shrinker.c
@@ -0,0 +1,750 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/memcontrol.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/shrinker.h>
+#include <trace/events/vmscan.h>
+
+LIST_HEAD(shrinker_list);
+DEFINE_MUTEX(shrinker_mutex);
+
+#ifdef CONFIG_MEMCG
+static int shrinker_nr_max;
+
+/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
+static inline int shrinker_map_size(int nr_items)
+{
+	return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
+}
+
+static inline int shrinker_defer_size(int nr_items)
+{
+	return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
+}
+
+static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
+						     int nid)
+{
+	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+					 lockdep_is_held(&shrinker_mutex));
+}
+
+static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
+					       int nid)
+{
+	return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+}
+
+static int expand_one_shrinker_info(struct mem_cgroup *memcg,
+				    int map_size, int defer_size,
+				    int old_map_size, int old_defer_size,
+				    int new_nr_max)
+{
+	struct shrinker_info *new, *old;
+	struct mem_cgroup_per_node *pn;
+	int nid;
+	int size = map_size + defer_size;
+
+	for_each_node(nid) {
+		pn = memcg->nodeinfo[nid];
+		old = shrinker_info_protected(memcg, nid);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		/* Already expanded this shrinker_info */
+		if (new_nr_max <= old->map_nr_max)
+			continue;
+
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if (!new)
+			return -ENOMEM;
+
+		new->nr_deferred = (atomic_long_t *)(new + 1);
+		new->map = (void *)new->nr_deferred + defer_size;
+		new->map_nr_max = new_nr_max;
+
+		/* map: set all old bits, clear all new bits */
+		memset(new->map, (int)0xff, old_map_size);
+		memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
+		/* nr_deferred: copy old values, clear all new values */
+		memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
+		memset((void *)new->nr_deferred + old_defer_size, 0,
+		       defer_size - old_defer_size);
+
+		rcu_assign_pointer(pn->shrinker_info, new);
+		kvfree_rcu(old, rcu);
+	}
+
+	return 0;
+}
+
+void free_shrinker_info(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct shrinker_info *info;
+	int nid;
+
+	for_each_node(nid) {
+		pn = memcg->nodeinfo[nid];
+		info = rcu_dereference_protected(pn->shrinker_info, true);
+		kvfree(info);
+		rcu_assign_pointer(pn->shrinker_info, NULL);
+	}
+}
+
+int alloc_shrinker_info(struct mem_cgroup *memcg)
+{
+	struct shrinker_info *info;
+	int nid, size, ret = 0;
+	int map_size, defer_size = 0;
+
+	mutex_lock(&shrinker_mutex);
+	map_size = shrinker_map_size(shrinker_nr_max);
+	defer_size = shrinker_defer_size(shrinker_nr_max);
+	size = map_size + defer_size;
+	for_each_node(nid) {
+		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
+		if (!info) {
+			free_shrinker_info(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		info->nr_deferred = (atomic_long_t *)(info + 1);
+		info->map = (void *)info->nr_deferred + defer_size;
+		info->map_nr_max = shrinker_nr_max;
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
+	}
+	mutex_unlock(&shrinker_mutex);
+
+	return ret;
+}
+
+static int expand_shrinker_info(int new_id)
+{
+	int ret = 0;
+	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
+	int map_size, defer_size = 0;
+	int old_map_size, old_defer_size = 0;
+	struct mem_cgroup *memcg;
+
+	if (!root_mem_cgroup)
+		goto out;
+
+	lockdep_assert_held(&shrinker_mutex);
+
+	map_size = shrinker_map_size(new_nr_max);
+	defer_size = shrinker_defer_size(new_nr_max);
+	old_map_size = shrinker_map_size(shrinker_nr_max);
+	old_defer_size = shrinker_defer_size(shrinker_nr_max);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
+					       old_map_size, old_defer_size,
+					       new_nr_max);
+		if (ret) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto out;
+		}
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
+out:
+	if (!ret)
+		shrinker_nr_max = new_nr_max;
+
+	return ret;
+}
+
+void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
+{
+	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
+		struct shrinker_info *info;
+
+		rcu_read_lock();
+		info = shrinker_info_rcu(memcg, nid);
+		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
+			/* Pairs with smp mb in shrink_slab() */
+			smp_mb__before_atomic();
+			set_bit(shrinker_id, info->map);
+		}
+		rcu_read_unlock();
+	}
+}
+
+static DEFINE_IDR(shrinker_idr);
+
+static int prealloc_memcg_shrinker(struct shrinker *shrinker)
+{
+	int id, ret = -ENOMEM;
+
+	if (mem_cgroup_disabled())
+		return -ENOSYS;
+
+	mutex_lock(&shrinker_mutex);
+	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
+	if (id < 0)
+		goto unlock;
+
+	if (id >= shrinker_nr_max) {
+		if (expand_shrinker_info(id)) {
+			idr_remove(&shrinker_idr, id);
+			goto unlock;
+		}
+	}
+	shrinker->id = id;
+	ret = 0;
+unlock:
+	mutex_unlock(&shrinker_mutex);
+	return ret;
+}
+
+static void unregister_memcg_shrinker(struct shrinker *shrinker)
+{
+	int id = shrinker->id;
+
+	BUG_ON(id < 0);
+
+	lockdep_assert_held(&shrinker_mutex);
+
+	idr_remove(&shrinker_idr, id);
+}
+
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+				   struct mem_cgroup *memcg)
+{
+	struct shrinker_info *info;
+	long nr_deferred;
+
+	rcu_read_lock();
+	info = shrinker_info_rcu(memcg, nid);
+	nr_deferred = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
+	rcu_read_unlock();
+
+	return nr_deferred;
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+				  struct mem_cgroup *memcg)
+{
+	struct shrinker_info *info;
+	long nr_deferred;
+
+	rcu_read_lock();
+	info = shrinker_info_rcu(memcg, nid);
+	nr_deferred = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
+	rcu_read_unlock();
+
+	return nr_deferred;
+}
+
+void reparent_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	int i, nid;
+	long nr;
+	struct mem_cgroup *parent;
+	struct shrinker_info *child_info, *parent_info;
+
+	parent = parent_mem_cgroup(memcg);
+	if (!parent)
+		parent = root_mem_cgroup;
+
+	/* Prevent from concurrent shrinker_info expand */
+	mutex_lock(&shrinker_mutex);
+	for_each_node(nid) {
+		child_info = shrinker_info_protected(memcg, nid);
+		parent_info = shrinker_info_protected(parent, nid);
+		for (i = 0; i < child_info->map_nr_max; i++) {
+			nr = atomic_long_read(&child_info->nr_deferred[i]);
+			atomic_long_add(nr, &parent_info->nr_deferred[i]);
+		}
+	}
+	mutex_unlock(&shrinker_mutex);
+}
+#else
+static int prealloc_memcg_shrinker(struct shrinker *shrinker)
+{
+	return -ENOSYS;
+}
+
+static void unregister_memcg_shrinker(struct shrinker *shrinker)
+{
+}
+
+static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
+				   struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
+				  struct mem_cgroup *memcg)
+{
+	return 0;
+}
+#endif /* CONFIG_MEMCG */
+
+static long xchg_nr_deferred(struct shrinker *shrinker,
+			     struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (sc->memcg &&
+	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
+		return xchg_nr_deferred_memcg(nid, shrinker,
+					      sc->memcg);
+
+	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+}
+
+static long add_nr_deferred(long nr, struct shrinker *shrinker,
+			    struct shrink_control *sc)
+{
+	int nid = sc->nid;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+		nid = 0;
+
+	if (sc->memcg &&
+	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
+		return add_nr_deferred_memcg(nr, nid, shrinker,
+					     sc->memcg);
+
+	return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
+}
+
+#define SHRINK_BATCH 128
+
+static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
+				    struct shrinker *shrinker, int priority)
+{
+	unsigned long freed = 0;
+	unsigned long long delta;
+	long total_scan;
+	long freeable;
+	long nr;
+	long new_nr;
+	long batch_size = shrinker->batch ? shrinker->batch
+					  : SHRINK_BATCH;
+	long scanned = 0, next_deferred;
+
+	freeable = shrinker->count_objects(shrinker, shrinkctl);
+	if (freeable == 0 || freeable == SHRINK_EMPTY)
+		return freeable;
+
+	/*
+	 * copy the current shrinker scan count into a local variable
+	 * and zero it so that other concurrent shrinker invocations
+	 * don't also do this scanning work.
+	 */
+	nr = xchg_nr_deferred(shrinker, shrinkctl);
+
+	if (shrinker->seeks) {
+		delta = freeable >> priority;
+		delta *= 4;
+		do_div(delta, shrinker->seeks);
+	} else {
+		/*
+		 * These objects don't require any IO to create. Trim
+		 * them aggressively under memory pressure to keep
+		 * them from causing refetches in the IO caches.
+		 */
+		delta = freeable / 2;
+	}
+
+	total_scan = nr >> priority;
+	total_scan += delta;
+	total_scan = min(total_scan, (2 * freeable));
+
+	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
+				   freeable, delta, total_scan, priority);
+
+	/*
+	 * Normally, we should not scan less than batch_size objects in one
+	 * pass to avoid too frequent shrinker calls, but if the slab has less
+	 * than batch_size objects in total and we are really tight on memory,
+	 * we will try to reclaim all available objects, otherwise we can end
+	 * up failing allocations although there are plenty of reclaimable
+	 * objects spread over several slabs with usage less than the
+	 * batch_size.
+	 *
+	 * We detect the "tight on memory" situations by looking at the total
+	 * number of objects we want to scan (total_scan). If it is greater
+	 * than the total number of objects on slab (freeable), we must be
+	 * scanning at high prio and therefore should try to reclaim as much as
+	 * possible.
+	 */
+	while (total_scan >= batch_size ||
+	       total_scan >= freeable) {
+		unsigned long ret;
+		unsigned long nr_to_scan = min(batch_size, total_scan);
+
+		shrinkctl->nr_to_scan = nr_to_scan;
+		shrinkctl->nr_scanned = nr_to_scan;
+		ret = shrinker->scan_objects(shrinker, shrinkctl);
+		if (ret == SHRINK_STOP)
+			break;
+		freed += ret;
+
+		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
+		total_scan -= shrinkctl->nr_scanned;
+		scanned += shrinkctl->nr_scanned;
+
+		cond_resched();
+	}
+
+	/*
+	 * The deferred work is increased by any new work (delta) that wasn't
+	 * done, decreased by old deferred work that was done now.
+	 *
+	 * And it is capped to two times of the freeable items.
+	 */
+	next_deferred = max_t(long, (nr + delta - scanned), 0);
+	next_deferred = min(next_deferred, (2 * freeable));
+
+	/*
+	 * move the unused scan count back into the shrinker in a
+	 * manner that handles concurrent updates.
+	 */
+	new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
+
+	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
+	return freed;
+}
+
+#ifdef CONFIG_MEMCG
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+			struct mem_cgroup *memcg, int priority)
+{
+	struct shrinker_info *info;
+	unsigned long ret, freed = 0;
+	int i = 0;
+
+	if (!mem_cgroup_online(memcg))
+		return 0;
+
+again:
+	rcu_read_lock();
+	info = shrinker_info_rcu(memcg, nid);
+	if (unlikely(!info))
+		goto unlock;
+
+	for_each_set_bit_from(i, info->map, info->map_nr_max) {
+		struct shrink_control sc = {
+			.gfp_mask = gfp_mask,
+			.nid = nid,
+			.memcg = memcg,
+		};
+		struct shrinker *shrinker;
+
+		shrinker = idr_find(&shrinker_idr, i);
+		if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
+			if (!shrinker)
+				clear_bit(i, info->map);
+			continue;
+		}
+
+		if (!shrinker_try_get(shrinker))
+			continue;
+		rcu_read_unlock();
+
+		/* Call non-slab shrinkers even though kmem is disabled */
+		if (!memcg_kmem_online() &&
+		    !(shrinker->flags & SHRINKER_NONSLAB))
+			continue;
+
+		ret = do_shrink_slab(&sc, shrinker, priority);
+		if (ret == SHRINK_EMPTY) {
+			clear_bit(i, info->map);
+			/*
+			 * After the shrinker reported that it had no objects to
+			 * free, but before we cleared the corresponding bit in
+			 * the memcg shrinker map, a new object might have been
+			 * added. To make sure, we have the bit set in this
+			 * case, we invoke the shrinker one more time and reset
+			 * the bit if it reports that it is not empty anymore.
+			 * The memory barrier here pairs with the barrier in
+			 * set_shrinker_bit():
+			 *
+			 * list_lru_add()     shrink_slab_memcg()
+			 *   list_add_tail()    clear_bit()
+			 *   <MB>               <MB>
+			 *   set_bit()          do_shrink_slab()
+			 */
+			smp_mb__after_atomic();
+			ret = do_shrink_slab(&sc, shrinker, priority);
+			if (ret == SHRINK_EMPTY)
+				ret = 0;
+			else
+				set_shrinker_bit(memcg, nid, i);
+		}
+		freed += ret;
+
+		shrinker_put(shrinker);
+
+		/*
+		 * We have already exited the read-side of rcu critical section
+		 * before calling do_shrink_slab(), the shrinker_info may be
+		 * released in expand_one_shrinker_info(), so restart the
+		 * iteration.
+		 */
+		i++;
+		goto again;
+	}
+unlock:
+	rcu_read_unlock();
+	return freed;
+}
+#else /* CONFIG_MEMCG */
+static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
+			struct mem_cgroup *memcg, int priority)
+{
+	return 0;
+}
+#endif /* CONFIG_MEMCG */
+
+/**
+ * shrink_slab - shrink slab caches
+ * @gfp_mask: allocation context
+ * @nid: node whose slab caches to target
+ * @memcg: memory cgroup whose slab caches to target
+ * @priority: the reclaim priority
+ *
+ * Call the shrink functions to age shrinkable caches.
+ *
+ * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
+ * unaware shrinkers will receive a node id of 0 instead.
+ *
+ * @memcg specifies the memory cgroup to target. Unaware shrinkers
+ * are called only if it is the root cgroup.
+ *
+ * @priority is sc->priority, we take the number of objects and >> by priority
+ * in order to get the scan target.
+ *
+ * Returns the number of reclaimed slab objects.
+ */
+unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
+			  int priority)
+{
+	unsigned long ret, freed = 0;
+	struct shrinker *shrinker;
+
+	/*
+	 * The root memcg might be allocated even though memcg is disabled
+	 * via "cgroup_disable=memory" boot parameter.  This could make
+	 * mem_cgroup_is_root() return false, then just run memcg slab
+	 * shrink, but skip global shrink.  This may result in premature
+	 * oom.
+	 */
+	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
+		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
+		struct shrink_control sc = {
+			.gfp_mask = gfp_mask,
+			.nid = nid,
+			.memcg = memcg,
+		};
+
+		if (!shrinker_try_get(shrinker))
+			continue;
+		rcu_read_unlock();
+
+		ret = do_shrink_slab(&sc, shrinker, priority);
+		if (ret == SHRINK_EMPTY)
+			ret = 0;
+		freed += ret;
+
+		rcu_read_lock();
+		shrinker_put(shrinker);
+	}
+	rcu_read_unlock();
+	cond_resched();
+	return freed;
+}
+
+/*
+ * Add a shrinker callback to be called from the vm.
+ */
+static int __prealloc_shrinker(struct shrinker *shrinker)
+{
+	unsigned int size;
+	int err;
+
+	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+		err = prealloc_memcg_shrinker(shrinker);
+		if (err != -ENOSYS)
+			return err;
+
+		shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
+	}
+
+	size = sizeof(*shrinker->nr_deferred);
+	if (shrinker->flags & SHRINKER_NUMA_AWARE)
+		size *= nr_node_ids;
+
+	shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
+	if (!shrinker->nr_deferred)
+		return -ENOMEM;
+
+	return 0;
+}
+
+#ifdef CONFIG_SHRINKER_DEBUG
+int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+	va_list ap;
+	int err;
+
+	va_start(ap, fmt);
+	shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
+	va_end(ap);
+	if (!shrinker->name)
+		return -ENOMEM;
+
+	err = __prealloc_shrinker(shrinker);
+	if (err) {
+		kfree_const(shrinker->name);
+		shrinker->name = NULL;
+	}
+
+	return err;
+}
+#else
+int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+	return __prealloc_shrinker(shrinker);
+}
+#endif
+
+void free_prealloced_shrinker(struct shrinker *shrinker)
+{
+#ifdef CONFIG_SHRINKER_DEBUG
+	kfree_const(shrinker->name);
+	shrinker->name = NULL;
+#endif
+	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
+		mutex_lock(&shrinker_mutex);
+		unregister_memcg_shrinker(shrinker);
+		mutex_unlock(&shrinker_mutex);
+		return;
+	}
+
+	kfree(shrinker->nr_deferred);
+	shrinker->nr_deferred = NULL;
+}
+
+void register_shrinker_prepared(struct shrinker *shrinker)
+{
+	mutex_lock(&shrinker_mutex);
+	refcount_set(&shrinker->refcount, 1);
+	init_completion(&shrinker->completion_wait);
+	list_add_tail_rcu(&shrinker->list, &shrinker_list);
+	shrinker->flags |= SHRINKER_REGISTERED;
+	shrinker_debugfs_add(shrinker);
+	mutex_unlock(&shrinker_mutex);
+}
+
+static int __register_shrinker(struct shrinker *shrinker)
+{
+	int err = __prealloc_shrinker(shrinker);
+
+	if (err)
+		return err;
+	register_shrinker_prepared(shrinker);
+	return 0;
+}
+
+#ifdef CONFIG_SHRINKER_DEBUG
+int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+	va_list ap;
+	int err;
+
+	va_start(ap, fmt);
+	shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
+	va_end(ap);
+	if (!shrinker->name)
+		return -ENOMEM;
+
+	err = __register_shrinker(shrinker);
+	if (err) {
+		kfree_const(shrinker->name);
+		shrinker->name = NULL;
+	}
+	return err;
+}
+#else
+int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
+{
+	return __register_shrinker(shrinker);
+}
+#endif
+EXPORT_SYMBOL(register_shrinker);
+
+/*
+ * Remove one
+ */
+void unregister_shrinker(struct shrinker *shrinker)
+{
+	struct dentry *debugfs_entry;
+	int debugfs_id;
+
+	if (!(shrinker->flags & SHRINKER_REGISTERED))
+		return;
+
+	shrinker_put(shrinker);
+	wait_for_completion(&shrinker->completion_wait);
+
+	mutex_lock(&shrinker_mutex);
+	list_del_rcu(&shrinker->list);
+	shrinker->flags &= ~SHRINKER_REGISTERED;
+	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+		unregister_memcg_shrinker(shrinker);
+	debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
+	mutex_unlock(&shrinker_mutex);
+
+	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
+
+	kfree(shrinker->nr_deferred);
+	shrinker->nr_deferred = NULL;
+}
+EXPORT_SYMBOL(unregister_shrinker);
+
+struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
+					 scan_objects_cb scan, long batch,
+					 int seeks, unsigned flags,
+					 void *priv_data)
+{
+	struct shrinker *shrinker;
+
+	shrinker = kzalloc(sizeof(struct shrinker), GFP_KERNEL);
+	if (!shrinker)
+		return NULL;
+
+	shrinker->count_objects = count;
+	shrinker->scan_objects = scan;
+	shrinker->batch = batch;
+	shrinker->seeks = seeks;
+	shrinker->flags = flags;
+	shrinker->private_data = priv_data;
+
+	return shrinker;
+}
+EXPORT_SYMBOL(shrinker_alloc_and_init);
+
+void shrinker_free(struct shrinker *shrinker)
+{
+	kfree(shrinker);
+}
+EXPORT_SYMBOL(shrinker_free);
+
+void unregister_and_free_shrinker(struct shrinker *shrinker)
+{
+	unregister_shrinker(shrinker);
+	kfree_rcu(shrinker, rcu);
+}
+EXPORT_SYMBOL(unregister_and_free_shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bcdd97caa403..0816c6496483 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -35,7 +35,6 @@
 #include <linux/cpuset.h>
 #include <linux/compaction.h>
 #include <linux/notifier.h>
-#include <linux/mutex.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -57,7 +56,6 @@
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
-#include <linux/rculist.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -189,262 +187,7 @@ struct scan_control {
  */
 int vm_swappiness = 60;
 
-LIST_HEAD(shrinker_list);
-DEFINE_MUTEX(shrinker_mutex);
-
 #ifdef CONFIG_MEMCG
-static int shrinker_nr_max;
-
-/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
-static inline int shrinker_map_size(int nr_items)
-{
-	return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned long));
-}
-
-static inline int shrinker_defer_size(int nr_items)
-{
-	return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
-}
-
-static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
-						     int nid)
-{
-	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
-					 lockdep_is_held(&shrinker_mutex));
-}
-
-static struct shrinker_info *shrinker_info_rcu(struct mem_cgroup *memcg,
-					       int nid)
-{
-	return rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
-}
-
-static int expand_one_shrinker_info(struct mem_cgroup *memcg,
-				    int map_size, int defer_size,
-				    int old_map_size, int old_defer_size,
-				    int new_nr_max)
-{
-	struct shrinker_info *new, *old;
-	struct mem_cgroup_per_node *pn;
-	int nid;
-	int size = map_size + defer_size;
-
-	for_each_node(nid) {
-		pn = memcg->nodeinfo[nid];
-		old = shrinker_info_protected(memcg, nid);
-		/* Not yet online memcg */
-		if (!old)
-			return 0;
-
-		/* Already expanded this shrinker_info */
-		if (new_nr_max <= old->map_nr_max)
-			continue;
-
-		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
-		if (!new)
-			return -ENOMEM;
-
-		new->nr_deferred = (atomic_long_t *)(new + 1);
-		new->map = (void *)new->nr_deferred + defer_size;
-		new->map_nr_max = new_nr_max;
-
-		/* map: set all old bits, clear all new bits */
-		memset(new->map, (int)0xff, old_map_size);
-		memset((void *)new->map + old_map_size, 0, map_size - old_map_size);
-		/* nr_deferred: copy old values, clear all new values */
-		memcpy(new->nr_deferred, old->nr_deferred, old_defer_size);
-		memset((void *)new->nr_deferred + old_defer_size, 0,
-		       defer_size - old_defer_size);
-
-		rcu_assign_pointer(pn->shrinker_info, new);
-		kvfree_rcu(old, rcu);
-	}
-
-	return 0;
-}
-
-void free_shrinker_info(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup_per_node *pn;
-	struct shrinker_info *info;
-	int nid;
-
-	for_each_node(nid) {
-		pn = memcg->nodeinfo[nid];
-		info = rcu_dereference_protected(pn->shrinker_info, true);
-		kvfree(info);
-		rcu_assign_pointer(pn->shrinker_info, NULL);
-	}
-}
-
-int alloc_shrinker_info(struct mem_cgroup *memcg)
-{
-	struct shrinker_info *info;
-	int nid, size, ret = 0;
-	int map_size, defer_size = 0;
-
-	mutex_lock(&shrinker_mutex);
-	map_size = shrinker_map_size(shrinker_nr_max);
-	defer_size = shrinker_defer_size(shrinker_nr_max);
-	size = map_size + defer_size;
-	for_each_node(nid) {
-		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
-		if (!info) {
-			free_shrinker_info(memcg);
-			ret = -ENOMEM;
-			break;
-		}
-		info->nr_deferred = (atomic_long_t *)(info + 1);
-		info->map = (void *)info->nr_deferred + defer_size;
-		info->map_nr_max = shrinker_nr_max;
-		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
-	}
-	mutex_unlock(&shrinker_mutex);
-
-	return ret;
-}
-
-static int expand_shrinker_info(int new_id)
-{
-	int ret = 0;
-	int new_nr_max = round_up(new_id + 1, BITS_PER_LONG);
-	int map_size, defer_size = 0;
-	int old_map_size, old_defer_size = 0;
-	struct mem_cgroup *memcg;
-
-	if (!root_mem_cgroup)
-		goto out;
-
-	lockdep_assert_held(&shrinker_mutex);
-
-	map_size = shrinker_map_size(new_nr_max);
-	defer_size = shrinker_defer_size(new_nr_max);
-	old_map_size = shrinker_map_size(shrinker_nr_max);
-	old_defer_size = shrinker_defer_size(shrinker_nr_max);
-
-	memcg = mem_cgroup_iter(NULL, NULL, NULL);
-	do {
-		ret = expand_one_shrinker_info(memcg, map_size, defer_size,
-					       old_map_size, old_defer_size,
-					       new_nr_max);
-		if (ret) {
-			mem_cgroup_iter_break(NULL, memcg);
-			goto out;
-		}
-	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-out:
-	if (!ret)
-		shrinker_nr_max = new_nr_max;
-
-	return ret;
-}
-
-void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
-{
-	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
-		struct shrinker_info *info;
-
-		rcu_read_lock();
-		info = shrinker_info_rcu(memcg, nid);
-		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
-			/* Pairs with smp mb in shrink_slab() */
-			smp_mb__before_atomic();
-			set_bit(shrinker_id, info->map);
-		}
-		rcu_read_unlock();
-	}
-}
-
-static DEFINE_IDR(shrinker_idr);
-
-static int prealloc_memcg_shrinker(struct shrinker *shrinker)
-{
-	int id, ret = -ENOMEM;
-
-	if (mem_cgroup_disabled())
-		return -ENOSYS;
-
-	mutex_lock(&shrinker_mutex);
-	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
-	if (id < 0)
-		goto unlock;
-
-	if (id >= shrinker_nr_max) {
-		if (expand_shrinker_info(id)) {
-			idr_remove(&shrinker_idr, id);
-			goto unlock;
-		}
-	}
-	shrinker->id = id;
-	ret = 0;
-unlock:
-	mutex_unlock(&shrinker_mutex);
-	return ret;
-}
-
-static void unregister_memcg_shrinker(struct shrinker *shrinker)
-{
-	int id = shrinker->id;
-
-	BUG_ON(id < 0);
-
-	lockdep_assert_held(&shrinker_mutex);
-
-	idr_remove(&shrinker_idr, id);
-}
-
-static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
-				   struct mem_cgroup *memcg)
-{
-	struct shrinker_info *info;
-	long nr_deferred;
-
-	rcu_read_lock();
-	info = shrinker_info_rcu(memcg, nid);
-	nr_deferred = atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
-	rcu_read_unlock();
-
-	return nr_deferred;
-}
-
-static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
-				  struct mem_cgroup *memcg)
-{
-	struct shrinker_info *info;
-	long nr_deferred;
-
-	rcu_read_lock();
-	info = shrinker_info_rcu(memcg, nid);
-	nr_deferred = atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
-	rcu_read_unlock();
-
-	return nr_deferred;
-}
-
-void reparent_shrinker_deferred(struct mem_cgroup *memcg)
-{
-	int i, nid;
-	long nr;
-	struct mem_cgroup *parent;
-	struct shrinker_info *child_info, *parent_info;
-
-	parent = parent_mem_cgroup(memcg);
-	if (!parent)
-		parent = root_mem_cgroup;
-
-	/* Prevent from concurrent shrinker_info expand */
-	mutex_lock(&shrinker_mutex);
-	for_each_node(nid) {
-		child_info = shrinker_info_protected(memcg, nid);
-		parent_info = shrinker_info_protected(parent, nid);
-		for (i = 0; i < child_info->map_nr_max; i++) {
-			nr = atomic_long_read(&child_info->nr_deferred[i]);
-			atomic_long_add(nr, &parent_info->nr_deferred[i]);
-		}
-	}
-	mutex_unlock(&shrinker_mutex);
-}
-
 static bool cgroup_reclaim(struct scan_control *sc)
 {
 	return sc->target_mem_cgroup;
@@ -479,27 +222,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 	return false;
 }
 #else
-static int prealloc_memcg_shrinker(struct shrinker *shrinker)
-{
-	return -ENOSYS;
-}
-
-static void unregister_memcg_shrinker(struct shrinker *shrinker)
-{
-}
-
-static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
-				   struct mem_cgroup *memcg)
-{
-	return 0;
-}
-
-static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
-				  struct mem_cgroup *memcg)
-{
-	return 0;
-}
-
 static bool cgroup_reclaim(struct scan_control *sc)
 {
 	return false;
@@ -568,39 +290,6 @@ static void flush_reclaim_state(struct scan_control *sc)
 	}
 }
 
-static long xchg_nr_deferred(struct shrinker *shrinker,
-			     struct shrink_control *sc)
-{
-	int nid = sc->nid;
-
-	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-		nid = 0;
-
-	if (sc->memcg &&
-	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
-		return xchg_nr_deferred_memcg(nid, shrinker,
-					      sc->memcg);
-
-	return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
-}
-
-
-static long add_nr_deferred(long nr, struct shrinker *shrinker,
-			    struct shrink_control *sc)
-{
-	int nid = sc->nid;
-
-	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
-		nid = 0;
-
-	if (sc->memcg &&
-	    (shrinker->flags & SHRINKER_MEMCG_AWARE))
-		return add_nr_deferred_memcg(nr, nid, shrinker,
-					     sc->memcg);
-
-	return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
-}
-
 static bool can_demote(int nid, struct scan_control *sc)
 {
 	if (!numa_demotion_enabled)
@@ -682,441 +371,6 @@ static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
 	return size;
 }
 
-/*
- * Add a shrinker callback to be called from the vm.
- */
-static int __prealloc_shrinker(struct shrinker *shrinker)
-{
-	unsigned int size;
-	int err;
-
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		err = prealloc_memcg_shrinker(shrinker);
-		if (err != -ENOSYS)
-			return err;
-
-		shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
-	}
-
-	size = sizeof(*shrinker->nr_deferred);
-	if (shrinker->flags & SHRINKER_NUMA_AWARE)
-		size *= nr_node_ids;
-
-	shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
-	if (!shrinker->nr_deferred)
-		return -ENOMEM;
-
-	return 0;
-}
-
-#ifdef CONFIG_SHRINKER_DEBUG
-int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
-	va_list ap;
-	int err;
-
-	va_start(ap, fmt);
-	shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
-	va_end(ap);
-	if (!shrinker->name)
-		return -ENOMEM;
-
-	err = __prealloc_shrinker(shrinker);
-	if (err) {
-		kfree_const(shrinker->name);
-		shrinker->name = NULL;
-	}
-
-	return err;
-}
-#else
-int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
-	return __prealloc_shrinker(shrinker);
-}
-#endif
-
-void free_prealloced_shrinker(struct shrinker *shrinker)
-{
-#ifdef CONFIG_SHRINKER_DEBUG
-	kfree_const(shrinker->name);
-	shrinker->name = NULL;
-#endif
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
-		mutex_lock(&shrinker_mutex);
-		unregister_memcg_shrinker(shrinker);
-		mutex_unlock(&shrinker_mutex);
-		return;
-	}
-
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
-}
-
-void register_shrinker_prepared(struct shrinker *shrinker)
-{
-	mutex_lock(&shrinker_mutex);
-	refcount_set(&shrinker->refcount, 1);
-	init_completion(&shrinker->completion_wait);
-	list_add_tail_rcu(&shrinker->list, &shrinker_list);
-	shrinker->flags |= SHRINKER_REGISTERED;
-	shrinker_debugfs_add(shrinker);
-	mutex_unlock(&shrinker_mutex);
-}
-
-static int __register_shrinker(struct shrinker *shrinker)
-{
-	int err = __prealloc_shrinker(shrinker);
-
-	if (err)
-		return err;
-	register_shrinker_prepared(shrinker);
-	return 0;
-}
-
-#ifdef CONFIG_SHRINKER_DEBUG
-int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
-	va_list ap;
-	int err;
-
-	va_start(ap, fmt);
-	shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap);
-	va_end(ap);
-	if (!shrinker->name)
-		return -ENOMEM;
-
-	err = __register_shrinker(shrinker);
-	if (err) {
-		kfree_const(shrinker->name);
-		shrinker->name = NULL;
-	}
-	return err;
-}
-#else
-int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
-{
-	return __register_shrinker(shrinker);
-}
-#endif
-EXPORT_SYMBOL(register_shrinker);
-
-/*
- * Remove one
- */
-void unregister_shrinker(struct shrinker *shrinker)
-{
-	struct dentry *debugfs_entry;
-	int debugfs_id;
-
-	if (!(shrinker->flags & SHRINKER_REGISTERED))
-		return;
-
-	shrinker_put(shrinker);
-	wait_for_completion(&shrinker->completion_wait);
-
-	mutex_lock(&shrinker_mutex);
-	list_del_rcu(&shrinker->list);
-	shrinker->flags &= ~SHRINKER_REGISTERED;
-	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
-		unregister_memcg_shrinker(shrinker);
-	debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
-	mutex_unlock(&shrinker_mutex);
-
-	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
-
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
-}
-EXPORT_SYMBOL(unregister_shrinker);
-
-struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
-					 scan_objects_cb scan, long batch,
-					 int seeks, unsigned flags,
-					 void *priv_data)
-{
-	struct shrinker *shrinker;
-
-	shrinker = kzalloc(sizeof(struct shrinker), GFP_KERNEL);
-	if (!shrinker)
-		return NULL;
-
-	shrinker->count_objects = count;
-	shrinker->scan_objects = scan;
-	shrinker->batch = batch;
-	shrinker->seeks = seeks;
-	shrinker->flags = flags;
-	shrinker->private_data = priv_data;
-
-	return shrinker;
-}
-EXPORT_SYMBOL(shrinker_alloc_and_init);
-
-void shrinker_free(struct shrinker *shrinker)
-{
-	kfree(shrinker);
-}
-EXPORT_SYMBOL(shrinker_free);
-
-void unregister_and_free_shrinker(struct shrinker *shrinker)
-{
-	unregister_shrinker(shrinker);
-	kfree_rcu(shrinker, rcu);
-}
-EXPORT_SYMBOL(unregister_and_free_shrinker);
-
-#define SHRINK_BATCH 128
-
-static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-				    struct shrinker *shrinker, int priority)
-{
-	unsigned long freed = 0;
-	unsigned long long delta;
-	long total_scan;
-	long freeable;
-	long nr;
-	long new_nr;
-	long batch_size = shrinker->batch ? shrinker->batch
-					  : SHRINK_BATCH;
-	long scanned = 0, next_deferred;
-
-	freeable = shrinker->count_objects(shrinker, shrinkctl);
-	if (freeable == 0 || freeable == SHRINK_EMPTY)
-		return freeable;
-
-	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
-	 */
-	nr = xchg_nr_deferred(shrinker, shrinkctl);
-
-	if (shrinker->seeks) {
-		delta = freeable >> priority;
-		delta *= 4;
-		do_div(delta, shrinker->seeks);
-	} else {
-		/*
-		 * These objects don't require any IO to create. Trim
-		 * them aggressively under memory pressure to keep
-		 * them from causing refetches in the IO caches.
-		 */
-		delta = freeable / 2;
-	}
-
-	total_scan = nr >> priority;
-	total_scan += delta;
-	total_scan = min(total_scan, (2 * freeable));
-
-	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
-				   freeable, delta, total_scan, priority);
-
-	/*
-	 * Normally, we should not scan less than batch_size objects in one
-	 * pass to avoid too frequent shrinker calls, but if the slab has less
-	 * than batch_size objects in total and we are really tight on memory,
-	 * we will try to reclaim all available objects, otherwise we can end
-	 * up failing allocations although there are plenty of reclaimable
-	 * objects spread over several slabs with usage less than the
-	 * batch_size.
-	 *
-	 * We detect the "tight on memory" situations by looking at the total
-	 * number of objects we want to scan (total_scan). If it is greater
-	 * than the total number of objects on slab (freeable), we must be
-	 * scanning at high prio and therefore should try to reclaim as much as
-	 * possible.
-	 */
-	while (total_scan >= batch_size ||
-	       total_scan >= freeable) {
-		unsigned long ret;
-		unsigned long nr_to_scan = min(batch_size, total_scan);
-
-		shrinkctl->nr_to_scan = nr_to_scan;
-		shrinkctl->nr_scanned = nr_to_scan;
-		ret = shrinker->scan_objects(shrinker, shrinkctl);
-		if (ret == SHRINK_STOP)
-			break;
-		freed += ret;
-
-		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
-		total_scan -= shrinkctl->nr_scanned;
-		scanned += shrinkctl->nr_scanned;
-
-		cond_resched();
-	}
-
-	/*
-	 * The deferred work is increased by any new work (delta) that wasn't
-	 * done, decreased by old deferred work that was done now.
-	 *
-	 * And it is capped to two times of the freeable items.
-	 */
-	next_deferred = max_t(long, (nr + delta - scanned), 0);
-	next_deferred = min(next_deferred, (2 * freeable));
-
-	/*
-	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates.
-	 */
-	new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
-
-	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
-	return freed;
-}
-
-#ifdef CONFIG_MEMCG
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
-			struct mem_cgroup *memcg, int priority)
-{
-	struct shrinker_info *info;
-	unsigned long ret, freed = 0;
-	int i = 0;
-
-	if (!mem_cgroup_online(memcg))
-		return 0;
-
-again:
-	rcu_read_lock();
-	info = shrinker_info_rcu(memcg, nid);
-	if (unlikely(!info))
-		goto unlock;
-
-	for_each_set_bit_from(i, info->map, info->map_nr_max) {
-		struct shrink_control sc = {
-			.gfp_mask = gfp_mask,
-			.nid = nid,
-			.memcg = memcg,
-		};
-		struct shrinker *shrinker;
-
-		shrinker = idr_find(&shrinker_idr, i);
-		if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
-			if (!shrinker)
-				clear_bit(i, info->map);
-			continue;
-		}
-
-		if (!shrinker_try_get(shrinker))
-			continue;
-		rcu_read_unlock();
-
-		/* Call non-slab shrinkers even though kmem is disabled */
-		if (!memcg_kmem_online() &&
-		    !(shrinker->flags & SHRINKER_NONSLAB))
-			continue;
-
-		ret = do_shrink_slab(&sc, shrinker, priority);
-		if (ret == SHRINK_EMPTY) {
-			clear_bit(i, info->map);
-			/*
-			 * After the shrinker reported that it had no objects to
-			 * free, but before we cleared the corresponding bit in
-			 * the memcg shrinker map, a new object might have been
-			 * added. To make sure, we have the bit set in this
-			 * case, we invoke the shrinker one more time and reset
-			 * the bit if it reports that it is not empty anymore.
-			 * The memory barrier here pairs with the barrier in
-			 * set_shrinker_bit():
-			 *
-			 * list_lru_add()     shrink_slab_memcg()
-			 *   list_add_tail()    clear_bit()
-			 *   <MB>               <MB>
-			 *   set_bit()          do_shrink_slab()
-			 */
-			smp_mb__after_atomic();
-			ret = do_shrink_slab(&sc, shrinker, priority);
-			if (ret == SHRINK_EMPTY)
-				ret = 0;
-			else
-				set_shrinker_bit(memcg, nid, i);
-		}
-		freed += ret;
-
-		shrinker_put(shrinker);
-
-		/*
-		 * We have already exited the read-side of rcu critical section
-		 * before calling do_shrink_slab(), the shrinker_info may be
-		 * released in expand_one_shrinker_info(), so restart the
-		 * iteration.
-		 */
-		i++;
-		goto again;
-	}
-unlock:
-	rcu_read_unlock();
-	return freed;
-}
-#else /* CONFIG_MEMCG */
-static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
-			struct mem_cgroup *memcg, int priority)
-{
-	return 0;
-}
-#endif /* CONFIG_MEMCG */
-
-/**
- * shrink_slab - shrink slab caches
- * @gfp_mask: allocation context
- * @nid: node whose slab caches to target
- * @memcg: memory cgroup whose slab caches to target
- * @priority: the reclaim priority
- *
- * Call the shrink functions to age shrinkable caches.
- *
- * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set,
- * unaware shrinkers will receive a node id of 0 instead.
- *
- * @memcg specifies the memory cgroup to target. Unaware shrinkers
- * are called only if it is the root cgroup.
- *
- * @priority is sc->priority, we take the number of objects and >> by priority
- * in order to get the scan target.
- *
- * Returns the number of reclaimed slab objects.
- */
-static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
-				 struct mem_cgroup *memcg,
-				 int priority)
-{
-	unsigned long ret, freed = 0;
-	struct shrinker *shrinker;
-
-	/*
-	 * The root memcg might be allocated even though memcg is disabled
-	 * via "cgroup_disable=memory" boot parameter.  This could make
-	 * mem_cgroup_is_root() return false, then just run memcg slab
-	 * shrink, but skip global shrink.  This may result in premature
-	 * oom.
-	 */
-	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
-		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
-		struct shrink_control sc = {
-			.gfp_mask = gfp_mask,
-			.nid = nid,
-			.memcg = memcg,
-		};
-
-		if (!shrinker_try_get(shrinker))
-			continue;
-		rcu_read_unlock();
-
-		ret = do_shrink_slab(&sc, shrinker, priority);
-		if (ret == SHRINK_EMPTY)
-			ret = 0;
-		freed += ret;
-
-		rcu_read_lock();
-		shrinker_put(shrinker);
-	}
-	rcu_read_unlock();
-	cond_resched();
-	return freed;
-}
-
 static unsigned long drop_slab_node(int nid)
 {
 	unsigned long freed = 0;
-- 
2.30.2
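
For illustration only, here is a minimal, hypothetical user of the dynamically
allocated shrinker API shown in the hunks above (struct demo_cache, demo_count(),
demo_scan() and the "demo-cache" name are made up; the list_lru is assumed to be
initialized elsewhere):

```c
#include <linux/shrinker.h>
#include <linux/list_lru.h>
#include <linux/slab.h>

struct demo_cache {
	struct list_lru lru;		/* assumed initialized elsewhere */
	struct shrinker *shrinker;
};

static unsigned long demo_count(struct shrinker *s, struct shrink_control *sc)
{
	struct demo_cache *cache = s->private_data;

	return list_lru_shrink_count(&cache->lru, sc);
}

static unsigned long demo_scan(struct shrinker *s, struct shrink_control *sc)
{
	/* s->private_data gives the demo_cache back; walk its lru here,
	 * free up to sc->nr_to_scan objects and return the number freed */
	return 0;
}

static int demo_cache_init(struct demo_cache *cache)
{
	int err;

	cache->shrinker = shrinker_alloc_and_init(demo_count, demo_scan,
						  0, DEFAULT_SEEKS,
						  SHRINKER_NUMA_AWARE, cache);
	if (!cache->shrinker)
		return -ENOMEM;

	err = register_shrinker(cache->shrinker, "demo-cache");
	if (err)
		shrinker_free(cache->shrinker);	/* registration failed */
	return err;
}

static void demo_cache_exit(struct demo_cache *cache)
{
	/* unregister, then free via kfree_rcu() once RCU readers are done */
	unregister_and_free_shrinker(cache->shrinker);
}
```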


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink
  2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
                   ` (28 preceding siblings ...)
  2023-06-22  8:53 ` [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file Qi Zheng
@ 2023-06-22  9:02 ` Qi Zheng
  29 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22  9:02 UTC (permalink / raw)
  To: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs

Hi all,

Well, this one was sent successfully.

Since I kept getting the following error message, I dropped the original
Cc recipients and kept only the mailing lists.

	4.7.1 Error: too many recipients from 49.7.199.173

Thanks,
Qi

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 01/29] mm: shrinker: add shrinker::private_data field
  2023-06-22  8:53 ` [PATCH 01/29] mm: shrinker: add shrinker::private_data field Qi Zheng
@ 2023-06-22 14:47   ` Vlastimil Babka
  2023-06-23 12:50     ` [External] " Qi Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2023-06-22 14:47 UTC (permalink / raw)
  To: Qi Zheng, akpm, david, tkhai, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs

On 6/22/23 10:53, Qi Zheng wrote:
> To prepare for the dynamic allocation of shrinker instances
> embedded in other structures, add a private_data field to
> struct shrinker, so that we can use shrinker::private_data
> to record and get the original embedded structure.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

I would fold this to 02/29, less churn.

> ---
>  include/linux/shrinker.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 224293b2dd06..43e6fcabbf51 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -70,6 +70,8 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
>  
> +	void *private_data;
> +
>  	/* These are for internal use */
>  	struct list_head list;
>  #ifdef CONFIG_MEMCG
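
For reference, a hypothetical embedded user would then recover its containing
object through the new field rather than container_of(), e.g. (sketch only;
foo_cache and nr_cached are made up):

```c
static unsigned long foo_shrink_count(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	/* the shrinker is now allocated separately from its owner, so
	 * container_of() no longer works; private_data was set to the
	 * owning foo_cache when the shrinker was allocated */
	struct foo_cache *foo = shrink->private_data;

	return READ_ONCE(foo->nr_cached);
}
```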


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file
  2023-06-22  8:53 ` [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file Qi Zheng
@ 2023-06-22 14:53   ` Vlastimil Babka
  2023-06-23 13:12     ` Qi Zheng
  2023-06-23  5:25   ` Sergey Senozhatsky
  1 sibling, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2023-06-22 14:53 UTC (permalink / raw)
  To: Qi Zheng, akpm, david, tkhai, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs

On 6/22/23 10:53, Qi Zheng wrote:
> The mm/vmscan.c file is too large, so separate the shrinker-related
> code from it into a separate file. No functional changes.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Maybe do this move as patch 01 so the further changes are done in the new
file already?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-22  8:53 ` [PATCH 24/29] mm: vmscan: make global slab shrink lockless Qi Zheng
@ 2023-06-22 15:12   ` Vlastimil Babka
  2023-06-22 16:42     ` Qi Zheng
  2023-06-23  6:29     ` Dave Chinner
  0 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2023-06-22 15:12 UTC (permalink / raw)
  To: Qi Zheng, akpm, david, tkhai, roman.gushchin, djwong, brauner,
	paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs

On 6/22/23 10:53, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in
> shrinkers subsystem, which protects most operations
> such as slab shrink, registration and unregistration
> of shrinkers, etc. This can easily cause problems in
> the following cases.
> 
> 1) When the memory pressure is high and there are many
>    filesystems mounted or unmounted at the same time,
>    slab shrink will be affected (down_read_trylock()
>    failed).
> 
>    Such as the real workload mentioned by Kirill Tkhai:
> 
>    ```
>    One of the real workloads from my experience is start
>    of an overcommitted node containing many starting
>    containers after node crash (or many resuming containers
>    after reboot for kernel update). In these cases memory
>    pressure is huge, and the node goes round in long reclaim.
>    ```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>    in [1]) and a writer comes in (such as mount a fs),
>    then this writer will be blocked and cause all
>    subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there
> may still be a problem. If we have a long shrinker list
> and we do not reclaim enough memory with each shrinker,
> then the down_read_trylock() may be called with high
> frequency. Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop
> in IPC (instructions per cycle).
> 
> We used to implement the lockless slab shrink with
> SRCU [1], but then kernel test robot reported -88.8%
> regression in stress-ng.ramfs.ops_per_sec test case [2],
> so we reverted it [3].
> 
> This commit uses the refcount+RCU method [4] proposed
> by Dave Chinner to re-implement the lockless global slab
> shrink. The memcg slab shrink is handled in the subsequent
> patch.
> 
> Currently, the shrinker instances can be divided into
> the following three types:
> 
> a) global shrinker instance statically defined in the kernel,
> such as workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel
> modules, such as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed.
> For case b, the memory of shrinker instance will be freed
> after the module is unloaded. But we will call synchronize_rcu()
> in free_module() to wait for RCU read-side critical section to
> exit. For case c, the memory of shrinker instance will be
> dynamically freed by calling kfree_rcu(). So we can use
> rcu_read_{lock,unlock}() to ensure that the shrinker instance
> is valid.
> 
> The shrinker::refcount mechanism ensures that the shrinker
> instance will not be run again after unregistration. So the
> structure that records the pointer of shrinker instance can be
> safely freed without waiting for the RCU read-side critical
> section.
> 
> In this way, while we implement the lockless slab shrink, we
> don't need to be blocked in unregister_shrinker() to wait
> RCU read-side critical section.
> 
> The following are the test results:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) Before applying this patchset:
> 
>  setting to a 60 second run per stressor
>  dispatching hogs: 9 ramfs
>  stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>                            (secs)    (secs)    (secs)   (real time) (usr+sys time)
>  ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
>  ramfs:
>           1 System Management Interrupt
>  for a 60.03s run time:
>     5762.40s available CPU time
>        7.71s user time   (  0.13%)
>      226.93s system time (  3.94%)
>      234.64s total time  (  4.07%)
>  load average: 8.54 3.06 2.11
>  passed: 9: ramfs (9)
>  failed: 0
>  skipped: 0
>  successful run completed in 60.03s (1 min, 0.03 secs)
> 
> 2) After applying this patchset:
> 
>  setting to a 60 second run per stressor
>  dispatching hogs: 9 ramfs
>  stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>                            (secs)    (secs)    (secs)   (real time) (usr+sys time)
>  ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
>  ramfs:
>           4 System Management Interrupts
>  for a 60.12s run time:
>     5771.95s available CPU time
>        7.44s user time   (  0.13%)
>      230.22s system time (  3.99%)
>      237.66s total time  (  4.12%)
>  load average: 8.18 2.43 0.84
>  passed: 9: ramfs (9)
>  failed: 0
>  skipped: 0
>  successful run completed in 60.12s (1 min, 0.12 secs)
> 
> We can see that the ops/s has hardly changed.
> 
> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/shrinker.h |  6 ++++++
>  mm/vmscan.c              | 33 ++++++++++++++-------------------
>  2 files changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 7bfeb2f25246..b0c6c2df9db8 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -74,6 +74,7 @@ struct shrinker {
>  
>  	refcount_t refcount;
>  	struct completion completion_wait;
> +	struct rcu_head rcu;
>  
>  	void *private_data;
>  
> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>  void shrinker_free(struct shrinker *shrinker);
>  void unregister_and_free_shrinker(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
>  static inline void shrinker_put(struct shrinker *shrinker)
>  {
>  	if (refcount_dec_and_test(&shrinker->refcount))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6f9c4750effa..767569698946 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
> +#include <linux/rculist.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>  	down_write(&shrinker_rwsem);
>  	refcount_set(&shrinker->refcount, 1);
>  	init_completion(&shrinker->completion_wait);
> -	list_add_tail(&shrinker->list, &shrinker_list);
> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>  	shrinker->flags |= SHRINKER_REGISTERED;
>  	shrinker_debugfs_add(shrinker);
>  	up_write(&shrinker_rwsem);
> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>  	wait_for_completion(&shrinker->completion_wait);
>  
>  	down_write(&shrinker_rwsem);
> -	list_del(&shrinker->list);
> +	list_del_rcu(&shrinker->list);
>  	shrinker->flags &= ~SHRINKER_REGISTERED;
>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>  		unregister_memcg_shrinker(shrinker);
> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>  void unregister_and_free_shrinker(struct shrinker *shrinker)
>  {
>  	unregister_shrinker(shrinker);
> -	kfree(shrinker);
> +	kfree_rcu(shrinker, rcu);
>  }
>  EXPORT_SYMBOL(unregister_and_free_shrinker);
>  
> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
>  
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +		rcu_read_unlock();

I don't think you can do this unlock?

> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> -		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> -		 */
> -		if (rwsem_is_contended(&shrinker_rwsem)) {
> -			freed = freed ? : 1;
> -			break;
> -		}
> -	}
>  
> -	up_read(&shrinker_rwsem);
> -out:
> +		rcu_read_lock();

That new rcu_read_lock() won't help AFAIK, the whole
list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
safe.

IIUC this is why Dave in [4] suggests unifying shrink_slab() with
shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.

> +		shrinker_put(shrinker);
> +	}
> +	rcu_read_unlock();
>  	cond_resched();
>  	return freed;
>  }


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-22 15:12   ` Vlastimil Babka
@ 2023-06-22 16:42     ` Qi Zheng
  2023-06-22 17:41       ` Alan Huang
  2023-06-23  6:29     ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-06-22 16:42 UTC (permalink / raw)
  To: Vlastimil Babka, akpm, david, tkhai, roman.gushchin, djwong,
	brauner, paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs



On 2023/6/22 23:12, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
>> The shrinker_rwsem is a global read-write lock in
>> shrinkers subsystem, which protects most operations
>> such as slab shrink, registration and unregistration
>> of shrinkers, etc. This can easily cause problems in
>> the following cases.
>>
>> 1) When the memory pressure is high and there are many
>>     filesystems mounted or unmounted at the same time,
>>     slab shrink will be affected (down_read_trylock()
>>     failed).
>>
>>     Such as the real workload mentioned by Kirill Tkhai:
>>
>>     ```
>>     One of the real workloads from my experience is start
>>     of an overcommitted node containing many starting
>>     containers after node crash (or many resuming containers
>>     after reboot for kernel update). In these cases memory
>>     pressure is huge, and the node goes round in long reclaim.
>>     ```
>>
>> 2) If a shrinker is blocked (such as the case mentioned
>>     in [1]) and a writer comes in (such as mount a fs),
>>     then this writer will be blocked and cause all
>>     subsequent shrinker-related operations to be blocked.
>>
>> Even if there is no competitor when shrinking slab, there
>> may still be a problem. If we have a long shrinker list
>> and we do not reclaim enough memory with each shrinker,
>> then the down_read_trylock() may be called with high
>> frequency. Because of the poor multicore scalability of
>> atomic operations, this can lead to a significant drop
>> in IPC (instructions per cycle).
>>
>> We used to implement the lockless slab shrink with
>> SRCU [1], but then kernel test robot reported -88.8%
>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>> so we reverted it [3].
>>
>> This commit uses the refcount+RCU method [4] proposed
>> by Dave Chinner to re-implement the lockless global slab
>> shrink. The memcg slab shrink is handled in the subsequent
>> patch.
>>
>> Currently, the shrinker instances can be divided into
>> the following three types:
>>
>> a) global shrinker instance statically defined in the kernel,
>> such as workingset_shadow_shrinker.
>>
>> b) global shrinker instance statically defined in the kernel
>> modules, such as mmu_shrinker in x86.
>>
>> c) shrinker instance embedded in other structures.
>>
>> For case a, the memory of shrinker instance is never freed.
>> For case b, the memory of shrinker instance will be freed
>> after the module is unloaded. But we will call synchronize_rcu()
>> in free_module() to wait for RCU read-side critical section to
>> exit. For case c, the memory of shrinker instance will be
>> dynamically freed by calling kfree_rcu(). So we can use
>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>> is valid.
>>
>> The shrinker::refcount mechanism ensures that the shrinker
>> instance will not be run again after unregistration. So the
>> structure that records the pointer of shrinker instance can be
>> safely freed without waiting for the RCU read-side critical
>> section.
>>
>> In this way, while we implement the lockless slab shrink, we
>> don't need to be blocked in unregister_shrinker() to wait
>> RCU read-side critical section.
>>
>> The following are the test results:
>>
>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>
>> 1) Before applying this patchset:
>>
>>   setting to a 60 second run per stressor
>>   dispatching hogs: 9 ramfs
>>   stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>                             (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>   ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
>>   ramfs:
>>            1 System Management Interrupt
>>   for a 60.03s run time:
>>      5762.40s available CPU time
>>         7.71s user time   (  0.13%)
>>       226.93s system time (  3.94%)
>>       234.64s total time  (  4.07%)
>>   load average: 8.54 3.06 2.11
>>   passed: 9: ramfs (9)
>>   failed: 0
>>   skipped: 0
>>   successful run completed in 60.03s (1 min, 0.03 secs)
>>
>> 2) After applying this patchset:
>>
>>   setting to a 60 second run per stressor
>>   dispatching hogs: 9 ramfs
>>   stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>                             (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>   ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
>>   ramfs:
>>            4 System Management Interrupts
>>   for a 60.12s run time:
>>      5771.95s available CPU time
>>         7.44s user time   (  0.13%)
>>       230.22s system time (  3.99%)
>>       237.66s total time  (  4.12%)
>>   load average: 8.18 2.43 0.84
>>   passed: 9: ramfs (9)
>>   failed: 0
>>   skipped: 0
>>   successful run completed in 60.12s (1 min, 0.12 secs)
>>
>> We can see that the ops/s has hardly changed.
>>
>> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
>> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
>> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
>> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/shrinker.h |  6 ++++++
>>   mm/vmscan.c              | 33 ++++++++++++++-------------------
>>   2 files changed, 20 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 7bfeb2f25246..b0c6c2df9db8 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -74,6 +74,7 @@ struct shrinker {
>>   
>>   	refcount_t refcount;
>>   	struct completion completion_wait;
>> +	struct rcu_head rcu;
>>   
>>   	void *private_data;
>>   
>> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>>   void shrinker_free(struct shrinker *shrinker);
>>   void unregister_and_free_shrinker(struct shrinker *shrinker);
>>   
>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>> +{
>> +	return refcount_inc_not_zero(&shrinker->refcount);
>> +}
>> +
>>   static inline void shrinker_put(struct shrinker *shrinker)
>>   {
>>   	if (refcount_dec_and_test(&shrinker->refcount))
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6f9c4750effa..767569698946 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -57,6 +57,7 @@
>>   #include <linux/khugepaged.h>
>>   #include <linux/rculist_nulls.h>
>>   #include <linux/random.h>
>> +#include <linux/rculist.h>
>>   
>>   #include <asm/tlbflush.h>
>>   #include <asm/div64.h>
>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>>   	down_write(&shrinker_rwsem);
>>   	refcount_set(&shrinker->refcount, 1);
>>   	init_completion(&shrinker->completion_wait);
>> -	list_add_tail(&shrinker->list, &shrinker_list);
>> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>   	shrinker->flags |= SHRINKER_REGISTERED;
>>   	shrinker_debugfs_add(shrinker);
>>   	up_write(&shrinker_rwsem);
>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>   	wait_for_completion(&shrinker->completion_wait);
>>   
>>   	down_write(&shrinker_rwsem);
>> -	list_del(&shrinker->list);
>> +	list_del_rcu(&shrinker->list);
>>   	shrinker->flags &= ~SHRINKER_REGISTERED;
>>   	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>   		unregister_memcg_shrinker(shrinker);
>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>>   void unregister_and_free_shrinker(struct shrinker *shrinker)
>>   {
>>   	unregister_shrinker(shrinker);
>> -	kfree(shrinker);
>> +	kfree_rcu(shrinker, rcu);
>>   }
>>   EXPORT_SYMBOL(unregister_and_free_shrinker);
>>   
>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>   
>> -	if (!down_read_trylock(&shrinker_rwsem))
>> -		goto out;
>> -
>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>   		struct shrink_control sc = {
>>   			.gfp_mask = gfp_mask,
>>   			.nid = nid,
>>   			.memcg = memcg,
>>   		};
>>   
>> +		if (!shrinker_try_get(shrinker))
>> +			continue;
>> +		rcu_read_unlock();
> 
> I don't think you can do this unlock?
> 
>> +
>>   		ret = do_shrink_slab(&sc, shrinker, priority);
>>   		if (ret == SHRINK_EMPTY)
>>   			ret = 0;
>>   		freed += ret;
>> -		/*
>> -		 * Bail out if someone want to register a new shrinker to
>> -		 * prevent the registration from being stalled for long periods
>> -		 * by parallel ongoing shrinking.
>> -		 */
>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>> -			freed = freed ? : 1;
>> -			break;
>> -		}
>> -	}
>>   
>> -	up_read(&shrinker_rwsem);
>> -out:
>> +		rcu_read_lock();
> 
> That new rcu_read_lock() won't help AFAIK, the whole
> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> safe.

In the unregister_shrinker() path, we wait for the refcount to drop to
zero before deleting the shrinker from the linked list. Here, we first
take the RCU lock and then decrement the refcount of this shrinker.

     shrink_slab                 unregister_shrinker
     ===========                 ===================
				
				/* wait for B */
				wait_for_completion()
   rcu_read_lock()

   shrinker_put() --> (B)
				list_del_rcu()
                                 /* wait for rcu_read_unlock() */
				kfree_rcu()

   /*
    * so this shrinker will not be freed here,
    * and can be used to traverse the next node
    * normally?
    */
   list_for_each_entry()

   shrinker_try_get()
   rcu_read_unlock()

Did I miss something?
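
To spell out the invariant I'm relying on, here is the loop again with
annotations (simplified sketch, not the exact patch code):

```c
rcu_read_lock();
list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
	if (!shrinker_try_get(shrinker))
		continue;		/* being unregistered, skip it */
	rcu_read_unlock();		/* 'shrinker' is pinned by the
					 * refcount: unregister_shrinker()
					 * waits in wait_for_completion()
					 * before list_del_rcu()/kfree_rcu(),
					 * so it stays linked and allocated */

	ret = do_shrink_slab(&sc, shrinker, priority);

	rcu_read_lock();		/* back inside the read section
					 * before dropping the pin ... */
	shrinker_put(shrinker);		/* ... so the ->next we load on the
					 * next iteration is read under RCU */
}
rcu_read_unlock();
```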

> 
> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
> 
>> +		shrinker_put(shrinker);
>> +	}
>> +	rcu_read_unlock();
>>   	cond_resched();
>>   	return freed;
>>   }
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-22 16:42     ` Qi Zheng
@ 2023-06-22 17:41       ` Alan Huang
  2023-06-22 18:18         ` Qi Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Alan Huang @ 2023-06-22 17:41 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Vlastimil Babka, akpm, Dave Chinner, tkhai, roman.gushchin,
	Darrick J. Wong, brauner, paulmck, tytso, linux-kernel, linux-mm,
	intel-gfx, dri-devel, linux-arm-msm, dm-devel, linux-raid,
	linux-bcache, virtualization, linux-fsdevel, linux-ext4,
	linux-nfs, linux-xfs, linux-btrfs


> On Jun 23, 2023, at 00:42, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> 
> 
> On 2023/6/22 23:12, Vlastimil Babka wrote:
>> On 6/22/23 10:53, Qi Zheng wrote:
>>> The shrinker_rwsem is a global read-write lock in
>>> shrinkers subsystem, which protects most operations
>>> such as slab shrink, registration and unregistration
>>> of shrinkers, etc. This can easily cause problems in
>>> the following cases.
>>> 
>>> 1) When the memory pressure is high and there are many
>>>    filesystems mounted or unmounted at the same time,
>>>    slab shrink will be affected (down_read_trylock()
>>>    failed).
>>> 
>>>    Such as the real workload mentioned by Kirill Tkhai:
>>> 
>>>    ```
>>>    One of the real workloads from my experience is start
>>>    of an overcommitted node containing many starting
>>>    containers after node crash (or many resuming containers
>>>    after reboot for kernel update). In these cases memory
>>>    pressure is huge, and the node goes round in long reclaim.
>>>    ```
>>> 
>>> 2) If a shrinker is blocked (such as the case mentioned
>>>    in [1]) and a writer comes in (such as mount a fs),
>>>    then this writer will be blocked and cause all
>>>    subsequent shrinker-related operations to be blocked.
>>> 
>>> Even if there is no competitor when shrinking slab, there
>>> may still be a problem. If we have a long shrinker list
>>> and we do not reclaim enough memory with each shrinker,
>>> then the down_read_trylock() may be called with high
>>> frequency. Because of the poor multicore scalability of
>>> atomic operations, this can lead to a significant drop
>>> in IPC (instructions per cycle).
>>> 
>>> We used to implement the lockless slab shrink with
>>> SRCU [1], but then kernel test robot reported -88.8%
>>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>>> so we reverted it [3].
>>> 
>>> This commit uses the refcount+RCU method [4] proposed
>>> by Dave Chinner to re-implement the lockless global slab
>>> shrink. The memcg slab shrink is handled in the subsequent
>>> patch.
>>> 
>>> Currently, the shrinker instances can be divided into
>>> the following three types:
>>> 
>>> a) global shrinker instance statically defined in the kernel,
>>> such as workingset_shadow_shrinker.
>>> 
>>> b) global shrinker instance statically defined in the kernel
>>> modules, such as mmu_shrinker in x86.
>>> 
>>> c) shrinker instance embedded in other structures.
>>> 
>>> For case a, the memory of shrinker instance is never freed.
>>> For case b, the memory of shrinker instance will be freed
>>> after the module is unloaded. But we will call synchronize_rcu()
>>> in free_module() to wait for RCU read-side critical section to
>>> exit. For case c, the memory of shrinker instance will be
>>> dynamically freed by calling kfree_rcu(). So we can use
>>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>>> is valid.
>>> 
>>> The shrinker::refcount mechanism ensures that the shrinker
>>> instance will not be run again after unregistration. So the
>>> structure that records the pointer of shrinker instance can be
>>> safely freed without waiting for the RCU read-side critical
>>> section.
>>> 
>>> In this way, while we implement the lockless slab shrink, we
>>> don't need to be blocked in unregister_shrinker() to wait
>>> RCU read-side critical section.
>>> 
>>> The following are the test results:
>>> 
>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>> 
>>> 1) Before applying this patchset:
>>> 
>>>  setting to a 60 second run per stressor
>>>  dispatching hogs: 9 ramfs
>>>  stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>                            (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>  ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
>>>  ramfs:
>>>           1 System Management Interrupt
>>>  for a 60.03s run time:
>>>     5762.40s available CPU time
>>>        7.71s user time   (  0.13%)
>>>      226.93s system time (  3.94%)
>>>      234.64s total time  (  4.07%)
>>>  load average: 8.54 3.06 2.11
>>>  passed: 9: ramfs (9)
>>>  failed: 0
>>>  skipped: 0
>>>  successful run completed in 60.03s (1 min, 0.03 secs)
>>> 
>>> 2) After applying this patchset:
>>> 
>>>  setting to a 60 second run per stressor
>>>  dispatching hogs: 9 ramfs
>>>  stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>                            (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>  ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
>>>  ramfs:
>>>           4 System Management Interrupts
>>>  for a 60.12s run time:
>>>     5771.95s available CPU time
>>>        7.44s user time   (  0.13%)
>>>      230.22s system time (  3.99%)
>>>      237.66s total time  (  4.12%)
>>>  load average: 8.18 2.43 0.84
>>>  passed: 9: ramfs (9)
>>>  failed: 0
>>>  skipped: 0
>>>  successful run completed in 60.12s (1 min, 0.12 secs)
>>> 
>>> We can see that the ops/s has hardly changed.
>>> 
>>> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
>>> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
>>> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
>>> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
>>> 
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>>  include/linux/shrinker.h |  6 ++++++
>>>  mm/vmscan.c              | 33 ++++++++++++++-------------------
>>>  2 files changed, 20 insertions(+), 19 deletions(-)
>>> 
>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>>> index 7bfeb2f25246..b0c6c2df9db8 100644
>>> --- a/include/linux/shrinker.h
>>> +++ b/include/linux/shrinker.h
>>> @@ -74,6 +74,7 @@ struct shrinker {
>>>    	refcount_t refcount;
>>>  	struct completion completion_wait;
>>> +	struct rcu_head rcu;
>>>    	void *private_data;
>>>  @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>>>  void shrinker_free(struct shrinker *shrinker);
>>>  void unregister_and_free_shrinker(struct shrinker *shrinker);
>>>  +static inline bool shrinker_try_get(struct shrinker *shrinker)
>>> +{
>>> +	return refcount_inc_not_zero(&shrinker->refcount);
>>> +}
>>> +
>>>  static inline void shrinker_put(struct shrinker *shrinker)
>>>  {
>>>  	if (refcount_dec_and_test(&shrinker->refcount))
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 6f9c4750effa..767569698946 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -57,6 +57,7 @@
>>>  #include <linux/khugepaged.h>
>>>  #include <linux/rculist_nulls.h>
>>>  #include <linux/random.h>
>>> +#include <linux/rculist.h>
>>>    #include <asm/tlbflush.h>
>>>  #include <asm/div64.h>
>>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>>>  	down_write(&shrinker_rwsem);
>>>  	refcount_set(&shrinker->refcount, 1);
>>>  	init_completion(&shrinker->completion_wait);
>>> -	list_add_tail(&shrinker->list, &shrinker_list);
>>> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>  	shrinker->flags |= SHRINKER_REGISTERED;
>>>  	shrinker_debugfs_add(shrinker);
>>>  	up_write(&shrinker_rwsem);
>>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>  	wait_for_completion(&shrinker->completion_wait);
>>>    	down_write(&shrinker_rwsem);
>>> -	list_del(&shrinker->list);
>>> +	list_del_rcu(&shrinker->list);
>>>  	shrinker->flags &= ~SHRINKER_REGISTERED;
>>>  	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>  		unregister_memcg_shrinker(shrinker);
>>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>>>  void unregister_and_free_shrinker(struct shrinker *shrinker)
>>>  {
>>>  	unregister_shrinker(shrinker);
>>> -	kfree(shrinker);
>>> +	kfree_rcu(shrinker, rcu);
>>>  }
>>>  EXPORT_SYMBOL(unregister_and_free_shrinker);
>>>  @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>  -	if (!down_read_trylock(&shrinker_rwsem))
>>> -		goto out;
>>> -
>>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>>> +	rcu_read_lock();
>>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>  		struct shrink_control sc = {
>>>  			.gfp_mask = gfp_mask,
>>>  			.nid = nid,
>>>  			.memcg = memcg,
>>>  		};
>>>  +		if (!shrinker_try_get(shrinker))
>>> +			continue;
>>> +		rcu_read_unlock();
>> I don't think you can do this unlock?
>>> +
>>>  		ret = do_shrink_slab(&sc, shrinker, priority);
>>>  		if (ret == SHRINK_EMPTY)
>>>  			ret = 0;
>>>  		freed += ret;
>>> -		/*
>>> -		 * Bail out if someone want to register a new shrinker to
>>> -		 * prevent the registration from being stalled for long periods
>>> -		 * by parallel ongoing shrinking.
>>> -		 */
>>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>>> -			freed = freed ? : 1;
>>> -			break;
>>> -		}
>>> -	}
>>>  -	up_read(&shrinker_rwsem);
>>> -out:
>>> +		rcu_read_lock();
>> That new rcu_read_lock() won't help AFAIK, the whole
>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>> safe.
> 
> In the unregister_shrinker() path, we will wait for the refcount to zero
> before deleting the shrinker from the linked list. Here, we first took
> the rcu lock, and then decrement the refcount of this shrinker.
> 
>    shrink_slab                 unregister_shrinker
>    ===========                 ===================
> 				
> 				/* wait for B */
> 				wait_for_completion()
>  rcu_read_lock()
> 
>  shrinker_put() --> (B)
> 				list_del_rcu()
>                                /* wait for rcu_read_unlock() */
> 				kfree_rcu()
> 
>  /*
>   * so this shrinker will not be freed here,
>   * and can be used to traverse the next node
>   * normally?
>   */
>  list_for_each_entry()
> 
>  shrinker_try_get()
>  rcu_read_unlock()
> 
> Did I miss something?

After calling rcu_read_unlock(), the next shrinker in the list can be freed,
so in the next iteration a use-after-free might happen?

Is that right?

> 
>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>>> +		shrinker_put(shrinker);
>>> +	}
>>> +	rcu_read_unlock();
>>>  	cond_resched();
>>>  	return freed;
>>>  }


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-22 17:41       ` Alan Huang
@ 2023-06-22 18:18         ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-22 18:18 UTC (permalink / raw)
  To: Alan Huang
  Cc: Vlastimil Babka, akpm, Dave Chinner, tkhai, roman.gushchin,
	Darrick J. Wong, brauner, paulmck, tytso, linux-kernel, linux-mm,
	intel-gfx, dri-devel, linux-arm-msm, dm-devel, linux-raid,
	linux-bcache, virtualization, linux-fsdevel, linux-ext4,
	linux-nfs, linux-xfs, linux-btrfs



On 2023/6/23 01:41, Alan Huang wrote:
> 
>> On Jun 23, 2023, at 00:42, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>
>>
>>
>> On 2023/6/22 23:12, Vlastimil Babka wrote:
>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>> The shrinker_rwsem is a global read-write lock in
>>>> shrinkers subsystem, which protects most operations
>>>> such as slab shrink, registration and unregistration
>>>> of shrinkers, etc. This can easily cause problems in
>>>> the following cases.
>>>>
>>>> 1) When the memory pressure is high and there are many
>>>>     filesystems mounted or unmounted at the same time,
>>>>     slab shrink will be affected (down_read_trylock()
>>>>     failed).
>>>>
>>>>     Such as the real workload mentioned by Kirill Tkhai:
>>>>
>>>>     ```
>>>>     One of the real workloads from my experience is start
>>>>     of an overcommitted node containing many starting
>>>>     containers after node crash (or many resuming containers
>>>>     after reboot for kernel update). In these cases memory
>>>>     pressure is huge, and the node goes round in long reclaim.
>>>>     ```
>>>>
>>>> 2) If a shrinker is blocked (such as the case mentioned
>>>>     in [1]) and a writer comes in (such as mount a fs),
>>>>     then this writer will be blocked and cause all
>>>>     subsequent shrinker-related operations to be blocked.
>>>>
>>>> Even if there is no competitor when shrinking slab, there
>>>> may still be a problem. If we have a long shrinker list
>>>> and we do not reclaim enough memory with each shrinker,
>>>> then the down_read_trylock() may be called with high
>>>> frequency. Because of the poor multicore scalability of
>>>> atomic operations, this can lead to a significant drop
>>>> in IPC (instructions per cycle).
>>>>
>>>> We used to implement the lockless slab shrink with
>>>> SRCU [1], but then kernel test robot reported -88.8%
>>>> regression in stress-ng.ramfs.ops_per_sec test case [2],
>>>> so we reverted it [3].
>>>>
>>>> This commit uses the refcount+RCU method [4] proposed
>>>> by Dave Chinner to re-implement the lockless global slab
>>>> shrink. The memcg slab shrink is handled in the subsequent
>>>> patch.
>>>>
>>>> Currently, the shrinker instances can be divided into
>>>> the following three types:
>>>>
>>>> a) global shrinker instance statically defined in the kernel,
>>>> such as workingset_shadow_shrinker.
>>>>
>>>> b) global shrinker instance statically defined in the kernel
>>>> modules, such as mmu_shrinker in x86.
>>>>
>>>> c) shrinker instance embedded in other structures.
>>>>
>>>> For case a, the memory of shrinker instance is never freed.
>>>> For case b, the memory of shrinker instance will be freed
>>>> after the module is unloaded. But we will call synchronize_rcu()
>>>> in free_module() to wait for RCU read-side critical section to
>>>> exit. For case c, the memory of shrinker instance will be
>>>> dynamically freed by calling kfree_rcu(). So we can use
>>>> rcu_read_{lock,unlock}() to ensure that the shrinker instance
>>>> is valid.
>>>>
>>>> The shrinker::refcount mechanism ensures that the shrinker
>>>> instance will not be run again after unregistration. So the
>>>> structure that records the pointer of shrinker instance can be
>>>> safely freed without waiting for the RCU read-side critical
>>>> section.
>>>>
>>>> In this way, while we implement the lockless slab shrink, we
>>>> don't need to be blocked in unregister_shrinker() to wait
>>>> RCU read-side critical section.
>>>>
>>>> The following are the test results:
>>>>
>>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>>>
>>>> 1) Before applying this patchset:
>>>>
>>>>   setting to a 60 second run per stressor
>>>>   dispatching hogs: 9 ramfs
>>>>   stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>>                             (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>>   ramfs            880623     60.02      7.71    226.93     14671.45        3753.09
>>>>   ramfs:
>>>>            1 System Management Interrupt
>>>>   for a 60.03s run time:
>>>>      5762.40s available CPU time
>>>>         7.71s user time   (  0.13%)
>>>>       226.93s system time (  3.94%)
>>>>       234.64s total time  (  4.07%)
>>>>   load average: 8.54 3.06 2.11
>>>>   passed: 9: ramfs (9)
>>>>   failed: 0
>>>>   skipped: 0
>>>>   successful run completed in 60.03s (1 min, 0.03 secs)
>>>>
>>>> 2) After applying this patchset:
>>>>
>>>>   setting to a 60 second run per stressor
>>>>   dispatching hogs: 9 ramfs
>>>>   stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>>                             (secs)    (secs)    (secs)   (real time) (usr+sys time)
>>>>   ramfs            847562     60.02      7.44    230.22     14120.66        3566.23
>>>>   ramfs:
>>>>            4 System Management Interrupts
>>>>   for a 60.12s run time:
>>>>      5771.95s available CPU time
>>>>         7.44s user time   (  0.13%)
>>>>       230.22s system time (  3.99%)
>>>>       237.66s total time  (  4.12%)
>>>>   load average: 8.18 2.43 0.84
>>>>   passed: 9: ramfs (9)
>>>>   failed: 0
>>>>   skipped: 0
>>>>   successful run completed in 60.12s (1 min, 0.12 secs)
>>>>
>>>> We can see that the ops/s has hardly changed.
>>>>
>>>> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
>>>> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
>>>> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
>>>> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>>   include/linux/shrinker.h |  6 ++++++
>>>>   mm/vmscan.c              | 33 ++++++++++++++-------------------
>>>>   2 files changed, 20 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>>>> index 7bfeb2f25246..b0c6c2df9db8 100644
>>>> --- a/include/linux/shrinker.h
>>>> +++ b/include/linux/shrinker.h
>>>> @@ -74,6 +74,7 @@ struct shrinker {
>>>>     	refcount_t refcount;
>>>>   	struct completion completion_wait;
>>>> +	struct rcu_head rcu;
>>>>     	void *private_data;
>>>>   @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>>>>   void shrinker_free(struct shrinker *shrinker);
>>>>   void unregister_and_free_shrinker(struct shrinker *shrinker);
>>>>   +static inline bool shrinker_try_get(struct shrinker *shrinker)
>>>> +{
>>>> +	return refcount_inc_not_zero(&shrinker->refcount);
>>>> +}
>>>> +
>>>>   static inline void shrinker_put(struct shrinker *shrinker)
>>>>   {
>>>>   	if (refcount_dec_and_test(&shrinker->refcount))
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 6f9c4750effa..767569698946 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -57,6 +57,7 @@
>>>>   #include <linux/khugepaged.h>
>>>>   #include <linux/rculist_nulls.h>
>>>>   #include <linux/random.h>
>>>> +#include <linux/rculist.h>
>>>>     #include <asm/tlbflush.h>
>>>>   #include <asm/div64.h>
>>>> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
>>>>   	down_write(&shrinker_rwsem);
>>>>   	refcount_set(&shrinker->refcount, 1);
>>>>   	init_completion(&shrinker->completion_wait);
>>>> -	list_add_tail(&shrinker->list, &shrinker_list);
>>>> +	list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>>   	shrinker->flags |= SHRINKER_REGISTERED;
>>>>   	shrinker_debugfs_add(shrinker);
>>>>   	up_write(&shrinker_rwsem);
>>>> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>>   	wait_for_completion(&shrinker->completion_wait);
>>>>     	down_write(&shrinker_rwsem);
>>>> -	list_del(&shrinker->list);
>>>> +	list_del_rcu(&shrinker->list);
>>>>   	shrinker->flags &= ~SHRINKER_REGISTERED;
>>>>   	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>>   		unregister_memcg_shrinker(shrinker);
>>>> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
>>>>   void unregister_and_free_shrinker(struct shrinker *shrinker)
>>>>   {
>>>>   	unregister_shrinker(shrinker);
>>>> -	kfree(shrinker);
>>>> +	kfree_rcu(shrinker, rcu);
>>>>   }
>>>>   EXPORT_SYMBOL(unregister_and_free_shrinker);
>>>>   @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>   -	if (!down_read_trylock(&shrinker_rwsem))
>>>> -		goto out;
>>>> -
>>>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>>>> +	rcu_read_lock();
>>>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>   		struct shrink_control sc = {
>>>>   			.gfp_mask = gfp_mask,
>>>>   			.nid = nid,
>>>>   			.memcg = memcg,
>>>>   		};
>>>>   +		if (!shrinker_try_get(shrinker))
>>>> +			continue;
>>>> +		rcu_read_unlock();
>>> I don't think you can do this unlock?
>>>> +
>>>>   		ret = do_shrink_slab(&sc, shrinker, priority);
>>>>   		if (ret == SHRINK_EMPTY)
>>>>   			ret = 0;
>>>>   		freed += ret;
>>>> -		/*
>>>> -		 * Bail out if someone want to register a new shrinker to
>>>> -		 * prevent the registration from being stalled for long periods
>>>> -		 * by parallel ongoing shrinking.
>>>> -		 */
>>>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> -			freed = freed ? : 1;
>>>> -			break;
>>>> -		}
>>>> -	}
>>>>   -	up_read(&shrinker_rwsem);
>>>> -out:
>>>> +		rcu_read_lock();
>>> That new rcu_read_lock() won't help AFAIK, the whole
>>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>>> safe.
>>
>> In the unregister_shrinker() path, we will wait for the refcount to zero
>> before deleting the shrinker from the linked list. Here, we first took
>> the rcu lock, and then decrement the refcount of this shrinker.
>>
>>     shrink_slab                 unregister_shrinker
>>     ===========                 ===================
>> 				
>> 				/* wait for B */
>> 				wait_for_completion()
>>   rcu_read_lock()
>>
>>   shrinker_put() --> (B)
>> 				list_del_rcu()
>>                                 /* wait for rcu_read_unlock() */
>> 				kfree_rcu()
>>
>>   /*
>>    * so this shrinker will not be freed here,
>>    * and can be used to traverse the next node
>>    * normally?
>>    */
>>   list_for_each_entry()
>>
>>   shrinker_try_get()
>>   rcu_read_unlock()
>>
>> Did I miss something?
> 
> After calling rcu_read_unlock(), the next shrinker in the list can be freed,
> so in the next iteration, use after free might happen?
> 
> Is that right?

IIUC, are you talking about the following scenario?

      shrink_slab                 unregister_shrinker a
      ===========                 =====================
		

    rcu_read_unlock()	

    /* next *shrinker b* was
     * removed from shrinker_list.
     */
				/* wait for B */
				wait_for_completion()
    rcu_read_lock()

    shrinker_put() --> (B)
				list_del_rcu()
                                  /* wait for rcu_read_unlock() */
				kfree_rcu()

    list_for_each_entry()

    shrinker_try_get()
    rcu_read_unlock()

When the next *shrinker b* is deleted, *shrinker a* has not yet been
removed from the shrinker_list, so list_del_rcu() points a->next at
b->next. Then in the next iteration, we will get b->next instead of b?


> 
>>
>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>>>> +		shrinker_put(shrinker);
>>>> +	}
>>>> +	rcu_read_unlock();
>>>>   	cond_resched();
>>>>   	return freed;
>>>>   }
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file
  2023-06-22  8:53 ` [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file Qi Zheng
  2023-06-22 14:53   ` Vlastimil Babka
@ 2023-06-23  5:25   ` Sergey Senozhatsky
  2023-06-23 13:24     ` Qi Zheng
  1 sibling, 1 reply; 55+ messages in thread
From: Sergey Senozhatsky @ 2023-06-23  5:25 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

On (23/06/22 16:53), Qi Zheng wrote:
> +/*
> + * Remove one
> + */
> +void unregister_shrinker(struct shrinker *shrinker)
> +{
> +	struct dentry *debugfs_entry;
> +	int debugfs_id;
> +
> +	if (!(shrinker->flags & SHRINKER_REGISTERED))
> +		return;
> +
> +	shrinker_put(shrinker);
> +	wait_for_completion(&shrinker->completion_wait);
> +
> +	mutex_lock(&shrinker_mutex);
> +	list_del_rcu(&shrinker->list);

Should this function wait for RCU grace period(s) before it goes
touching shrinker fields?

> +	shrinker->flags &= ~SHRINKER_REGISTERED;
> +	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> +		unregister_memcg_shrinker(shrinker);
> +	debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
> +	mutex_unlock(&shrinker_mutex);
> +
> +	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
> +
> +	kfree(shrinker->nr_deferred);
> +	shrinker->nr_deferred = NULL;
> +}
> +EXPORT_SYMBOL(unregister_shrinker);

[..]

> +void shrinker_free(struct shrinker *shrinker)
> +{
> +	kfree(shrinker);
> +}
> +EXPORT_SYMBOL(shrinker_free);
> +
> +void unregister_and_free_shrinker(struct shrinker *shrinker)
> +{
> +	unregister_shrinker(shrinker);
> +	kfree_rcu(shrinker, rcu);
> +}

Seems like this

	unregister_shrinker();
	shrinker_free();

is not exact equivalent of this

	unregister_and_free_shrinker();

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker
  2023-06-22  8:53 ` [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker Qi Zheng
@ 2023-06-23  6:12   ` Dave Chinner
  2023-06-23 12:49     ` Qi Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2023-06-23  6:12 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, vbabka, roman.gushchin, djwong, brauner, paulmck,
	tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

On Thu, Jun 22, 2023 at 04:53:08PM +0800, Qi Zheng wrote:
> Introduce some helpers for dynamically allocating a shrinker instance,
> and their uses are as follows:
> 
> 1. shrinker_alloc_and_init()
> 
> Used to allocate and initialize a shrinker instance, the priv_data
> parameter is used to pass the pointer of the previously embedded
> structure of the shrinker instance.
> 
> 2. shrinker_free()
> 
> Used to free the shrinker instance when the registration of shrinker
> fails.
> 
> 3. unregister_and_free_shrinker()
> 
> Used to unregister and free the shrinker instance, and the kfree()
> will be changed to kfree_rcu() later.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/shrinker.h | 12 ++++++++++++
>  mm/vmscan.c              | 35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 43e6fcabbf51..8e9ba6fa3fcc 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -107,6 +107,18 @@ extern void unregister_shrinker(struct shrinker *shrinker);
>  extern void free_prealloced_shrinker(struct shrinker *shrinker);
>  extern void synchronize_shrinkers(void);
>  
> +typedef unsigned long (*count_objects_cb)(struct shrinker *s,
> +					  struct shrink_control *sc);
> +typedef unsigned long (*scan_objects_cb)(struct shrinker *s,
> +					 struct shrink_control *sc);
> +
> +struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
> +					 scan_objects_cb scan, long batch,
> +					 int seeks, unsigned flags,
> +					 void *priv_data);
> +void shrinker_free(struct shrinker *shrinker);
> +void unregister_and_free_shrinker(struct shrinker *shrinker);

Hmmmm. Not exactly how I envisioned this to be done.

Ok, this will definitely work, but I don't think it is an
improvement. It's certainly not what I was thinking of when I
suggested dynamically allocating shrinkers.

The main issue is that this doesn't simplify the API - it expands it
and creates a minefield of old and new functions that have to be
used in exactly the right order for the right things to happen.

What I was thinking of was moving the entire shrinker setup code
over to the prealloc/register_prepared() algorithm, where the setup
is already separated from the activation of the shrinker.

That is, we start by renaming prealloc_shrinker() to
shrinker_alloc(), adding a flags field to tell it everything that it
needs to alloc (i.e. the NUMA/MEMCG_AWARE flags) and having it
return a fully allocated shrinker ready to register. Initially
this also contains an internal flag to say the shrinker was
allocated so that unregister_shrinker() knows to free it.

The caller then fills out the shrinker functions, seeks, etc. just
like they do now, and then calls register_shrinker_prepared() to make
the shrinker active when it wants to turn it on.

When it is time to tear down the shrinker, no API needs to change.
unregister_shrinker() does all the shutdown and frees all the
internal memory like it does now. If the shrinker is also marked as
allocated, it frees the shrinker via RCU, too.

Once everything is converted to this API, we then remove
register_shrinker(), rename register_shrinker_prepared() to
shrinker_register(), rename unregister_shrinker to
shrinker_unregister(), get rid of the internal "allocated" flag
and always free the shrinker.

At the end of the patchset, every shrinker should be set
up in a manner like this:


	sb->shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE|SHRINKER_NUMA_AWARE,
				"sb-%s", type->name);
	if (!sb->shrinker)
		return -ENOMEM;

	sb->shrinker->count_objects = super_cache_count;
	sb->shrinker->scan_objects = super_cache_scan;
	sb->shrinker->batch = 1024;
	sb->shrinker->private = sb;

	.....

	shrinker_register(sb->shrinker);

And teardown is just a call to shrinker_unregister(sb->shrinker)
as it is now.

i.e. the entire shrinker registration API is now just three
functions, down from the current four, and much simpler than the
seven functions this patch set results in...
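
In other words, the only things callers would ever see are roughly the
following (prototypes sketched from the example above, not final
signatures):

	struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
	void shrinker_register(struct shrinker *shrinker);
	void shrinker_unregister(struct shrinker *shrinker);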

The other advantage of this is that it will break all the existing
out of tree code and third party modules using the old API and will
no longer work with a kernel using lockless slab shrinkers. They
need to break (both at the source and binary levels) to stop bad
things from happening due to using unconverted shrinkers in the new
setup.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-22 15:12   ` Vlastimil Babka
  2023-06-22 16:42     ` Qi Zheng
@ 2023-06-23  6:29     ` Dave Chinner
  2023-06-23 13:10       ` Qi Zheng
  2023-07-03 16:39       ` Paul E. McKenney
  1 sibling, 2 replies; 55+ messages in thread
From: Dave Chinner @ 2023-06-23  6:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Qi Zheng, akpm, tkhai, roman.gushchin, djwong, brauner, paulmck,
	tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
> > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> >  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> >  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> >  
> > -	if (!down_read_trylock(&shrinker_rwsem))
> > -		goto out;
> > -
> > -	list_for_each_entry(shrinker, &shrinker_list, list) {
> > +	rcu_read_lock();
> > +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> >  		struct shrink_control sc = {
> >  			.gfp_mask = gfp_mask,
> >  			.nid = nid,
> >  			.memcg = memcg,
> >  		};
> >  
> > +		if (!shrinker_try_get(shrinker))
> > +			continue;
> > +		rcu_read_unlock();
> 
> I don't think you can do this unlock?
> 
> > +
> >  		ret = do_shrink_slab(&sc, shrinker, priority);
> >  		if (ret == SHRINK_EMPTY)
> >  			ret = 0;
> >  		freed += ret;
> > -		/*
> > -		 * Bail out if someone want to register a new shrinker to
> > -		 * prevent the registration from being stalled for long periods
> > -		 * by parallel ongoing shrinking.
> > -		 */
> > -		if (rwsem_is_contended(&shrinker_rwsem)) {
> > -			freed = freed ? : 1;
> > -			break;
> > -		}
> > -	}
> >  
> > -	up_read(&shrinker_rwsem);
> > -out:
> > +		rcu_read_lock();
> 
> That new rcu_read_lock() won't help AFAIK, the whole
> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> safe.

Yeah, that's the pattern we've been taught and the one we can look
at and immediately say "this is safe".

This is a different pattern, as has been explained by Qi, and I
think it *might* be safe.

*However.*

Right now I don't have time to go through a novel RCU list iteration
pattern one step at a time to determine the correctness of the
algorithm. I'm mostly worried about list manipulations that can
occur outside rcu_read_lock() section bleeding into the RCU
critical section because rcu_read_lock() by itself is not a memory
barrier.

Maybe Paul has seen this pattern often enough he could simply tell
us what conditions it is safe in. But for me to work that out from
first principles? I just don't have the time to do that right now.

> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.

Yes, I suggested the IDR route because radix tree lookups under RCU
with reference counted objects are a known safe pattern that we can
easily confirm is correct or not.  Hence I suggested the unification
+ IDR route because it makes the life of reviewers so, so much
easier...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker
  2023-06-23  6:12   ` Dave Chinner
@ 2023-06-23 12:49     ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-23 12:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: akpm, tkhai, vbabka, roman.gushchin, djwong, brauner, paulmck,
	tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

Hi Dave,

On 2023/6/23 14:12, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 04:53:08PM +0800, Qi Zheng wrote:
>> Introduce some helpers for dynamically allocating a shrinker instance,
>> and their uses are as follows:
>>
>> 1. shrinker_alloc_and_init()
>>
>> Used to allocate and initialize a shrinker instance, the priv_data
>> parameter is used to pass the pointer of the previously embedded
>> structure of the shrinker instance.
>>
>> 2. shrinker_free()
>>
>> Used to free the shrinker instance when the registration of shrinker
>> fails.
>>
>> 3. unregister_and_free_shrinker()
>>
>> Used to unregister and free the shrinker instance, and the kfree()
>> will be changed to kfree_rcu() later.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   include/linux/shrinker.h | 12 ++++++++++++
>>   mm/vmscan.c              | 35 +++++++++++++++++++++++++++++++++++
>>   2 files changed, 47 insertions(+)
>>
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 43e6fcabbf51..8e9ba6fa3fcc 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -107,6 +107,18 @@ extern void unregister_shrinker(struct shrinker *shrinker);
>>   extern void free_prealloced_shrinker(struct shrinker *shrinker);
>>   extern void synchronize_shrinkers(void);
>>   
>> +typedef unsigned long (*count_objects_cb)(struct shrinker *s,
>> +					  struct shrink_control *sc);
>> +typedef unsigned long (*scan_objects_cb)(struct shrinker *s,
>> +					 struct shrink_control *sc);
>> +
>> +struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
>> +					 scan_objects_cb scan, long batch,
>> +					 int seeks, unsigned flags,
>> +					 void *priv_data);
>> +void shrinker_free(struct shrinker *shrinker);
>> +void unregister_and_free_shrinker(struct shrinker *shrinker);
> 
> Hmmmm. Not exactly how I envisioned this to be done.
> 
> Ok, this will definitely work, but I don't think it is an
> improvement. It's certainly not what I was thinking of when I
> suggested dynamically allocating shrinkers.
> 
> The main issue is that this doesn't simplify the API - it expands it
> and creates a minefield of old and new functions that have to be
> used in exactly the right order for the right things to happen.
> 
> What I was thinking of was moving the entire shrinker setup code
> over to the prealloc/register_prepared() algorithm, where the setup
> is already separated from the activation of the shrinker.
> 
> That is, we start by renaming prealloc_shrinker() to
> shrinker_alloc(), adding a flags field to tell it everything that it
> needs to alloc (i.e. the NUMA/MEMCG_AWARE flags) and having it
> return a fully allocated shrinker ready to register. Initially
> this also contains an internal flag to say the shrinker was
> allocated so that unregister_shrinker() knows to free it.
> 
> The caller then fills out the shrinker functions, seeks, etc. just
> like they do now, and then calls register_shrinker_prepared() to make
> the shrinker active when it wants to turn it on.
> 
> When it is time to tear down the shrinker, no API needs to change.
> unregister_shrinker() does all the shutdown and frees all the
> internal memory like it does now. If the shrinker is also marked as
> allocated, it frees the shrinker via RCU, too.
> 
> Once everything is converted to this API, we then remove
> register_shrinker(), rename register_shrinker_prepared() to
> shrinker_register(), rename unregister_shrinker to
> shrinker_unregister(), get rid of the internal "allocated" flag
> and always free the shrinker.

IIUC, you mean that we also need to convert the original statically
defined shrinker instances to dynamically allocated ones.

I think this is a good idea, as it helps to simplify the APIs and also
removes the special handling for cases a and b (mentioned in the cover
letter).

> 
> At the end of the patchset, every shrinker should be set
> up in a manner like this:
> 
> 
> 	sb->shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE|SHRINKER_NUMA_AWARE,
> 				"sb-%s", type->name);
> 	if (!sb->shrinker)
> 		return -ENOMEM;
> 
> 	sb->shrinker->count_objects = super_cache_count;
> 	sb->shrinker->scan_objects = super_cache_scan;
> 	sb->shrinker->batch = 1024;
> 	sb->shrinker->private = sb;
> 
> 	.....
> 
> 	shrinker_register(sb->shrinker);
> 
> And teardown is just a call to shrinker_unregister(sb->shrinker)
> as it is now.
> 
> i.e. the entire shrinker registration API is now just three
> functions, down from the current four, and much simpler than the
> seven functions this patch set results in...
> 
> The other advantage of this is that it will break all the existing
> out of tree code and third party modules using the old API and will
> no longer work with a kernel using lockless slab shrinkers. They
> need to break (both at the source and binary levels) to stop bad
> things from happening due to using unconverted shrinkers in the new
> setup.

Got it. And totally agree.

I will do it in the v2.

Thanks,
Qi

> 
> -Dave.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [External] Re: [PATCH 01/29] mm: shrinker: add shrinker::private_data field
  2023-06-22 14:47   ` Vlastimil Babka
@ 2023-06-23 12:50     ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-23 12:50 UTC (permalink / raw)
  To: Vlastimil Babka, akpm, david, tkhai, roman.gushchin, djwong,
	brauner, paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs

Hi Vlastimil,

On 2023/6/22 22:47, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
>> To prepare for the dynamic allocation of shrinker instances
>> embedded in other structures, add a private_data field to
>> struct shrinker, so that we can use shrinker::private_data
>> to record and get the original embedded structure.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> I would fold this to 02/29, less churn.

OK, I will fold this to 02/29 in the v2.

Thanks,
Qi

> 
>> ---
>>   include/linux/shrinker.h | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
>> index 224293b2dd06..43e6fcabbf51 100644
>> --- a/include/linux/shrinker.h
>> +++ b/include/linux/shrinker.h
>> @@ -70,6 +70,8 @@ struct shrinker {
>>   	int seeks;	/* seeks to recreate an obj */
>>   	unsigned flags;
>>   
>> +	void *private_data;
>> +
>>   	/* These are for internal use */
>>   	struct list_head list;
>>   #ifdef CONFIG_MEMCG
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-23  6:29     ` Dave Chinner
@ 2023-06-23 13:10       ` Qi Zheng
  2023-06-23 22:19         ` Dave Chinner
  2023-07-03 16:39       ` Paul E. McKenney
  1 sibling, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-06-23 13:10 UTC (permalink / raw)
  To: Dave Chinner, Vlastimil Babka, paulmck
  Cc: akpm, tkhai, roman.gushchin, djwong, brauner, tytso,
	linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs



On 2023/6/23 14:29, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>> On 6/22/23 10:53, Qi Zheng wrote:
>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>   
>>> -	if (!down_read_trylock(&shrinker_rwsem))
>>> -		goto out;
>>> -
>>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>>> +	rcu_read_lock();
>>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>   		struct shrink_control sc = {
>>>   			.gfp_mask = gfp_mask,
>>>   			.nid = nid,
>>>   			.memcg = memcg,
>>>   		};
>>>   
>>> +		if (!shrinker_try_get(shrinker))
>>> +			continue;
>>> +		rcu_read_unlock();
>>
>> I don't think you can do this unlock?
>>
>>> +
>>>   		ret = do_shrink_slab(&sc, shrinker, priority);
>>>   		if (ret == SHRINK_EMPTY)
>>>   			ret = 0;
>>>   		freed += ret;
>>> -		/*
>>> -		 * Bail out if someone want to register a new shrinker to
>>> -		 * prevent the registration from being stalled for long periods
>>> -		 * by parallel ongoing shrinking.
>>> -		 */
>>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>>> -			freed = freed ? : 1;
>>> -			break;
>>> -		}
>>> -	}
>>>   
>>> -	up_read(&shrinker_rwsem);
>>> -out:
>>> +		rcu_read_lock();
>>
>> That new rcu_read_lock() won't help AFAIK, the whole
>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>> safe.
> 
> Yeah, that's the pattern we've been taught and the one we can look
> at and immediately say "this is safe".
> 
> This is a different pattern, as has been explained by Qi, and I
> think it *might* be safe.
> 
> *However.*
> 
> Right now I don't have time to go through a novel RCU list iteration
> pattern one step at a time to determine the correctness of the
> algorithm. I'm mostly worried about list manipulations that can
> occur outside rcu_read_lock() section bleeding into the RCU
> critical section because rcu_read_lock() by itself is not a memory
> barrier.
> 
> Maybe Paul has seen this pattern often enough he could simply tell
> us what conditions it is safe in. But for me to work that out from
> first principles? I just don't have the time to do that right now.

Hi Paul, can you help to confirm this?

> 
>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
> 
> Yes, I suggested the IDR route because radix tree lookups under RCU
> with reference counted objects are a known safe pattern that we can
> easily confirm is correct or not.  Hence I suggested the unification
> + IDR route because it makes the life of reviewers so, so much
> easier...

In fact, I originally planned to try the unification + IDR method you
suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
the struct mem_cgroup is not even defined, and root_mem_cgroup and
shrinker_info will not be allocated. This required more code changes, so
I ended up keeping the shrinker_list and implementing the above pattern.

If the above pattern is not safe, I will go back to the unification +
IDR method.

Thanks,
Qi

> 
> Cheers,
> 
> Dave.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file
  2023-06-22 14:53   ` Vlastimil Babka
@ 2023-06-23 13:12     ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-23 13:12 UTC (permalink / raw)
  To: Vlastimil Babka, akpm, david, tkhai, roman.gushchin, djwong,
	brauner, paulmck, tytso
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs

Hi Vlastimil,

On 2023/6/22 22:53, Vlastimil Babka wrote:
> On 6/22/23 10:53, Qi Zheng wrote:
>> The mm/vmscan.c file is too large, so separate the shrinker-related
>> code from it into a separate file. No functional changes.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Maybe do this move as patch 01 so the further changes are done in the new
> file already?

Sure, I will do it in the v2.

Thanks,
Qi

> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file
  2023-06-23  5:25   ` Sergey Senozhatsky
@ 2023-06-23 13:24     ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-23 13:24 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

Hi Sergey,

On 2023/6/23 13:25, Sergey Senozhatsky wrote:
> On (23/06/22 16:53), Qi Zheng wrote:
>> +/*
>> + * Remove one
>> + */
>> +void unregister_shrinker(struct shrinker *shrinker)
>> +{
>> +	struct dentry *debugfs_entry;
>> +	int debugfs_id;
>> +
>> +	if (!(shrinker->flags & SHRINKER_REGISTERED))
>> +		return;
>> +
>> +	shrinker_put(shrinker);
>> +	wait_for_completion(&shrinker->completion_wait);
>> +
>> +	mutex_lock(&shrinker_mutex);
>> +	list_del_rcu(&shrinker->list);
> 
> Should this function wait for RCU grace period(s) before it goes
> touching shrinker fields?

Why? We will free this shrinker instance via RCU after
unregister_shrinker() has run, so it is safe to touch the shrinker
fields here.

> 
>> +	shrinker->flags &= ~SHRINKER_REGISTERED;
>> +	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>> +		unregister_memcg_shrinker(shrinker);
>> +	debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
>> +	mutex_unlock(&shrinker_mutex);
>> +
>> +	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
>> +
>> +	kfree(shrinker->nr_deferred);
>> +	shrinker->nr_deferred = NULL;
>> +}
>> +EXPORT_SYMBOL(unregister_shrinker);
> 
> [..]
> 
>> +void shrinker_free(struct shrinker *shrinker)
>> +{
>> +	kfree(shrinker);
>> +}
>> +EXPORT_SYMBOL(shrinker_free);
>> +
>> +void unregister_and_free_shrinker(struct shrinker *shrinker)
>> +{
>> +	unregister_shrinker(shrinker);
>> +	kfree_rcu(shrinker, rcu);
>> +}
> 
> Seems like this
> 
> 	unregister_shrinker();
> 	shrinker_free();
> 
> is not exact equivalent of this
> 
> 	unregister_and_free_shrinker();

Yes, my original intention was that shrinker_free() is only used to
handle the case where register_shrinker() fails.

I will implement the method suggested by Dave in 02/29. Those APIs are
more concise and will bring more benefits. :)

Thanks,
Qi




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker
  2023-06-22  8:53 ` [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker Qi Zheng
@ 2023-06-23 13:33   ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-23 13:33 UTC (permalink / raw)
  To: Steven Price
  Cc: linux-kernel, linux-mm, intel-gfx, dri-devel, linux-arm-msm,
	dm-devel, linux-raid, linux-bcache, virtualization,
	linux-fsdevel, linux-ext4, linux-nfs, linux-xfs, linux-btrfs,
	akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso

Hi Steven,

The email you replied to was the failed version (due to the error
below), so I copied your reply and replied to you on this successful
version.

(4.7.1 Error: too many recipients from 49.7.199.173)

On 2023/6/23 18:01, Steven Price wrote:
 > On 22/06/2023 09:39, Qi Zheng wrote:
 >> From: Qi Zheng <zhengqi.arch@bytedance.com>
 >>
 >> In preparation for implementing lockless slab shrink,
 >> we need to dynamically allocate the drm-panfrost shrinker,
 >> so that it can be freed asynchronously using kfree_rcu().
 >> Then it doesn't need to wait for RCU read-side critical
 >> section when releasing the struct panfrost_device.
 >>
 >> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
 >> ---
 >>   drivers/gpu/drm/panfrost/panfrost_device.h    |  2 +-
 >>   .../gpu/drm/panfrost/panfrost_gem_shrinker.c  | 24 ++++++++++---------
 >>   2 files changed, 14 insertions(+), 12 deletions(-)
 >>
 >> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
 >> index b0126b9fbadc..e667e5689353 100644
 >> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
 >> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
 >> @@ -118,7 +118,7 @@ struct panfrost_device {
 >>
 >>   	struct mutex shrinker_lock;
 >>   	struct list_head shrinker_list;
 >> -	struct shrinker shrinker;
 >> +	struct shrinker *shrinker;
 >>
 >>   	struct panfrost_devfreq pfdevfreq;
 >>   };
 >> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
 >> index bf0170782f25..2a5513eb9e1f 100644
 >> --- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
 >> +++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
 >> @@ -18,8 +18,7 @@
 >>   static unsigned long
 >>   panfrost_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 >>   {
 >> -	struct panfrost_device *pfdev =
 >> -		container_of(shrinker, struct panfrost_device, shrinker);
 >> +	struct panfrost_device *pfdev = shrinker->private_data;
 >>   	struct drm_gem_shmem_object *shmem;
 >>   	unsigned long count = 0;
 >>
 >> @@ -65,8 +64,7 @@ static bool panfrost_gem_purge(struct drm_gem_object *obj)
 >>   static unsigned long
 >>   panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 >>   {
 >> -	struct panfrost_device *pfdev =
 >> -		container_of(shrinker, struct panfrost_device, shrinker);
 >> +	struct panfrost_device *pfdev = shrinker->private_data;
 >>   	struct drm_gem_shmem_object *shmem, *tmp;
 >>   	unsigned long freed = 0;
 >>
 >> @@ -100,10 +98,15 @@ panfrost_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
 >>   void panfrost_gem_shrinker_init(struct drm_device *dev)
 >>   {
 >>   	struct panfrost_device *pfdev = dev->dev_private;
 >> -	pfdev->shrinker.count_objects = panfrost_gem_shrinker_count;
 >> -	pfdev->shrinker.scan_objects = panfrost_gem_shrinker_scan;
 >> -	pfdev->shrinker.seeks = DEFAULT_SEEKS;
 >> -	WARN_ON(register_shrinker(&pfdev->shrinker, "drm-panfrost"));
 >> +
 >> +	pfdev->shrinker = shrinker_alloc_and_init(panfrost_gem_shrinker_count,
 >> +						  panfrost_gem_shrinker_scan, 0,
 >> +						  DEFAULT_SEEKS, 0, pfdev);
 >> +	if (pfdev->shrinker &&
 >> +	    register_shrinker(pfdev->shrinker, "drm-panfrost")) {
 >> +		shrinker_free(pfdev->shrinker);
 >> +		WARN_ON(1);
 >> +	}
 >
 > So we didn't have good error handling here before, but this is
 > significantly worse. Previously if register_shrinker() failed then the
 > driver could safely continue without a shrinker - it would waste memory
 > but still function.
 >
 > However we now have two failure conditions:
 >   * shrinker_alloc_and_init() returns NULL. No warning and NULL dereferences
 >     will happen later.
 >
 >   * register_shrinker() fails, shrinker_free() will free pfdev->shrinker
 >     we get a warning, but followed by a use-after-free later.
 >
 > I think we need to modify panfrost_gem_shrinker_init() to be able to
 > return an error, so something like the change below (untested) before
 > your change.

Indeed. I will fix it in the v2.

Thanks,
Qi

 >
 > Steve
 >
 > ----8<---
 > diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c
 > b/drivers/gpu/drm/panfrost/panfrost_drv.c
 > index bbada731bbbd..f705bbdea360 100644
 > --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
 > +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
 > @@ -598,10 +598,14 @@ static int panfrost_probe(struct platform_device
 > *pdev)
 >   	if (err < 0)
 >   		goto err_out1;
 >
 > -	panfrost_gem_shrinker_init(ddev);
 > +	err = panfrost_gem_shrinker_init(ddev);
 > +	if (err)
 > +		goto err_out2;
 >
 >   	return 0;
 >
 > +err_out2:
 > +	drm_dev_unregister(ddev);
 >   err_out1:
 >   	pm_runtime_disable(pfdev->dev);
 >   	panfrost_device_fini(pfdev);
 > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h
 > b/drivers/gpu/drm/panfrost/panfrost_gem.h
 > index ad2877eeeccd..863d2ec8d4f0 100644
 > --- a/drivers/gpu/drm/panfrost/panfrost_gem.h
 > +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
 > @@ -81,7 +81,7 @@ panfrost_gem_mapping_get(struct panfrost_gem_object *bo,
 >   void panfrost_gem_mapping_put(struct panfrost_gem_mapping *mapping);
 >   void panfrost_gem_teardown_mappings_locked(struct panfrost_gem_object *bo);
 >
 > -void panfrost_gem_shrinker_init(struct drm_device *dev);
 > +int panfrost_gem_shrinker_init(struct drm_device *dev);
 >   void panfrost_gem_shrinker_cleanup(struct drm_device *dev);
 >
 >   #endif /* __PANFROST_GEM_H__ */
 > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
 > b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
 > index bf0170782f25..90265b37636f 100644
 > --- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
 > +++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
 > @@ -97,13 +97,17 @@ panfrost_gem_shrinker_scan(struct shrinker
 > *shrinker, struct shrink_control *sc)
 >    *
 >    * This function registers and sets up the panfrost shrinker.
 >    */
 > -void panfrost_gem_shrinker_init(struct drm_device *dev)
 > +int panfrost_gem_shrinker_init(struct drm_device *dev)
 >   {
 >   	struct panfrost_device *pfdev = dev->dev_private;
 > +	int ret;
 > +
 >   	pfdev->shrinker.count_objects = panfrost_gem_shrinker_count;
 >   	pfdev->shrinker.scan_objects = panfrost_gem_shrinker_scan;
 >   	pfdev->shrinker.seeks = DEFAULT_SEEKS;
 > -	WARN_ON(register_shrinker(&pfdev->shrinker, "drm-panfrost"));
 > +	ret = register_shrinker(&pfdev->shrinker, "drm-panfrost");
 > +
 > +	return ret;
 >   }
 >
 >   /**
 >


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker
  2023-06-22  8:53 ` [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker Qi Zheng
@ 2023-06-23 21:49   ` Chuck Lever
  2023-06-24 11:17     ` Qi Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Chuck Lever @ 2023-06-23 21:49 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso, linux-bcache, linux-xfs, linux-nfs,
	linux-arm-msm, intel-gfx, linux-kernel, dri-devel,
	virtualization, linux-raid, linux-mm, dm-devel, linux-fsdevel,
	linux-ext4, linux-btrfs

On Thu, Jun 22, 2023 at 04:53:21PM +0800, Qi Zheng wrote:
> In preparation for implementing lockless slab shrink,
> we need to dynamically allocate the nfsd-client shrinker,
> so that it can be freed asynchronously using kfree_rcu().
> Then it doesn't need to wait for RCU read-side critical
> section when releasing the struct nfsd_net.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

For 15/29 and 16/29 of this series:

Acked-by: Chuck Lever <chuck.lever@oracle.com>


> ---
>  fs/nfsd/netns.h     |  2 +-
>  fs/nfsd/nfs4state.c | 20 ++++++++++++--------
>  2 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index ec49b200b797..f669444d5336 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -195,7 +195,7 @@ struct nfsd_net {
>  	int			nfs4_max_clients;
>  
>  	atomic_t		nfsd_courtesy_clients;
> -	struct shrinker		nfsd_client_shrinker;
> +	struct shrinker		*nfsd_client_shrinker;
>  	struct work_struct	nfsd_shrinker_work;
>  };
>  
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 6e61fa3acaf1..a06184270548 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4388,8 +4388,7 @@ static unsigned long
>  nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
>  	int count;
> -	struct nfsd_net *nn = container_of(shrink,
> -			struct nfsd_net, nfsd_client_shrinker);
> +	struct nfsd_net *nn = shrink->private_data;
>  
>  	count = atomic_read(&nn->nfsd_courtesy_clients);
>  	if (!count)
> @@ -8094,14 +8093,19 @@ static int nfs4_state_create_net(struct net *net)
>  	INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
>  	get_net(net);
>  
> -	nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
> -	nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
> -	nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
> -
> -	if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
> +	nn->nfsd_client_shrinker = shrinker_alloc_and_init(nfsd4_state_shrinker_count,
> +							   nfsd4_state_shrinker_scan,
> +							   0, DEFAULT_SEEKS, 0,
> +							   nn);
> +	if (!nn->nfsd_client_shrinker)
>  		goto err_shrinker;
> +
> +	if (register_shrinker(nn->nfsd_client_shrinker, "nfsd-client"))
> +		goto err_register;
>  	return 0;
>  
> +err_register:
> +	shrinker_free(nn->nfsd_client_shrinker);
>  err_shrinker:
>  	put_net(net);
>  	kfree(nn->sessionid_hashtbl);
> @@ -8197,7 +8201,7 @@ nfs4_state_shutdown_net(struct net *net)
>  	struct list_head *pos, *next, reaplist;
>  	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>  
> -	unregister_shrinker(&nn->nfsd_client_shrinker);
> +	unregister_and_free_shrinker(nn->nfsd_client_shrinker);
>  	cancel_work(&nn->nfsd_shrinker_work);
>  	cancel_delayed_work_sync(&nn->laundromat_work);
>  	locks_end_grace(&nn->nfsd4_manager);
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-23 13:10       ` Qi Zheng
@ 2023-06-23 22:19         ` Dave Chinner
  2023-06-24 11:08           ` Qi Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Chinner @ 2023-06-23 22:19 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Vlastimil Babka, paulmck, akpm, tkhai, roman.gushchin, djwong,
	brauner, tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
> On 2023/6/23 14:29, Dave Chinner wrote:
> > On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > > On 6/22/23 10:53, Qi Zheng wrote:
> > Yes, I suggested the IDR route because radix tree lookups under RCU
> > with reference counted objects are a known safe pattern that we can
> > easily confirm is correct or not.  Hence I suggested the unification
> > + IDR route because it makes the life of reviewers so, so much
> > easier...
> 
> In fact, I originally planned to try the unification + IDR method you
> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
> the struct mem_cgroup is not even defined, and root_mem_cgroup and
> shrinker_info will not be allocated.  This required more code changes, so
> I ended up keeping the shrinker_list and implementing the above pattern.

Yes. Go back and read what I originally said needed to be done
first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
to exist that holds all of the global shrinkers. Then shrink_slab()
is only ever passed a memcg that should be iterated.

Yes, it needs changes external to the shrinker code itself to be
made to work. And even if memcg's are not enabled, we can still use
the memcg structures to ensure a common abstraction is used for the
shrinker tracking infrastructure....

> If the above pattern is not safe, I will go back to the unification +
> IDR method.

And that is exactly how we got into this mess in the first place....

-Dave
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-23 22:19         ` Dave Chinner
@ 2023-06-24 11:08           ` Qi Zheng
  2023-06-25  3:15             ` Qi Zheng
  2023-07-04  4:20             ` Qi Zheng
  0 siblings, 2 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-24 11:08 UTC (permalink / raw)
  To: Dave Chinner, paulmck
  Cc: Vlastimil Babka, akpm, tkhai, roman.gushchin, djwong, brauner,
	tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

Hi Dave,

On 2023/6/24 06:19, Dave Chinner wrote:
> On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
>> On 2023/6/23 14:29, Dave Chinner wrote:
>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>> with reference counted objects are a known safe pattern that we can
>>> easily confirm is correct or not.  Hence I suggested the unification
>>> + IDR route because it makes the life of reviewers so, so much
>>> easier...
>>
>> In fact, I originally planned to try the unification + IDR method you
>> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
>> the struct mem_cgroup is not even defined, and root_mem_cgroup and
>> shrinker_info will not be allocated.  This required more code changes, so
>> I ended up keeping the shrinker_list and implementing the above pattern.
> 
> Yes. Go back and read what I originally said needed to be done
> first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
> to exist that holds all of the global shrinkers. Then shrink_slab()
> is only ever passed a memcg that should be iterated.
> 
> Yes, it needs changes external to the shrinker code itself to be
> made to work. And even if memcg's are not enabled, we can still use
> the memcg structures to ensure a common abstraction is used for the
> shrinker tracking infrastructure....

Yeah, what I imagined before was to define a more concise struct
mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root
memcg on system boot:

#ifndef CONFIG_MEMCG

struct shrinker_info {
	struct rcu_head rcu;
	atomic_long_t *nr_deferred;
	unsigned long *map;
	int map_nr_max;
};

struct mem_cgroup_per_node {
	struct shrinker_info __rcu	*shrinker_info;
};

struct mem_cgroup {
	struct mem_cgroup_per_node *nodeinfo[];
};

#endif
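
And the dummy root memcg could then be allocated at boot time roughly
like this (just a sketch under the layout above, assuming
root_mem_cgroup is also defined for the !CONFIG_MEMCG case; the init
hook, allocation details and error unwinding are all assumptions):

	static int __init dummy_root_memcg_init(void)
	{
		struct mem_cgroup *memcg;
		int nid;

		/* nodeinfo[] is a flexible array in the sketch above */
		memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids),
				GFP_KERNEL);
		if (!memcg)
			return -ENOMEM;

		for_each_node(nid) {
			memcg->nodeinfo[nid] =
				kzalloc_node(sizeof(*memcg->nodeinfo[nid]),
					     GFP_KERNEL, nid);
			if (!memcg->nodeinfo[nid])
				return -ENOMEM;	/* unwinding omitted here */
		}

		root_mem_cgroup = memcg;
		return 0;
	}
	early_initcall(dummy_root_memcg_init);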

But I have a concern: if all global shrinkers are tracked via the
info->map of the root memcg, a shrinker->id needs to be assigned to
each of them, which will cause info->map_nr_max to become larger than
before, thus making the traversal of info->map slower.

> 
>> If the above pattern is not safe, I will go back to the unification +
>> IDR method.
> 
> And that is exactly how we got into this mess in the first place....

I only found one similar pattern in the kernel:

fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table

But IIUC, the refcount here needs to be decremented while holding the
rcu lock, as I did above.

So regardless of whether we choose unification + IDR in the end, I still
want to confirm whether the pattern I implemented above is safe. :)

Thanks,
Qi

> 
> -Dave

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker
  2023-06-23 21:49   ` Chuck Lever
@ 2023-06-24 11:17     ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-24 11:17 UTC (permalink / raw)
  To: Chuck Lever
  Cc: akpm, david, tkhai, vbabka, roman.gushchin, djwong, brauner,
	paulmck, tytso, linux-bcache, linux-xfs, linux-nfs,
	linux-arm-msm, intel-gfx, linux-kernel, dri-devel,
	virtualization, linux-raid, linux-mm, dm-devel, linux-fsdevel,
	linux-ext4, linux-btrfs

Hi Chuck,

On 2023/6/24 05:49, Chuck Lever wrote:
> On Thu, Jun 22, 2023 at 04:53:21PM +0800, Qi Zheng wrote:
>> In preparation for implementing lockless slab shrink,
>> we need to dynamically allocate the nfsd-client shrinker,
>> so that it can be freed asynchronously using kfree_rcu().
>> Then it doesn't need to wait for RCU read-side critical
>> section when releasing the struct nfsd_net.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> For 15/29 and 16/29 of this series:
> 
> Acked-by: Chuck Lever <chuck.lever@oracle.com>

Thanks for your review! :)

And I will implement the APIs suggested by Dave in 02/29 in
the v2, so there will be some changes here, but it should
not be much. So I will keep your Acked-bys in the v2.

Thanks,
Qi

> 
> 
>> ---
>>   fs/nfsd/netns.h     |  2 +-
>>   fs/nfsd/nfs4state.c | 20 ++++++++++++--------
>>   2 files changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
>> index ec49b200b797..f669444d5336 100644
>> --- a/fs/nfsd/netns.h
>> +++ b/fs/nfsd/netns.h
>> @@ -195,7 +195,7 @@ struct nfsd_net {
>>   	int			nfs4_max_clients;
>>   
>>   	atomic_t		nfsd_courtesy_clients;
>> -	struct shrinker		nfsd_client_shrinker;
>> +	struct shrinker		*nfsd_client_shrinker;
>>   	struct work_struct	nfsd_shrinker_work;
>>   };
>>   
>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
>> index 6e61fa3acaf1..a06184270548 100644
>> --- a/fs/nfsd/nfs4state.c
>> +++ b/fs/nfsd/nfs4state.c
>> @@ -4388,8 +4388,7 @@ static unsigned long
>>   nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
>>   {
>>   	int count;
>> -	struct nfsd_net *nn = container_of(shrink,
>> -			struct nfsd_net, nfsd_client_shrinker);
>> +	struct nfsd_net *nn = shrink->private_data;
>>   
>>   	count = atomic_read(&nn->nfsd_courtesy_clients);
>>   	if (!count)
>> @@ -8094,14 +8093,19 @@ static int nfs4_state_create_net(struct net *net)
>>   	INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
>>   	get_net(net);
>>   
>> -	nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
>> -	nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
>> -	nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
>> -
>> -	if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
>> +	nn->nfsd_client_shrinker = shrinker_alloc_and_init(nfsd4_state_shrinker_count,
>> +							   nfsd4_state_shrinker_scan,
>> +							   0, DEFAULT_SEEKS, 0,
>> +							   nn);
>> +	if (!nn->nfsd_client_shrinker)
>>   		goto err_shrinker;
>> +
>> +	if (register_shrinker(nn->nfsd_client_shrinker, "nfsd-client"))
>> +		goto err_register;
>>   	return 0;
>>   
>> +err_register:
>> +	shrinker_free(nn->nfsd_client_shrinker);
>>   err_shrinker:
>>   	put_net(net);
>>   	kfree(nn->sessionid_hashtbl);
>> @@ -8197,7 +8201,7 @@ nfs4_state_shutdown_net(struct net *net)
>>   	struct list_head *pos, *next, reaplist;
>>   	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>>   
>> -	unregister_shrinker(&nn->nfsd_client_shrinker);
>> +	unregister_and_free_shrinker(nn->nfsd_client_shrinker);
>>   	cancel_work(&nn->nfsd_shrinker_work);
>>   	cancel_delayed_work_sync(&nn->laundromat_work);
>>   	locks_end_grace(&nn->nfsd4_manager);
>> -- 
>> 2.30.2
>>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-24 11:08           ` Qi Zheng
@ 2023-06-25  3:15             ` Qi Zheng
  2023-07-04  4:20             ` Qi Zheng
  1 sibling, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-06-25  3:15 UTC (permalink / raw)
  To: Dave Chinner, paulmck
  Cc: Vlastimil Babka, akpm, tkhai, roman.gushchin, djwong, brauner,
	tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs, RCU



On 2023/6/24 19:08, Qi Zheng wrote:
> Hi Dave,
> 
> On 2023/6/24 06:19, Dave Chinner wrote:
>> On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
>>> On 2023/6/23 14:29, Dave Chinner wrote:
>>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>>> with reference counted objects are a known safe pattern that we can
>>>> easily confirm is correct or not.  Hence I suggested the unification
>>>> + IDR route because it makes the life of reviewers so, so much
>>>> easier...
>>>
>>> In fact, I originally planned to try the unification + IDR method you
>>> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
>>> the struct mem_cgroup is not even defined, and root_mem_cgroup and
>>> shrinker_info will not be allocated.  This required more code 
>>> changes, so
>>> I ended up keeping the shrinker_list and implementing the above pattern.
>>
>> Yes. Go back and read what I originally said needed to be done
>> first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
>> to exist that holds all of the global shrinkers. Then shrink_slab()
>> is only ever passed a memcg that should be iterated.
>>
>> Yes, it needs changes external to the shrinker code itself to be
>> made to work. And even if memcg's are not enabled, we can still use
>> the memcg structures to ensure a common abstraction is used for the
>> shrinker tracking infrastructure....
> 
> Yeah, what I imagined before was to define a more concise struct
> mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root
> memcg on system boot:
> 
> #ifndef CONFIG_MEMCG
> 
> struct shrinker_info {
>      struct rcu_head rcu;
>      atomic_long_t *nr_deferred;
>      unsigned long *map;
>      int map_nr_max;
> };
> 
> struct mem_cgroup_per_node {
>      struct shrinker_info __rcu    *shrinker_info;
> };
> 
> struct mem_cgroup {
>      struct mem_cgroup_per_node *nodeinfo[];
> };
> 
> #endif
> 
> But I have a concern: if all global shrinkers are tracked via the
> info->map of the root memcg, a shrinker->id needs to be assigned to
> each of them, which will cause info->map_nr_max to become larger than
> before, thus making the traversal of info->map slower.

But most of the shrinker instances on the system are 'sb-xxx' ones, and
they all have the SHRINKER_MEMCG_AWARE flag, so it should have little
impact on the speed of traversing info->map. ;)

> 
>>
>>> If the above pattern is not safe, I will go back to the unification +
>>> IDR method.
>>
>> And that is exactly how we got into this mess in the first place....
> 
> I only found one similar pattern in the kernel:
> 
> fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table
> 
> But IIUC, the refcount here needs to be decremented while holding the
> rcu lock, as I did above.
> 
> So regardless of whether we choose unification + IDR in the end, I still
> want to confirm whether the pattern I implemented above is safe. :)

Also + RCU mailing list.

> 
> Thanks,
> Qi
> 
>>
>> -Dave

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-23  6:29     ` Dave Chinner
  2023-06-23 13:10       ` Qi Zheng
@ 2023-07-03 16:39       ` Paul E. McKenney
  2023-07-04  3:45         ` Qi Zheng
  1 sibling, 1 reply; 55+ messages in thread
From: Paul E. McKenney @ 2023-07-03 16:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vlastimil Babka, Qi Zheng, akpm, tkhai, roman.gushchin, djwong,
	brauner, tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > On 6/22/23 10:53, Qi Zheng wrote:
> > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > >  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > >  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > >  
> > > -	if (!down_read_trylock(&shrinker_rwsem))
> > > -		goto out;
> > > -
> > > -	list_for_each_entry(shrinker, &shrinker_list, list) {
> > > +	rcu_read_lock();
> > > +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> > >  		struct shrink_control sc = {
> > >  			.gfp_mask = gfp_mask,
> > >  			.nid = nid,
> > >  			.memcg = memcg,
> > >  		};
> > >  
> > > +		if (!shrinker_try_get(shrinker))
> > > +			continue;
> > > +		rcu_read_unlock();
> > 
> > I don't think you can do this unlock?

Sorry to be slow to respond here, this one fell through the cracks.
And thank you to Qi for reminding me!

If you do this unlock, you had jolly well better nail down the current
element (the one referenced by shrinker), for example, by acquiring an
explicit reference count on the object.  And presumably this is exactly
what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
at least assuming that shrinker->refcount is set to zero before the call
to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
start until *after* shrinker_put() calls complete().  Plus, as always,
the object must be removed from the list before the synchronize_rcu()
starts.  (On these parts of the puzzle, I defer to those more familiar
with this code path.  And I strongly suggest carefully commenting this
type of action-at-a-distance design pattern.)

Why is this important?  Because otherwise that object might be freed
before you get to the call to rcu_read_lock() at the end of this loop.
And if that happens, list_for_each_entry_rcu() will be walking the
freelist, which is quite bad for the health and well-being of your kernel.

There are a few other ways to make this sort of thing work:

1.	Defer the shrinker_put() to the beginning of the loop.
	You would need a flag initially set to zero, and then set to
	one just before (or just after) the rcu_read_lock() above.
	You would also need another shrinker_old pointer to track the
	old pointer.  Then at the top of the loop, if the flag is set,
	invoke shrinker_put() on shrinker_old.	This ensures that the
	previous shrinker structure stays around long enough to allow
	the loop to find the next shrinker structure in the list.

	This approach is attractive when the removal code path
	can invoke shrinker_put() after the grace period ends.

2.	Make shrinker_put() invoke call_rcu() when ->refcount reaches
	zero, and have the callback function free the object.  This of
	course requires adding an rcu_head structure to the shrinker
	structure, which might or might not be a reasonable course of
	action.  If adding that rcu_head is reasonable, this simplifies
	the logic quite a bit.

3.	For the shrinker-structure-removal code path, remove the shrinker
	structure, then remove the initial count from ->refcount,
	and then keep doing grace periods until ->refcount is zero,
	then do one more.  Of course, if the result of removing the
	initial count was zero, then only a single additional grace
	period is required.

	This would need to be carefully commented, as it is a bit
	unconventional.

There are probably many other ways, but just to give an idea of a few
other ways to do this.
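
For instance, option 2 might look roughly like this (a sketch only; the
->refcount type and the ->rcu field name are assumptions based on this
thread):

	static void shrinker_free_rcu(struct rcu_head *head)
	{
		struct shrinker *shrinker =
			container_of(head, struct shrinker, rcu);

		kfree(shrinker->nr_deferred);
		kfree(shrinker);
	}

	void shrinker_put(struct shrinker *shrinker)
	{
		/* last put frees the shrinker after a grace period */
		if (refcount_dec_and_test(&shrinker->refcount))
			call_rcu(&shrinker->rcu, shrinker_free_rcu);
	}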

> > > +
> > >  		ret = do_shrink_slab(&sc, shrinker, priority);
> > >  		if (ret == SHRINK_EMPTY)
> > >  			ret = 0;
> > >  		freed += ret;
> > > -		/*
> > > -		 * Bail out if someone want to register a new shrinker to
> > > -		 * prevent the registration from being stalled for long periods
> > > -		 * by parallel ongoing shrinking.
> > > -		 */
> > > -		if (rwsem_is_contended(&shrinker_rwsem)) {
> > > -			freed = freed ? : 1;
> > > -			break;
> > > -		}
> > > -	}
> > >  
> > > -	up_read(&shrinker_rwsem);
> > > -out:
> > > +		rcu_read_lock();
> > 
> > That new rcu_read_lock() won't help AFAIK, the whole
> > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> > safe.
> 
> Yeah, that's the pattern we've been taught and the one we can look
> at and immediately say "this is safe".
> 
> > This is a different pattern, as has been explained by Qi, and I
> think it *might* be safe.
> 
> *However.*
> 
> Right now I don't have time to go through a novel RCU list iteration
> > pattern one step at a time to determine the correctness of the
> algorithm. I'm mostly worried about list manipulations that can
> occur outside rcu_read_lock() section bleeding into the RCU
> critical section because rcu_read_lock() by itself is not a memory
> barrier.
> 
> Maybe Paul has seen this pattern often enough he could simply tell
> us what conditions it is safe in. But for me to work that out from
> first principles? I just don't have the time to do that right now.

If the code does just the right sequence of things on the removal path
(remove, decrement reference, wait for reference to go to zero, wait for
grace period, free), then it would work.  If this is what is happening,
I would argue for more comments.  ;-)

							Thanx, Paul

> > IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> > shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
> 
> Yes, I suggested the IDR route because radix tree lookups under RCU
> with reference counted objects are a known safe pattern that we can
> easily confirm is correct or not.  Hence I suggested the unification
> + IDR route because it makes the life of reviewers so, so much
> easier...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-07-03 16:39       ` Paul E. McKenney
@ 2023-07-04  3:45         ` Qi Zheng
  2023-07-05  3:27           ` Qi Zheng
  0 siblings, 1 reply; 55+ messages in thread
From: Qi Zheng @ 2023-07-04  3:45 UTC (permalink / raw)
  To: paulmck, Dave Chinner
  Cc: Vlastimil Babka, akpm, tkhai, roman.gushchin, djwong, brauner,
	tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs



On 2023/7/4 00:39, Paul E. McKenney wrote:
> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>   	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>   		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>   
>>>> -	if (!down_read_trylock(&shrinker_rwsem))
>>>> -		goto out;
>>>> -
>>>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>>>> +	rcu_read_lock();
>>>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>   		struct shrink_control sc = {
>>>>   			.gfp_mask = gfp_mask,
>>>>   			.nid = nid,
>>>>   			.memcg = memcg,
>>>>   		};
>>>>   
>>>> +		if (!shrinker_try_get(shrinker))
>>>> +			continue;
>>>> +		rcu_read_unlock();
>>>
>>> I don't think you can do this unlock?
> 
> Sorry to be slow to respond here, this one fell through the cracks.
> And thank you to Qi for reminding me!
> 
> If you do this unlock, you had jolly well better nail down the current
> element (the one referenced by shrinker), for example, by acquiring an
> explicit reference count on the object.  And presumably this is exactly
> what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
> at least assuming that shrinker->refcount is set to zero before the call
> to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
> start until *after* shrinker_put() calls complete().  Plus, as always,
> the object must be removed from the list before the synchronize_rcu()
> starts.  (On these parts of the puzzle, I defer to those more familiar
> with this code path.  And I strongly suggest carefully commenting this
> type of action-at-a-distance design pattern.)

Yeah, I think I've done it as described above. A more detailed timing diagram is
below.

> 
> Why is this important?  Because otherwise that object might be freed
> before you get to the call to rcu_read_lock() at the end of this loop.
> And if that happens, list_for_each_entry_rcu() will be walking the
> freelist, which is quite bad for the health and well-being of your kernel.
> 
> There are a few other ways to make this sort of thing work:
> 
> 1.	Defer the shrinker_put() to the beginning of the loop.
> 	You would need a flag initially set to zero, and then set to
> 	one just before (or just after) the rcu_read_lock() above.
> 	You would also need another shrinker_old pointer to track the
> 	old pointer.  Then at the top of the loop, if the flag is set,
> 	invoke shrinker_put() on shrinker_old.	This ensures that the
> 	previous shrinker structure stays around long enough to allow
> 	the loop to find the next shrinker structure in the list.
> 
> 	This approach is attractive when the removal code path
> 	can invoke shrinker_put() after the grace period ends.
> 
> 2.	Make shrinker_put() invoke call_rcu() when ->refcount reaches
> 	zero, and have the callback function free the object.  This of
> 	course requires adding an rcu_head structure to the shrinker
> 	structure, which might or might not be a reasonable course of
> 	action.  If adding that rcu_head is reasonable, this simplifies
> 	the logic quite a bit.
> 
> 3.	For the shrinker-structure-removal code path, remove the shrinker
> 	structure, then remove the initial count from ->refcount,
> 	and then keep doing grace periods until ->refcount is zero,
> 	then do one more.  Of course, if the result of removing the
> 	initial count was zero, then only a single additional grace
> 	period is required.
> 
> 	This would need to be carefully commented, as it is a bit
> 	unconventional.

Thanks for such a detailed addition!
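
For the record, my reading of option 2, as a rough sketch only -- this is
*not* what the series currently does (the series uses a completion plus
kfree_rcu()), and the trimmed struct and field names below are only
illustrative:

```
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct shrinker {
	refcount_t refcount;
	struct rcu_head rcu;	/* option 2 adds an rcu_head */
	/* ... the remaining fields are omitted in this sketch ... */
};

static void shrinker_free_rcu_cb(struct rcu_head *head)
{
	struct shrinker *shrinker = container_of(head, struct shrinker, rcu);

	kfree(shrinker);
}

void shrinker_put(struct shrinker *shrinker)
{
	/* last reference gone: free the shrinker after a grace period */
	if (refcount_dec_and_test(&shrinker->refcount))
		call_rcu(&shrinker->rcu, shrinker_free_rcu_cb);
}
```

If I read it right, the removal path would then just be list_del_rcu()
followed by dropping the initial reference, with no completion needed.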

> 
> There are probably many other ways, but just to give an idea of a few
> other ways to do this.
> 
>>>> +
>>>>   		ret = do_shrink_slab(&sc, shrinker, priority);
>>>>   		if (ret == SHRINK_EMPTY)
>>>>   			ret = 0;
>>>>   		freed += ret;
>>>> -		/*
>>>> -		 * Bail out if someone want to register a new shrinker to
>>>> -		 * prevent the registration from being stalled for long periods
>>>> -		 * by parallel ongoing shrinking.
>>>> -		 */
>>>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>>>> -			freed = freed ? : 1;
>>>> -			break;
>>>> -		}
>>>> -	}
>>>>   
>>>> -	up_read(&shrinker_rwsem);
>>>> -out:
>>>> +		rcu_read_lock();
>>>
>>> That new rcu_read_lock() won't help AFAIK, the whole
>>> list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
>>> safe.
>>
>> Yeah, that's the pattern we've been taught and the one we can look
>> at and immediately say "this is safe".
>>
>> This is a different pattern, as has been explained by Qi, and I
>> think it *might* be safe.
>>
>> *However.*
>>
>> Right now I don't have time to go through a novel RCU list iteration
>> pattern one step at a time to determine the correctness of the
>> algorithm. I'm mostly worried about list manipulations that can
>> occur outside rcu_read_lock() section bleeding into the RCU
>> critical section because rcu_read_lock() by itself is not a memory
>> barrier.
>>
>> Maybe Paul has seen this pattern often enough he could simply tell
>> us what conditions it is safe in. But for me to work that out from
>> first principles? I just don't have the time to do that right now.
> 
> If the code does just the right sequence of things on the removal path
> (remove, decrement reference, wait for reference to go to zero, wait for
> grace period, free), then it would work.  If this is what is happening,
> I would argue for more comments.  ;-)

The order of the removal path is slightly different from this:

     shrink_slab                 unregister_shrinker
     ===========                 ===================

     shrinker_try_get()
     rcu_read_unlock()
                                 1. decrement initial reference
                                    shrinker_put()
                                 2. wait for reference to go to zero
                                    wait_for_completion()
     rcu_read_lock()

     shrinker_put()
                                 3. remove the shrinker from list
                                    list_del_rcu()
                                 4. wait for grace period
                                    kfree_rcu()/synchronize_rcu()

     list_for_each_entry()

     shrinker_try_get()
     rcu_read_unlock()
                                 5. free the shrinker

So the order is: decrement reference, wait for reference to go to zero,
remove, wait for grace period, free.

I think this can work. And *step 3* can only happen after we have
re-taken the RCU read lock, right? Please let me know if I missed something.
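
To make this concrete, a minimal sketch of how I see the two sides fitting
together, as it would sit in mm/vmscan.c (simplified: the memcg early-return
at the top of shrink_slab() is omitted, and the completion field `done` and
rcu_head field `rcu` are just placeholders for the fields the series adds):

```
static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
				 struct mem_cgroup *memcg, int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};
		unsigned long ret;

		/* pin this entry so it cannot be freed while unlocked */
		if (!shrinker_try_get(shrinker))
			continue;
		rcu_read_unlock();

		ret = do_shrink_slab(&sc, shrinker, priority);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		freed += ret;

		/*
		 * Re-take the read lock *before* dropping the reference,
		 * so the entry stays valid while we step to the next node.
		 */
		rcu_read_lock();
		shrinker_put(shrinker);
	}
	rcu_read_unlock();

	return freed;
}

void unregister_shrinker(struct shrinker *shrinker)
{
	shrinker_put(shrinker);			/* 1. drop the initial reference    */
	wait_for_completion(&shrinker->done);	/* 2. wait for refcount to hit zero */
	list_del_rcu(&shrinker->list);		/* 3. unlink from shrinker_list     */
	kfree_rcu(shrinker, rcu);		/* 4./5. grace period, then free    */
}
```

The point is that the final shrinker_put() is only called after
rcu_read_lock() has been re-taken, so by the time wait_for_completion()
returns and steps 3-5 run, the traverser is already inside a new read-side
critical section and the grace period in step 4 cannot end under it.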

Thanks,
Qi

> 
> 							Thanx, Paul
> 
>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>>
>> Yes, I suggested the IDR route because radix tree lookups under RCU
>> with reference counted objects are a known safe pattern that we can
>> easily confirm is correct or not.  Hence I suggested the unification
>> + IDR route because it makes the life of reviewers so, so much
>> easier...
>>
>> Cheers,
>>
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-06-24 11:08           ` Qi Zheng
  2023-06-25  3:15             ` Qi Zheng
@ 2023-07-04  4:20             ` Qi Zheng
  1 sibling, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-07-04  4:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: paulmck, Vlastimil Babka, akpm, tkhai, roman.gushchin, djwong,
	brauner, tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs

Hi Dave,

On 2023/6/24 19:08, Qi Zheng wrote:
> Hi Dave,
> 
> On 2023/6/24 06:19, Dave Chinner wrote:
>> On Fri, Jun 23, 2023 at 09:10:57PM +0800, Qi Zheng wrote:
>>> On 2023/6/23 14:29, Dave Chinner wrote:
>>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>>> with reference counted objects are a known safe pattern that we can
>>>> easily confirm is correct or not.  Hence I suggested the unification
>>>> + IDR route because it makes the life of reviewers so, so much
>>>> easier...
>>>
>>> In fact, I originally planned to try the unification + IDR method you
>>> suggested at the beginning. But in the case of CONFIG_MEMCG disabled,
>>> the struct mem_cgroup is not even defined, and root_mem_cgroup and
>>> shrinker_info will not be allocated.  This required more code 
>>> changes, so
>>> I ended up keeping the shrinker_list and implementing the above pattern.
>>
>> Yes. Go back and read what I originally said needed to be done
>> first. In the case of CONFIG_MEMCG=n, a dummy root memcg still needs
>> to exist that holds all of the global shrinkers. Then shrink_slab()
>> is only ever passed a memcg that should be iterated.
>>
>> Yes, it needs changes external to the shrinker code itself to be
>> made to work. And even if memcg's are not enabled, we can still use
>> the memcg structures to ensure a common abstraction is used for the
>> shrinker tracking infrastructure....
> 
> Yeah, what I imagined before was to define a more concise struct
> mem_cgroup in the case of CONFIG_MEMCG=n, then allocate a dummy root
> memcg on system boot:
> 
> #ifdef !CONFIG_MEMCG
> 
> struct shrinker_info {
>      struct rcu_head rcu;
>      atomic_long_t *nr_deferred;
>      unsigned long *map;
>      int map_nr_max;
> };
> 
> struct mem_cgroup_per_node {
>      struct shrinker_info __rcu    *shrinker_info;
> };
> 
> struct mem_cgroup {
>      struct mem_cgroup_per_node *nodeinfo[];
> };
> 
> #endif

Over the past few days I tried the following:

1. CONFIG_MEMCG && !mem_cgroup_disabled()

    track all global shrinkers with root_mem_cgroup.

2. CONFIG_MEMCG && mem_cgroup_disabled()

    the root_mem_cgroup is also allocated in this case, so still use
    root_mem_cgroup to track all global shrinkers.

3. !CONFIG_MEMCG

    allocate a dummy memcg during system startup (after cgroup_init())
    and use it to track all global shrinkers
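
In other words (just a sketch of the idea, with a hypothetical helper name,
and assuming the minimal struct mem_cgroup sketched above is defined for
!CONFIG_MEMCG), every case ends up with a single memcg that tracks the
global shrinkers:

```
#include <linux/memcontrol.h>

#ifndef CONFIG_MEMCG
/* hypothetical: the dummy memcg allocated during system startup (case 3) */
extern struct mem_cgroup *dummy_root_memcg;
#endif

static struct mem_cgroup *global_shrinker_memcg(void)
{
#ifdef CONFIG_MEMCG
	/*
	 * cases 1 and 2: root_mem_cgroup is allocated even when memcg
	 * is disabled on the command line
	 */
	return root_mem_cgroup;
#else
	/* case 3 */
	return dummy_root_memcg;
#endif
}
```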

This works, but requires modifying the startup order of some subsystems,
because some shrinkers are registered before root_mem_cgroup is
allocated, such as:

1. rcu-kfree shrinker in rcu_init()
2. super block shrinkers in vfs_caches_init()

And cgroup_init() also depends on some file system infrastructure, so
I made some changes (rough and unorganized):

diff --git a/fs/namespace.c b/fs/namespace.c
index e157efc54023..6a12d3d0064e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4706,7 +4706,7 @@ static void __init init_mount_tree(void)

  void __init mnt_init(void)
  {
-       int err;
+       //int err;

         mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
                        0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
@@ -4725,15 +4725,7 @@ void __init mnt_init(void)
         if (!mount_hashtable || !mountpoint_hashtable)
                 panic("Failed to allocate mount hash table\n");

-       kernfs_init();
-
-       err = sysfs_init();
-       if (err)
-               printk(KERN_WARNING "%s: sysfs_init error: %d\n",
-                       __func__, err);
-       fs_kobj = kobject_create_and_add("fs", NULL);
-       if (!fs_kobj)
-               printk(KERN_WARNING "%s: kobj create error\n", __func__);
         shmem_init();
         init_rootfs();
         init_mount_tree();
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 7d9c2a63b7cd..d87c67f6f66e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -119,6 +119,7 @@ static inline void call_rcu_hurry(struct rcu_head *head, rcu_callback_t func)

  /* Internal to kernel */
  void rcu_init(void);
+void rcu_shrinker_init(void);
  extern int rcu_scheduler_active;
  void rcu_sched_clock_irq(int user);
  void rcu_report_dead(unsigned int cpu);
diff --git a/init/main.c b/init/main.c
index ad920fac325c..4190fc6d10ad 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1049,14 +1049,22 @@ void start_kernel(void)
         security_init();
         dbg_late_init();
         net_ns_init();
+       kernfs_init();
+       if (sysfs_init())
+               printk(KERN_WARNING "%s: sysfs_init error\n",
+                       __func__);
+       fs_kobj = kobject_create_and_add("fs", NULL);
+       if (!fs_kobj)
+               printk(KERN_WARNING "%s: kobj create error\n", __func__);
+       proc_root_init();
+       cgroup_init();
         vfs_caches_init();
         pagecache_init();
         signals_init();
         seq_file_init();
-       proc_root_init();
         nsfs_init();
         cpuset_init();
-       cgroup_init();
+       rcu_shrinker_init();
         taskstats_init_early();
         delayacct_init();

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d068ce3567fc..71a04ae8defb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4953,7 +4953,10 @@ static void __init kfree_rcu_batch_init(void)
                INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func);
                 krcp->initialized = true;
         }
+}

+void __init rcu_shrinker_init(void)
+{
         kfree_rcu_shrinker = shrinker_alloc(0, "rcu-kfree");
         if (!kfree_rcu_shrinker) {
                 pr_err("Failed to allocate kfree_rcu() shrinker!\n");

I adjusted it step by step according to the errors reported, so there
may still be hidden problems (it needs more review and testing).

In addition, unifying the processing of global and memcg slab shrink
does have many benefits:

1. shrinker::nr_deferred can be removed
2. shrinker_list can be removed
3. simplifies the existing code logic and subsequent lockless processing

But I'm still a bit apprehensive about modifying the boot order. :(

What do you think about this?

Thanks,
Qi


> 
> But I have a concern: if all global shrinkers are tracked via the
> info->map of the root memcg, a shrinker->id needs to be assigned to each
> of them, which will cause info->map_nr_max to become larger than before,
> thus making the traversal of info->map slower.
> 
>>
>>> If the above pattern is not safe, I will go back to the unification +
>>> IDR method.
>>
>> And that is exactly how we got into this mess in the first place....
> 
> I only found one similar pattern in the kernel:
> 
> fs/smb/server/oplock.c:find_same_lease_key/smb_break_all_levII_oplock/lookup_lease_in_table
> 
> But IIUC, the refcount here needs to be decremented after holding
> rcu lock as I did above.
> 
> So regardless of whether we choose unification + IDR in the end, I still
> want to confirm whether the pattern I implemented above is safe. :)
> 
> Thanks,
> Qi
> 
>>
>> -Dave

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
  2023-07-04  3:45         ` Qi Zheng
@ 2023-07-05  3:27           ` Qi Zheng
  0 siblings, 0 replies; 55+ messages in thread
From: Qi Zheng @ 2023-07-05  3:27 UTC (permalink / raw)
  To: paulmck, Dave Chinner
  Cc: Vlastimil Babka, akpm, tkhai, roman.gushchin, djwong, brauner,
	tytso, linux-kernel, linux-mm, intel-gfx, dri-devel,
	linux-arm-msm, dm-devel, linux-raid, linux-bcache,
	virtualization, linux-fsdevel, linux-ext4, linux-nfs, linux-xfs,
	linux-btrfs



On 2023/7/4 11:45, Qi Zheng wrote:
> 
> 
> On 2023/7/4 00:39, Paul E. McKenney wrote:
>> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t 
>>>>> gfp_mask, int nid,
>>>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>> -    if (!down_read_trylock(&shrinker_rwsem))
>>>>> -        goto out;
>>>>> -
>>>>> -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>> +    rcu_read_lock();
>>>>> +    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>>           struct shrink_control sc = {
>>>>>               .gfp_mask = gfp_mask,
>>>>>               .nid = nid,
>>>>>               .memcg = memcg,
>>>>>           };
>>>>> +        if (!shrinker_try_get(shrinker))
>>>>> +            continue;
>>>>> +        rcu_read_unlock();
>>>>
>>>> I don't think you can do this unlock?
>>
>> Sorry to be slow to respond here, this one fell through the cracks.
>> And thank you to Qi for reminding me!
>>
>> If you do this unlock, you had jolly well better nail down the current
>> element (the one referenced by shrinker), for example, by acquiring an
>> explicit reference count on the object.  And presumably this is exactly
>> what shrinker_try_get() is doing.  And a look at your 24/29 confirms 
>> this,
>> at least assuming that shrinker->refcount is set to zero before the call
>> to synchronize_rcu() in free_module() *and* that synchronize_rcu() 
>> doesn't
>> start until *after* shrinker_put() calls complete().  Plus, as always,
>> the object must be removed from the list before the synchronize_rcu()
>> starts.  (On these parts of the puzzle, I defer to those more familiar
>> with this code path.  And I strongly suggest carefully commenting this
>> type of action-at-a-distance design pattern.)
> 
> Yeah, I think I've done it like above. A more detailed timing diagram is
> below.
> 
>>
>> Why is this important?  Because otherwise that object might be freed
>> before you get to the call to rcu_read_lock() at the end of this loop.
>> And if that happens, list_for_each_entry_rcu() will be walking the
>> freelist, which is quite bad for the health and well-being of your 
>> kernel.
>>
>> There are a few other ways to make this sort of thing work:
>>
>> 1.    Defer the shrinker_put() to the beginning of the loop.
>>     You would need a flag initially set to zero, and then set to
>>     one just before (or just after) the rcu_read_lock() above.
>>     You would also need another shrinker_old pointer to track the
>>     old pointer.  Then at the top of the loop, if the flag is set,
>>     invoke shrinker_put() on shrinker_old.    This ensures that the
>>     previous shrinker structure stays around long enough to allow
>>     the loop to find the next shrinker structure in the list.
>>
>>     This approach is attractive when the removal code path
>>     can invoke shrinker_put() after the grace period ends.
>>
>> 2.    Make shrinker_put() invoke call_rcu() when ->refcount reaches
>>     zero, and have the callback function free the object.  This of
>>     course requires adding an rcu_head structure to the shrinker
>>     structure, which might or might not be a reasonable course of
>>     action.  If adding that rcu_head is reasonable, this simplifies
>>     the logic quite a bit.
>>
>> 3.    For the shrinker-structure-removal code path, remove the shrinker
>>     structure, then remove the initial count from ->refcount,
>>     and then keep doing grace periods until ->refcount is zero,
>>     then do one more.  Of course, if the result of removing the
>>     initial count was zero, then only a single additional grace
>>     period is required.
>>
>>     This would need to be carefully commented, as it is a bit
>>     unconventional.
> 
> Thanks for such a detailed addition!
> 
>>
>> There are probably many other ways, but just to give an idea of a few
>> other ways to do this.
>>
>>>>> +
>>>>>           ret = do_shrink_slab(&sc, shrinker, priority);
>>>>>           if (ret == SHRINK_EMPTY)
>>>>>               ret = 0;
>>>>>           freed += ret;
>>>>> -        /*
>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>> -         * prevent the registration from being stalled for long 
>>>>> periods
>>>>> -         * by parallel ongoing shrinking.
>>>>> -         */
>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> -            freed = freed ? : 1;
>>>>> -            break;
>>>>> -        }
>>>>> -    }
>>>>> -    up_read(&shrinker_rwsem);
>>>>> -out:
>>>>> +        rcu_read_lock();
>>>>
>>>> That new rcu_read_lock() won't help AFAIK, the whole
>>>> list_for_each_entry_rcu() needs to be under the single 
>>>> rcu_read_lock() to be
>>>> safe.
>>>
>>> Yeah, that's the pattern we've been taught and the one we can look
>>> at and immediately say "this is safe".
>>>
>>> This is a different pattern, as has been explained by Qi, and I
>>> think it *might* be safe.
>>>
>>> *However.*
>>>
>>> Right now I don't have time to go through a novel RCU list iteration
>>> pattern one step at a time to determine the correctness of the
>>> algorithm. I'm mostly worried about list manipulations that can
>>> occur outside rcu_read_lock() section bleeding into the RCU
>>> critical section because rcu_read_lock() by itself is not a memory
>>> barrier.
>>>
>>> Maybe Paul has seen this pattern often enough he could simply tell
>>> us what conditions it is safe in. But for me to work that out from
>>> first principles? I just don't have the time to do that right now.
>>
>> If the code does just the right sequence of things on the removal path
>> (remove, decrement reference, wait for reference to go to zero, wait for
>> grace period, free), then it would work.  If this is what is happening,
>> I would argue for more comments.  ;-)
> 
> The order of the removal path is slightly different from this:
> 
>      shrink_slab                 unregister_shrinker
>      ===========                 ===================
> 
>      shrinker_try_get()
>      rcu_read_unlock()
>                                  1. decrement initial reference
>                                     shrinker_put()
>                                  2. wait for reference to go to zero
>                                     wait_for_completion()
>      rcu_read_lock()
> 
>      shrinker_put()
>                                  3. remove the shrinker from list
>                                     list_del_rcu()
>                                  4. wait for grace period
>                                     kfree_rcu()/synchronize_rcu()
> 
>      list_for_each_entry()
> 
>      shrinker_try_get()
>      rcu_read_unlock()
>                                  5. free the shrinker
> 
> So the order is: decrement reference, wait for reference to go to zero,
> remove, wait for grace period, free.
> 
> I think this can work. And *step 3* can only happen after we have
> re-taken the RCU read lock, right? Please let me know if I missed something.

Oh, you are right, it would be better to move step 3 up to be step 1. We
should first remove the shrinker from the shrinker_list to prevent
other traversers from finding it again; otherwise the following
situation may theoretically occur:

CPU 0                 CPU 1

shrinker_try_get()

                       shrinker_try_get()

shrinker_put()
shrinker_try_get()
                       shrinker_put()
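
So the removal path would become something like the following (again only
a sketch; the completion and rcu_head field names are placeholders, and
kfree_rcu() stands in for "wait a grace period, then free"):

```
void unregister_shrinker(struct shrinker *shrinker)
{
	/* old step 3 first: unlink, so new traversals cannot find us */
	list_del_rcu(&shrinker->list);
	/* drop the initial reference */
	shrinker_put(shrinker);
	/* wait until all in-flight shrink_slab() users drop theirs */
	wait_for_completion(&shrinker->done);
	/* wait a grace period before the memory can be reused */
	kfree_rcu(shrinker, rcu);
}
```

With the list_del_rcu() done first, a traverser that has already dropped
its reference cannot find the shrinker and take it again.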

Thanks,
Qi

> 
> Thanks,
> Qi
> 
>>
>>                             Thanx, Paul
>>
>>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses 
>>>> IDR.
>>>
>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>> with reference counted objects are a known safe pattern that we can
>>> easily confirm is correct or not.  Hence I suggested the unification
>>> + IDR route because it makes the life of reviewers so, so much
>>> easier...
>>>
>>> Cheers,
>>>
>>> Dave.
>>> -- 
>>> Dave Chinner
>>> david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2023-07-05  3:28 UTC | newest]

Thread overview: 55+ messages
2023-06-22  8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
2023-06-22  8:53 ` [PATCH 01/29] mm: shrinker: add shrinker::private_data field Qi Zheng
2023-06-22 14:47   ` Vlastimil Babka
2023-06-23 12:50     ` [External] " Qi Zheng
2023-06-22  8:53 ` [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker Qi Zheng
2023-06-23  6:12   ` Dave Chinner
2023-06-23 12:49     ` Qi Zheng
2023-06-22  8:53 ` [PATCH 03/29] drm/i915: dynamically allocate the i915_gem_mm shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 04/29] drm/msm: dynamically allocate the drm-msm_gem shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker Qi Zheng
2023-06-23 13:33   ` Qi Zheng
2023-06-22  8:53 ` [PATCH 06/29] dm: dynamically allocate the dm-bufio shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 07/29] dm zoned: dynamically allocate the dm-zoned-meta shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 08/29] md/raid5: dynamically allocate the md-raid5 shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 09/29] bcache: dynamically allocate the md-bcache shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 10/29] vmw_balloon: dynamically allocate the vmw-balloon shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 11/29] virtio_balloon: dynamically allocate the virtio-balloon shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 12/29] mbcache: dynamically allocate the mbcache shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 13/29] ext4: dynamically allocate the ext4-es shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 14/29] jbd2,ext4: dynamically allocate the jbd2-journal shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker Qi Zheng
2023-06-23 21:49   ` Chuck Lever
2023-06-24 11:17     ` Qi Zheng
2023-06-22  8:53 ` [PATCH 16/29] NFSD: dynamically allocate the nfsd-reply shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 17/29] xfs: dynamically allocate the xfs-buf shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 18/29] xfs: dynamically allocate the xfs-inodegc shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 19/29] xfs: dynamically allocate the xfs-qm shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 20/29] zsmalloc: dynamically allocate the mm-zspool shrinker Qi Zheng
2023-06-22  8:53 ` [PATCH 21/29] fs: super: dynamically allocate the s_shrink Qi Zheng
2023-06-22  8:53 ` [PATCH 22/29] drm/ttm: introduce pool_shrink_rwsem Qi Zheng
2023-06-22  8:53 ` [PATCH 23/29] mm: shrinker: add refcount and completion_wait fields Qi Zheng
2023-06-22  8:53 ` [PATCH 24/29] mm: vmscan: make global slab shrink lockless Qi Zheng
2023-06-22 15:12   ` Vlastimil Babka
2023-06-22 16:42     ` Qi Zheng
2023-06-22 17:41       ` Alan Huang
2023-06-22 18:18         ` Qi Zheng
2023-06-23  6:29     ` Dave Chinner
2023-06-23 13:10       ` Qi Zheng
2023-06-23 22:19         ` Dave Chinner
2023-06-24 11:08           ` Qi Zheng
2023-06-25  3:15             ` Qi Zheng
2023-07-04  4:20             ` Qi Zheng
2023-07-03 16:39       ` Paul E. McKenney
2023-07-04  3:45         ` Qi Zheng
2023-07-05  3:27           ` Qi Zheng
2023-06-22  8:53 ` [PATCH 25/29] mm: vmscan: make memcg " Qi Zheng
2023-06-22  8:53 ` [PATCH 26/29] mm: shrinker: make count and scan in shrinker debugfs lockless Qi Zheng
2023-06-22  8:53 ` [PATCH 27/29] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-06-22  8:53 ` [PATCH 28/29] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
2023-06-22  8:53 ` [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file Qi Zheng
2023-06-22 14:53   ` Vlastimil Babka
2023-06-23 13:12     ` Qi Zheng
2023-06-23  5:25   ` Sergey Senozhatsky
2023-06-23 13:24     ` Qi Zheng
2023-06-22  9:02 ` [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
