* [PATCH 0/8] make unregistration of super_block shrinker faster
@ 2023-05-31  9:57 Qi Zheng
  2023-05-31  9:57 ` [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu() Qi Zheng
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Hi all,

This patch series aims to make unregistration of the super_block shrinker
faster.

1. Background
=============

The kernel test robot noticed a -88.8% regression of stress-ng.ramfs.ops_per_sec
on commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
details can be found at link [1] below.

[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/

The result can be reproduced with the following command:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) before commit f95bdb700bc6b:

stress-ng: info:  [11023] dispatching hogs: 9 ramfs
stress-ng: info:  [11023] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [11023]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [11023] ramfs            774966     60.00     10.18    169.45     12915.89        4314.26
stress-ng: info:  [11023] for a 60.00s run time:
stress-ng: info:  [11023]    1920.11s available CPU time
stress-ng: info:  [11023]      10.18s user time   (  0.53%)
stress-ng: info:  [11023]     169.44s system time (  8.82%)
stress-ng: info:  [11023]     179.62s total time  (  9.35%)
stress-ng: info:  [11023] load average: 8.99 2.69 0.93
stress-ng: info:  [11023] successful run completed in 60.00s (1 min, 0.00 secs)

2) after commit f95bdb700bc6b:

stress-ng: info:  [37676] dispatching hogs: 9 ramfs
stress-ng: info:  [37676] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [37676]                           (secs)    (secs)   (secs)   (real time) (usr+sys time)
stress-ng: info:  [37676] ramfs            168673     60.00     1.61    39.66      2811.08        4087.47
stress-ng: info:  [37676] for a 60.10s run time:
stress-ng: info:  [37676]    1923.36s available CPU time
stress-ng: info:  [37676]       1.60s user time   (  0.08%)
stress-ng: info:  [37676]      39.66s system time (  2.06%)
stress-ng: info:  [37676]      41.26s total time  (  2.15%)
stress-ng: info:  [37676] load average: 7.69 3.63 2.36
stress-ng: info:  [37676] successful run completed in 60.10s (1 min, 0.10 secs)

The root cause is that SRCU deliberately avoids frequent checks for SRCU
read-side critical section exits, because such checks are expensive. Paul E.
McKenney gave a detailed explanation:

```
In practice, the act of checking to see if there is anyone in an SRCU
read-side critical section is a heavy-weight operation, involving at
least one cache miss per CPU along with a number of full memory barriers.
```

Therefore, even if no one is currently in the SRCU read-side critical section,
synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has
become slower.

2. Idea
=======

2.1 use synchronize_srcu_expedited() ?
--------------------------------------

synchronize_srcu_expedited() makes SRCU much more aggressive. If we use it
to replace synchronize_srcu() in unregister_shrinker(), the ops/s returns
to the previous level:

stress-ng: info:  [13159] dispatching hogs: 9 ramfs
stress-ng: info:  [13159] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [13159]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [13159] ramfs            710062     60.00      9.63    157.26     11834.18        4254.75
stress-ng: info:  [13159] for a 60.00s run time:
stress-ng: info:  [13159]    1920.14s available CPU time
stress-ng: info:  [13159]       9.62s user time   (  0.50%)
stress-ng: info:  [13159]     157.26s system time (  8.19%)
stress-ng: info:  [13159]     166.88s total time  (  8.69%)
stress-ng: info:  [13159] load average: 9.49 4.02 1.65
stress-ng: info:  [13159] successful run completed in 60.00s (1 min, 0.00 secs)

However, since SRCU (Sleepable RCU) allows readers to sleep in the read-side
critical section, synchronize_srcu_expedited() may consume a lot of CPU, so
this is not a good choice.

2.2 move synchronize_srcu() to the asynchronous delayed work
------------------------------------------------------------

Kirill Tkhai proposed a better idea[2] in 2018: move synchronize_srcu() into
asynchronous delayed work, so that it does not affect the user-visible
unregistration speed.

[2]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/

After applying his patches ([PATCH RFC 04/10]~[PATCH RFC 10/10], with a few
conflicts resolved), the ops/s is of course back to the previous level:

stress-ng: info:  [11506] setting to a 60 second run per stressor
stress-ng: info:  [11506] dispatching hogs: 9 ramfs
stress-ng: info:  [11506] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info:  [11506]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [11506] ramfs            829462     60.00     10.81    174.25     13824.14        4482.08
stress-ng: info:  [11506] for a 60.00s run time:
stress-ng: info:  [11506]    1920.12s available CPU time
stress-ng: info:  [11506]      10.81s user time   (  0.56%)
stress-ng: info:  [11506]     174.25s system time (  9.07%)
stress-ng: info:  [11506]     185.06s total time  (  9.64%)
stress-ng: info:  [11506] load average: 8.96 2.60 0.89
stress-ng: info:  [11506] successful run completed in 60.00s (1 min, 0.00 secs)

In order to continue advancing this patch set, I have rebased these patches
onto next-20230525. Any comments and suggestions are welcome.

Note: This patch series covers only the super_block shrinker; other places
where unregistration is time-critical can follow the same approach.

Thanks,
Qi

Kirill Tkhai (7):
  mm: vmscan: split unregister_shrinker()
  fs: move list_lru_destroy() to destroy_super_work()
  fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()
  fs: introduce struct super_operations::destroy_super() callback
  xfs: introduce xfs_fs_destroy_super()
  shmem: implement shmem_destroy_super()
  fs: use unregister_shrinker_delayed_{initiate, finalize} for
    super_block shrinker

Qi Zheng (1):
  mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()

 fs/super.c               | 32 ++++++++++++++++++--------------
 fs/xfs/xfs_super.c       | 25 ++++++++++++++++++++++---
 include/linux/fs.h       |  6 ++++++
 include/linux/shrinker.h |  2 ++
 mm/shmem.c               |  8 ++++++++
 mm/vmscan.c              | 26 ++++++++++++++++++++------
 6 files changed, 76 insertions(+), 23 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
  2023-05-31 10:49   ` Christian Brauner
  2023-05-31  9:57 ` [PATCH 2/8] mm: vmscan: split unregister_shrinker() Qi Zheng
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

debugfs_remove_recursive() waits for debugfs_file_put() to
return, so there is no need to call it after synchronize_srcu()
to wait for the RCU read-side critical section to exit.

Just move it before synchronize_srcu(); this also makes it
convenient to move the heavy synchronize_srcu() into delayed work later.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeca83e28c9b..a773e97e152e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -818,11 +818,11 @@ void unregister_shrinker(struct shrinker *shrinker)
 	debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
 	mutex_unlock(&shrinker_mutex);
 
+	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
+
 	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
 
-	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
-
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
 }
-- 
2.30.2



* [PATCH 2/8] mm: vmscan: split unregister_shrinker()
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
  2023-05-31  9:57 ` [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu() Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
  2023-05-31 22:57   ` Dave Chinner
  2023-05-31  9:57 ` [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work() Qi Zheng
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

This and the following patches in this series aim to make the
time cost of synchronize_srcu() invisible to users.
This patch splits unregister_shrinker() into two functions:

	unregister_shrinker_delayed_initiate()
	unregister_shrinker_delayed_finalize()

so that shrinker users may call the second one asynchronously
(e.g. from a workqueue). The next patches make the superblock
shrinker follow this approach, so the user-visible umount()
time won't contain delays from synchronize_srcu().

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/shrinker.h |  2 ++
 mm/vmscan.c              | 22 ++++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 224293b2dd06..e9d5a19d83fe 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -102,6 +102,8 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
 extern int __printf(2, 3) register_shrinker(struct shrinker *shrinker,
 					    const char *fmt, ...);
 extern void unregister_shrinker(struct shrinker *shrinker);
+extern void unregister_shrinker_delayed_initiate(struct shrinker *shrinker);
+extern void unregister_shrinker_delayed_finalize(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
 extern void synchronize_shrinkers(void);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a773e97e152e..baf8d2327d70 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -799,10 +799,7 @@ int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
 #endif
 EXPORT_SYMBOL(register_shrinker);
 
-/*
- * Remove one
- */
-void unregister_shrinker(struct shrinker *shrinker)
+void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
 {
 	struct dentry *debugfs_entry;
 	int debugfs_id;
@@ -819,6 +816,13 @@ void unregister_shrinker(struct shrinker *shrinker)
 	mutex_unlock(&shrinker_mutex);
 
 	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
+}
+EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
+
+void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
+{
+	if (!shrinker->nr_deferred)
+		return;
 
 	atomic_inc(&shrinker_srcu_generation);
 	synchronize_srcu(&shrinker_srcu);
@@ -826,6 +830,16 @@ void unregister_shrinker(struct shrinker *shrinker)
 	kfree(shrinker->nr_deferred);
 	shrinker->nr_deferred = NULL;
 }
+EXPORT_SYMBOL(unregister_shrinker_delayed_finalize);
+
+/*
+ * Remove one
+ */
+void unregister_shrinker(struct shrinker *shrinker)
+{
+	unregister_shrinker_delayed_initiate(shrinker);
+	unregister_shrinker_delayed_finalize(shrinker);
+}
 EXPORT_SYMBOL(unregister_shrinker);
 
 /**
-- 
2.30.2



* [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work()
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
  2023-05-31  9:57 ` [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu() Qi Zheng
  2023-05-31  9:57 ` [PATCH 2/8] mm: vmscan: split unregister_shrinker() Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
  2023-05-31 23:00   ` Dave Chinner
  2023-05-31  9:57 ` [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan() Qi Zheng
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

This patch makes s_dentry_lru and s_inode_lru be destroyed
later, from the workqueue. This is preparation for splitting
unregister_shrinker(super_block::s_shrink) into two stages
and calling the finalize stage from destroy_super_work().

Note that generic filesystem shrinker unregistration is safe
to split into two stages right after this patch, since
super_cache_count() and super_cache_scan() deal only with
s_dentry_lru and s_inode_lru.

But there are two exceptions, XFS and shmem, which define
.nr_cached_objects() and .free_cached_objects() callbacks.
These two do not allow the split right after this patch:
they touch fs-specific data, which is destroyed earlier than
destroy_super_work(). So we cannot yet call
unregister_shrinker_delayed_finalize() from destroy_super_work();
the next patches make the preparations that allow it.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/super.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 8d8d68799b34..2ce4c72720f3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -159,6 +159,11 @@ static void destroy_super_work(struct work_struct *work)
 							destroy_work);
 	int i;
 
+	WARN_ON(list_lru_count(&s->s_dentry_lru));
+	WARN_ON(list_lru_count(&s->s_inode_lru));
+	list_lru_destroy(&s->s_dentry_lru);
+	list_lru_destroy(&s->s_inode_lru);
+
 	for (i = 0; i < SB_FREEZE_LEVELS; i++)
 		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
 	kfree(s);
@@ -177,8 +182,6 @@ static void destroy_unused_super(struct super_block *s)
 	if (!s)
 		return;
 	up_write(&s->s_umount);
-	list_lru_destroy(&s->s_dentry_lru);
-	list_lru_destroy(&s->s_inode_lru);
 	security_sb_free(s);
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
@@ -287,8 +290,6 @@ static void __put_super(struct super_block *s)
 {
 	if (!--s->s_count) {
 		list_del_init(&s->s_list);
-		WARN_ON(s->s_dentry_lru.node);
-		WARN_ON(s->s_inode_lru.node);
 		WARN_ON(!list_empty(&s->s_mounts));
 		security_sb_free(s);
 		put_user_ns(s->s_user_ns);
@@ -330,14 +331,6 @@ void deactivate_locked_super(struct super_block *s)
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
 
-		/*
-		 * Since list_lru_destroy() may sleep, we cannot call it from
-		 * put_super(), where we hold the sb_lock. Therefore we destroy
-		 * the lru lists right now.
-		 */
-		list_lru_destroy(&s->s_dentry_lru);
-		list_lru_destroy(&s->s_inode_lru);
-
 		put_filesystem(fs);
 		put_super(s);
 	} else {
-- 
2.30.2



* [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
                   ` (2 preceding siblings ...)
  2023-05-31  9:57 ` [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work() Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
  2023-05-31 23:12   ` Dave Chinner
  2023-05-31  9:57 ` [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback Qi Zheng
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

This patch prepares the superblock shrinker for delayed unregistering.
It makes super_cache_scan() avoid shrinking inactive superblocks,
using SB_ACTIVE as the indicator. If the superblock is not active,
super_cache_scan() just exits with SHRINK_STOP as the result.

Note that SB_ACTIVE is cleared in generic_shutdown_super() under the
write lock of s_umount. super_cache_scan() takes the read lock of
s_umount, so it cannot miss the cleared flag.

The SB_BORN check is added to super_cache_scan() just for uniformity
with super_cache_count(), while super_cache_count() receives the
SB_ACTIVE check just for uniformity with super_cache_scan().

After this patch super_cache_scan() ignores superblocks that are being
unregistered, so it is compatible with splitting unregister_shrinker().
The next patches prepare super_cache_count() to follow suit.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/super.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 2ce4c72720f3..2ce54561e82e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -79,6 +79,11 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
+	if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE)) {
+		freed = SHRINK_STOP;
+		goto unlock;
+	}
+
 	if (sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
 
@@ -110,6 +115,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 		freed += sb->s_op->free_cached_objects(sb, sc);
 	}
 
+unlock:
 	up_read(&sb->s_umount);
 	return freed;
 }
@@ -136,7 +142,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	 * avoid this situation, so do the same here. The memory barrier is
 	 * matched with the one in mount_fs() as we don't hold locks here.
 	 */
-	if (!(sb->s_flags & SB_BORN))
+	if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE))
 		return 0;
 	smp_rmb();
 
-- 
2.30.2



* [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
                   ` (3 preceding siblings ...)
  2023-05-31  9:57 ` [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan() Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
  2023-05-31 11:19   ` Christian Brauner
  2023-05-31  9:57 ` [PATCH 6/8] xfs: introduce xfs_fs_destroy_super() Qi Zheng
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

The patch introduces a new callback, which will be called
asynchronously from delayed work.

This will allow ::nr_cached_objects() to be made safe to call
on a superblock that is being destroyed (in the next patches),
and unregister_shrinker() to be split into two primitives.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/super.c         | 3 +++
 include/linux/fs.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 2ce54561e82e..4e9d08224f86 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -170,6 +170,9 @@ static void destroy_super_work(struct work_struct *work)
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
 
+	if (s->s_op->destroy_super)
+		s->s_op->destroy_super(s);
+
 	for (i = 0; i < SB_FREEZE_LEVELS; i++)
 		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
 	kfree(s);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0b54ac1d331b..30b46d0facfc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1910,6 +1910,7 @@ struct super_operations {
 	int (*drop_inode) (struct inode *);
 	void (*evict_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
+	void (*destroy_super) (struct super_block *);
 	int (*sync_fs)(struct super_block *sb, int wait);
 	int (*freeze_super) (struct super_block *);
 	int (*freeze_fs) (struct super_block *);
-- 
2.30.2



* [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
                   ` (4 preceding siblings ...)
  2023-05-31  9:57 ` [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
  2023-05-31 23:48   ` Dave Chinner
  2023-05-31  9:57 ` [PATCH 7/8] shmem: implement shmem_destroy_super() Qi Zheng
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

xfs_fs_nr_cached_objects() touches sb->s_fs_info,
and this patch makes it be destroyed later.

After this patch xfs_fs_nr_cached_objects() is safe
for the unregister_shrinker() split: mp->m_perag_tree
is stable until destroy_super_work(), while iteration
over it is already RCU-protected by XFS internals.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 7e706255f165..694616524c76 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -743,11 +743,18 @@ xfs_fs_drop_inode(
 }
 
 static void
-xfs_mount_free(
+xfs_free_names(
 	struct xfs_mount	*mp)
 {
 	kfree(mp->m_rtname);
 	kfree(mp->m_logname);
+}
+
+static void
+xfs_mount_free(
+	struct xfs_mount	*mp)
+{
+	xfs_free_names(mp);
 	kmem_free(mp);
 }
 
@@ -1136,8 +1143,19 @@ xfs_fs_put_super(
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
 
-	sb->s_fs_info = NULL;
-	xfs_mount_free(mp);
+	xfs_free_names(mp);
+}
+
+static void
+xfs_fs_destroy_super(
+	struct super_block	*sb)
+{
+	if (sb->s_fs_info) {
+		struct xfs_mount	*mp = XFS_M(sb);
+
+		kmem_free(mp);
+		sb->s_fs_info = NULL;
+	}
 }
 
 static long
@@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
 	.dirty_inode		= xfs_fs_dirty_inode,
 	.drop_inode		= xfs_fs_drop_inode,
 	.put_super		= xfs_fs_put_super,
+	.destroy_super		= xfs_fs_destroy_super,
 	.sync_fs		= xfs_fs_sync_fs,
 	.freeze_fs		= xfs_fs_freeze,
 	.unfreeze_fs		= xfs_fs_unfreeze,
-- 
2.30.2



* [PATCH 7/8] shmem: implement shmem_destroy_super()
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
                   ` (5 preceding siblings ...)
  2023-05-31  9:57 ` [PATCH 6/8] xfs: introduce xfs_fs_destroy_super() Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
  2023-05-31  9:57 ` [PATCH 8/8] fs: use unregister_shrinker_delayed_{initiate, finalize} for super_block shrinker Qi Zheng
       [not found] ` <20230531114054.bf077db642aa9c58c0831687@linux-foundation.org>
  8 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

Similar to xfs_fs_destroy_super(), implement the method
for shmem.

shmem_unused_huge_count() only touches sb->s_fs_info.
With this later freeing, it is safe for the
unregister_shrinker() split (made in the next patch).

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/shmem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 71e6c9855770..a4c3fdf23838 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3930,6 +3930,12 @@ static void shmem_put_super(struct super_block *sb)
 	free_percpu(sbinfo->ino_batch);
 	percpu_counter_destroy(&sbinfo->used_blocks);
 	mpol_put(sbinfo->mpol);
+}
+
+static void shmem_destroy_super(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+
 	kfree(sbinfo);
 	sb->s_fs_info = NULL;
 }
@@ -4018,6 +4024,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 
 failed:
 	shmem_put_super(sb);
+	shmem_destroy_super(sb);
 	return -ENOMEM;
 }
 
@@ -4182,6 +4189,7 @@ static const struct super_operations shmem_ops = {
 	.evict_inode	= shmem_evict_inode,
 	.drop_inode	= generic_delete_inode,
 	.put_super	= shmem_put_super,
+	.destroy_super	= shmem_destroy_super,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	.nr_cached_objects	= shmem_unused_huge_count,
 	.free_cached_objects	= shmem_unused_huge_scan,
-- 
2.30.2



* [PATCH 8/8] fs: use unregister_shrinker_delayed_{initiate, finalize} for super_block shrinker
  2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker faster Qi Zheng
                   ` (6 preceding siblings ...)
  2023-05-31  9:57 ` [PATCH 7/8] shmem: implement shmem_destroy_super() Qi Zheng
@ 2023-05-31  9:57 ` Qi Zheng
       [not found] ` <20230531114054.bf077db642aa9c58c0831687@linux-foundation.org>
  8 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2023-05-31  9:57 UTC (permalink / raw)
  To: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song
  Cc: linux-mm, linux-fsdevel, linux-xfs, linux-kernel, Qi Zheng

From: Kirill Tkhai <tkhai@ya.ru>

Previous patches made all the data touched by super_cache_count()
be destroyed from destroy_super_work(): s_dentry_lru, s_inode_lru
and super_block::s_fs_info.

super_cache_scan() can't be called after SB_ACTIVE is cleared
in generic_shutdown_super().

So it is safe to move the heavy unregister_shrinker_delayed_finalize()
part to delayed work, i.e. it is safe for a parallel do_shrink_slab()
to execute between unregister_shrinker_delayed_initiate() and
destroy_super_work()->unregister_shrinker_delayed_finalize().

This keeps the heavy synchronize_srcu() from affecting the
user-visible unregistration speed (it now executes from a workqueue).

Other places where unregistration is time-critical can follow
the same approach.

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/super.c         | 4 +++-
 include/linux/fs.h | 5 +++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index 4e9d08224f86..c61efb74fa7f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -165,6 +165,8 @@ static void destroy_super_work(struct work_struct *work)
 							destroy_work);
 	int i;
 
+	unregister_shrinker_delayed_finalize(&s->s_shrink);
+
 	WARN_ON(list_lru_count(&s->s_dentry_lru));
 	WARN_ON(list_lru_count(&s->s_inode_lru));
 	list_lru_destroy(&s->s_dentry_lru);
@@ -337,7 +339,7 @@ void deactivate_locked_super(struct super_block *s)
 {
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
-		unregister_shrinker(&s->s_shrink);
+		unregister_shrinker_delayed_initiate(&s->s_shrink);
 		fs->kill_sb(s);
 
 		put_filesystem(fs);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 30b46d0facfc..869dd8de91a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1929,6 +1929,11 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 	struct dquot **(*get_dquots)(struct inode *);
 #endif
+	/*
+	 * The shrinker may call these two functions on a super_block being
+	 * destroyed until unregister_shrinker_delayed_finalize() has completed
+	 * in destroy_super_work(), and they must handle that.
+	 */
 	long (*nr_cached_objects)(struct super_block *,
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
-- 
2.30.2



* Re: [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()
  2023-05-31  9:57 ` [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu() Qi Zheng
@ 2023-05-31 10:49   ` Christian Brauner
  2023-05-31 12:52     ` Qi Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2023-05-31 10:49 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, djwong, hughd,
	paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Wed, May 31, 2023 at 09:57:35AM +0000, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> debugfs_remove_recursive() waits for debugfs_file_put() to
> return, so there is no need to call it after synchronize_srcu()
> to wait for the RCU read-side critical section to exit.
> 
> Just move it before synchronize_srcu(); this also makes it
> convenient to move the heavy synchronize_srcu() into delayed work later.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---

Afaict, should be a patch independent of this series.


* Re: [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback
  2023-05-31  9:57 ` [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback Qi Zheng
@ 2023-05-31 11:19   ` Christian Brauner
  2023-05-31 12:54     ` Qi Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2023-05-31 11:19 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, djwong, hughd,
	paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng, Christoph Hellwig

On Wed, May 31, 2023 at 09:57:39AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> The patch introduces a new callback, which will be called
> asynchronously from delayed work.
> 
> This will allow ::nr_cached_objects() to be made safe to call
> on a superblock that is being destroyed (in the next patches),
> and unregister_shrinker() to be split into two primitives.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  fs/super.c         | 3 +++
>  include/linux/fs.h | 1 +
>  2 files changed, 4 insertions(+)

Misses updates to
Documentation/filesystems/locking.rst
Documentation/filesystems/vfs.rst


* Re: [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu()
  2023-05-31 10:49   ` Christian Brauner
@ 2023-05-31 12:52     ` Qi Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2023-05-31 12:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, djwong, hughd,
	paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng



On 2023/5/31 18:49, Christian Brauner wrote:
> On Wed, May 31, 2023 at 09:57:35AM +0000, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> The debugfs_remove_recursive() will wait for debugfs_file_put()
>> to return, so there is no need to put it after synchronize_srcu()
>> to wait for the rcu read-side critical section to exit.
>>
>> Just move it before synchronize_srcu(), which is also convenient
>> to put the heavy synchronize_srcu() in the delayed work later.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
> 
> Afaict, should be a patch independent of this series.

OK, will resend as an independent patch.

Thanks,
Qi


* Re: [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback
  2023-05-31 11:19   ` Christian Brauner
@ 2023-05-31 12:54     ` Qi Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2023-05-31 12:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, djwong, hughd,
	paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng, Christoph Hellwig



On 2023/5/31 19:19, Christian Brauner wrote:
> On Wed, May 31, 2023 at 09:57:39AM +0000, Qi Zheng wrote:
>> From: Kirill Tkhai <tkhai@ya.ru>
>>
>> The patch introduces a new callback, which will be called
>> asynchronously from delayed work.
>>
>> This will allow ::nr_cached_objects() to be safely called
>> on a superblock being destroyed in the next patches, and
>> unregister_shrinker() to be split into two primitives.
>>
>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   fs/super.c         | 3 +++
>>   include/linux/fs.h | 1 +
>>   2 files changed, 4 insertions(+)
> 
> Misses updates to
> Documentation/filesystems/locking.rst
> Documentation/filesystems/vfs.rst

Will do.

Thanks,
Qi


* Re: [PATCH 2/8] mm: vmscan: split unregister_shrinker()
  2023-05-31  9:57 ` [PATCH 2/8] mm: vmscan: split unregister_shrinker() Qi Zheng
@ 2023-05-31 22:57   ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2023-05-31 22:57 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Wed, May 31, 2023 at 09:57:36AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> This and the next patches in this series aim to make the
> latency of synchronize_srcu() invisible to the user.
> The patch splits unregister_shrinker() into two functions:
> 
> 	unregister_shrinker_delayed_initiate()
> 	unregister_shrinker_delayed_finalize()
> 
> and shrinker users may have the second of them called
> asynchronously (e.g., from a workqueue). The next patches make
> the superblock shrinker follow this way, so user-visible
> umount() time won't contain delays from synchronize_srcu().
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  include/linux/shrinker.h |  2 ++
>  mm/vmscan.c              | 22 ++++++++++++++++++----
>  2 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 224293b2dd06..e9d5a19d83fe 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -102,6 +102,8 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
>  extern int __printf(2, 3) register_shrinker(struct shrinker *shrinker,
>  					    const char *fmt, ...);
>  extern void unregister_shrinker(struct shrinker *shrinker);
> +extern void unregister_shrinker_delayed_initiate(struct shrinker *shrinker);
> +extern void unregister_shrinker_delayed_finalize(struct shrinker *shrinker);
>  extern void free_prealloced_shrinker(struct shrinker *shrinker);
>  extern void synchronize_shrinkers(void);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a773e97e152e..baf8d2327d70 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -799,10 +799,7 @@ int register_shrinker(struct shrinker *shrinker, const char *fmt, ...)
>  #endif
>  EXPORT_SYMBOL(register_shrinker);
>  
> -/*
> - * Remove one
> - */
> -void unregister_shrinker(struct shrinker *shrinker)
> +void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
>  {
>  	struct dentry *debugfs_entry;
>  	int debugfs_id;
> @@ -819,6 +816,13 @@ void unregister_shrinker(struct shrinker *shrinker)
>  	mutex_unlock(&shrinker_mutex);
>  
>  	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
> +}
> +EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
> +
> +void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
> +{
> +	if (!shrinker->nr_deferred)
> +		return;

This is new logic and isn't explained anywhere: why do we want to
avoid RCU cleanup if (shrinker->nr_deferred == 0)? Regardless,
whatever this is avoiding, it needs a comment to explain it.
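[The split being questioned here can be modelled in a few lines of plain C. This is a userspace sketch, not the kernel code; one plausible reading of the !nr_deferred guard is that it makes the finalize step idempotent, which is exactly the behaviour Dave is asking to have documented.]

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Simplified userspace model of the proposed two-phase unregistration:
 * "initiate" unlinks the shrinker, "finalize" does the heavy
 * grace-period wait and frees resources. The !nr_deferred check makes
 * finalize a no-op for a shrinker that was never registered or was
 * already finalized.
 */
struct shrinker {
	long *nr_deferred;	/* deferred counts; NULL once freed */
	int registered;
};

static void shrinker_register(struct shrinker *s)
{
	s->nr_deferred = calloc(1, sizeof(long));
	s->registered = 1;
}

static void unregister_shrinker_delayed_initiate(struct shrinker *s)
{
	s->registered = 0;	/* unlink from the shrinker list */
}

static void unregister_shrinker_delayed_finalize(struct shrinker *s)
{
	if (!s->nr_deferred)	/* never registered, or already done */
		return;
	/* the real code would wait for the SRCU grace period here */
	free(s->nr_deferred);
	s->nr_deferred = NULL;
}
```

With this reading, calling finalize twice is harmless, but whether that is the intent is precisely what needs a comment in the patch.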

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work()
  2023-05-31  9:57 ` [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work() Qi Zheng
@ 2023-05-31 23:00   ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2023-05-31 23:00 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Wed, May 31, 2023 at 09:57:37AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> The patch makes s_dentry_lru and s_inode_lru be destroyed
> later, from the workqueue. This is preparation for splitting
> unregister_shrinker(super_block::s_shrink) into two stages,
> and for calling the finalize stage from destroy_super_work().
> 
> Note that generic filesystem shrinker unregistration
> is safe to split into two stages right after this
> patch, since super_cache_count() and super_cache_scan()
> deal with s_dentry_lru and s_inode_lru only.
> 
> But there are two exceptions: XFS and SHMEM, which
> define .nr_cached_objects() and .free_cached_objects()
> callbacks. These two do not allow us to do the splitting
> right after this patch. They touch fs-specific data,
> which is destroyed earlier than destroy_super_work().
> So we can't call unregister_shrinker_delayed_finalize()
> from destroy_super_work() because of them, and the next
> patches make preparations to make this possible.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  fs/super.c | 17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 8d8d68799b34..2ce4c72720f3 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -159,6 +159,11 @@ static void destroy_super_work(struct work_struct *work)
>  							destroy_work);
>  	int i;
>  
> +	WARN_ON(list_lru_count(&s->s_dentry_lru));
> +	WARN_ON(list_lru_count(&s->s_inode_lru));
> +	list_lru_destroy(&s->s_dentry_lru);
> +	list_lru_destroy(&s->s_inode_lru);
> +
>  	for (i = 0; i < SB_FREEZE_LEVELS; i++)
>  		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
>  	kfree(s);
> @@ -177,8 +182,6 @@ static void destroy_unused_super(struct super_block *s)
>  	if (!s)
>  		return;
>  	up_write(&s->s_umount);
> -	list_lru_destroy(&s->s_dentry_lru);
> -	list_lru_destroy(&s->s_inode_lru);
>  	security_sb_free(s);
>  	put_user_ns(s->s_user_ns);
>  	kfree(s->s_subtype);
> @@ -287,8 +290,6 @@ static void __put_super(struct super_block *s)
>  {
>  	if (!--s->s_count) {
>  		list_del_init(&s->s_list);
> -		WARN_ON(s->s_dentry_lru.node);
> -		WARN_ON(s->s_inode_lru.node);

Why are you removing the warnings from here? Regardless of where
we tear down the lru lists, they *must* be empty here, otherwise we
have a memory leak. Hence I don't think these warnings should be
moved at all.
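[The invariant being defended is easy to state as a toy model: destroying a non-empty LRU leaks whatever is still on it, so the emptiness check belongs at the teardown point. Illustrative userspace C, not the kernel's list_lru:]

```c
#include <assert.h>

/* Toy stand-in for a list_lru: only the item count matters here. */
struct toy_lru {
	unsigned long nr_items;
};

/*
 * Model of the WARN_ON + destroy pattern: report (return nonzero)
 * if objects would be leaked, then tear the list down anyway.
 */
static int lru_destroy_check(struct toy_lru *lru)
{
	int leaked = lru->nr_items != 0;

	lru->nr_items = 0;	/* destroy */
	return leaked;
}
```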

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan()
  2023-05-31  9:57 ` [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan() Qi Zheng
@ 2023-05-31 23:12   ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2023-05-31 23:12 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Wed, May 31, 2023 at 09:57:38AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> This patch prepares the superblock shrinker for delayed unregistering.
> It makes super_cache_scan() avoid shrinking inactive superblocks.
> SB_ACTIVE is used as the indicator. If a superblock is not
> active, super_cache_scan() just exits with SHRINK_STOP as the result.
> 
> Note that SB_ACTIVE is cleared in generic_shutdown_super(), and this
> is done under the write lock of s_umount. Function super_cache_scan()
> also takes the read lock of s_umount, so it can't miss the cleared flag.
> 
> The SB_BORN check is added to super_cache_scan() just for uniformity
> with super_cache_count(), while super_cache_count() received the SB_ACTIVE
> check just for uniformity with super_cache_scan().
> 
> After this patch super_cache_scan() ignores superblocks being
> unregistered, so this function is OK with splitting unregister_shrinker().
> The next patches prepare super_cache_count() to follow this way.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  fs/super.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 2ce4c72720f3..2ce54561e82e 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -79,6 +79,11 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	if (!trylock_super(sb))
>  		return SHRINK_STOP;
>  
> +	if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE)) {
> +		freed = SHRINK_STOP;
> +		goto unlock;
> +	}

This should not be here - the check to determine if the shrinker
should run is done in the ->count method. If we removed the SB_ACTIVE
flag between ->count and ->scan, then the superblock should be
locked and the trylock_super() call above should fail....

Indeed, the unregister_shrinker() call in deactivate_locked_super()
is done with the sb->s_umount held exclusively, and this happens
before we clear SB_ACTIVE in the ->kill_sb() -> kill_block_super()
-> generic_shutdown_super() path after the shrinker is unregistered.

Hence we can't get to this check without SB_ACTIVE being set - the
trylock will fail and then the shrinker_unregister() call will do
it's thing to ensure the shrinker is never called again.

If the change to the shrinker code allows the shrinker to still be
actively running when we call ->kill_sb(), then that needs to be
fixed. The superblock shrinker must be stopped completely and never
run again before we call ->kill_sb().
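[For reference, the flag test the patch adds is a plain mask comparison: both bits must be set for the scan to proceed. A minimal userspace sketch; the flag values mirror the kernel's but are assumptions here:]

```c
#include <assert.h>

/* Illustrative values patterned on the kernel's SB_* flags. */
#define SB_BORN		(1UL << 29)
#define SB_ACTIVE	(1UL << 30)

/*
 * The check from the patch under discussion: shrink only when the
 * superblock is both born and active. A single set bit is not enough.
 */
static int sb_should_shrink(unsigned long s_flags)
{
	return (s_flags & (SB_BORN | SB_ACTIVE)) == (SB_BORN | SB_ACTIVE);
}
```

Dave's point stands independently of the mask logic: if lock ordering already guarantees the shrinker cannot run here, the ->scan side of this check is redundant.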

>  	if (sb->s_op->nr_cached_objects)
>  		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
>  
> @@ -110,6 +115,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  		freed += sb->s_op->free_cached_objects(sb, sc);
>  	}
>  
> +unlock:
>  	up_read(&sb->s_umount);
>  	return freed;
>  }
> @@ -136,7 +142,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  	 * avoid this situation, so do the same here. The memory barrier is
>  	 * matched with the one in mount_fs() as we don't hold locks here.
>  	 */
> -	if (!(sb->s_flags & SB_BORN))
> +	if ((sb->s_flags & (SB_BORN|SB_ACTIVE)) != (SB_BORN|SB_ACTIVE))
>  		return 0;

This is fine because it's an unlocked check, but I don't think it's
actually necessary given the above. Indeed, if you are adding this,
you need to expand the comment above on why SB_ACTIVE needs
checking, and why the memory barrier doesn't actually apply to that
part of the check....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-05-31  9:57 ` [PATCH 6/8] xfs: introduce xfs_fs_destroy_super() Qi Zheng
@ 2023-05-31 23:48   ` Dave Chinner
  2023-06-01  8:43     ` Qi Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2023-05-31 23:48 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> xfs_fs_nr_cached_objects() touches sb->s_fs_info,
> and this patch makes it be destroyed later.
> 
> After this patch xfs_fs_nr_cached_objects() is safe
> for splitting unregister_shrinker(): mp->m_perag_tree
> is stable till destroy_super_work(), while iteration
> over it is already RCU-protected by internal XFS
> business.
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 7e706255f165..694616524c76 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
>  }
>  
>  static void
> -xfs_mount_free(
> +xfs_free_names(
>  	struct xfs_mount	*mp)
>  {
>  	kfree(mp->m_rtname);
>  	kfree(mp->m_logname);
> +}
> +
> +static void
> +xfs_mount_free(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_free_names(mp);
>  	kmem_free(mp);
>  }
>  
> @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
>  	xfs_destroy_mount_workqueues(mp);
>  	xfs_close_devices(mp);
>  
> -	sb->s_fs_info = NULL;
> -	xfs_mount_free(mp);
> +	xfs_free_names(mp);
> +}
> +
> +static void
> +xfs_fs_destroy_super(
> +	struct super_block	*sb)
> +{
> +	if (sb->s_fs_info) {
> +		struct xfs_mount	*mp = XFS_M(sb);
> +
> +		kmem_free(mp);
> +		sb->s_fs_info = NULL;
> +	}
>  }
>  
>  static long
> @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
>  	.dirty_inode		= xfs_fs_dirty_inode,
>  	.drop_inode		= xfs_fs_drop_inode,
>  	.put_super		= xfs_fs_put_super,
> +	.destroy_super		= xfs_fs_destroy_super,
>  	.sync_fs		= xfs_fs_sync_fs,
>  	.freeze_fs		= xfs_fs_freeze,
>  	.unfreeze_fs		= xfs_fs_unfreeze,

I don't really like this ->destroy_super() callback, especially as
it's completely undocumented as to why it exists. This is purely a
work-around for handling extended filesystem superblock shrinker
functionality, yet there's nothing that tells the reader this.

It also seems to imply that the superblock shrinker can continue to
run after the existing unregister_shrinker() call before ->kill_sb()
is called. This violates the assumption made in filesystems that the
superblock shrinkers have been stopped and will never run again
before ->kill_sb() is called. Hence ->kill_sb() implementations
assume there is nothing else accessing filesystem owned structures
and it can tear down internal structures safely.

Realistically, the days of XFS using this superblock shrinker
extension are numbered. We've got a lot of the infrastructure we
need in place to get rid of the background inode reclaim
infrastructure that requires this shrinker extension, and it's on my
list of things that need to be addressed in the near future. 

In fact, now that I look at it, I think the shmem usage of this
superblock shrinker interface is broken - it returns SHRINK_STOP
from ->free_cached_objects(), but the only valid return value is the
number of objects freed (i.e. 0 is nothing freed). These special
superblock extension interfaces do not work like a normal
shrinker....
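[The breakage is mechanical: SHRINK_STOP is ~0UL, and super_cache_scan() simply adds the ->free_cached_objects() return value into its running freed count, so returning SHRINK_STOP there wraps the counter instead of stopping anything. A minimal userspace model of that accumulation:]

```c
#include <assert.h>

/* SHRINK_STOP as defined in the kernel's shrinker interface. */
#define SHRINK_STOP	(~0UL)

/*
 * What super_cache_scan() effectively does with the filesystem
 * callback's return value: unconditional accumulation. There is no
 * "stop" interpretation on this path.
 */
static unsigned long account_freed(unsigned long freed,
				   unsigned long fs_ret)
{
	return freed + fs_ret;
}
```

Adding ~0UL is the same as subtracting 1 in unsigned arithmetic, so a SHRINK_STOP "return" silently corrupts the freed count.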

Hence I think the shmem usage should be replaced with a separate
internal shmem shrinker that is managed by the filesystem itself
(similar to how XFS has multiple internal shrinkers).

At this point, then the only user of this interface is (again) XFS.
Given this, adding new VFS methods for a single filesystem
for functionality that is planned to be removed is probably not the
best approach to solving the problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-05-31 23:48   ` Dave Chinner
@ 2023-06-01  8:43     ` Qi Zheng
  2023-06-01  9:58       ` Christian Brauner
  2023-06-01 23:06       ` Dave Chinner
  0 siblings, 2 replies; 25+ messages in thread
From: Qi Zheng @ 2023-06-01  8:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

Hi Dave,

On 2023/6/1 07:48, Dave Chinner wrote:
> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>> From: Kirill Tkhai <tkhai@ya.ru>
>>
>> xfs_fs_nr_cached_objects() touches sb->s_fs_info,
>> and this patch makes it be destroyed later.
>>
>> After this patch xfs_fs_nr_cached_objects() is safe
>> for splitting unregister_shrinker(): mp->m_perag_tree
>> is stable till destroy_super_work(), while iteration
>> over it is already RCU-protected by internal XFS
>> business.
>>
>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
>>   1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 7e706255f165..694616524c76 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
>>   }
>>   
>>   static void
>> -xfs_mount_free(
>> +xfs_free_names(
>>   	struct xfs_mount	*mp)
>>   {
>>   	kfree(mp->m_rtname);
>>   	kfree(mp->m_logname);
>> +}
>> +
>> +static void
>> +xfs_mount_free(
>> +	struct xfs_mount	*mp)
>> +{
>> +	xfs_free_names(mp);
>>   	kmem_free(mp);
>>   }
>>   
>> @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
>>   	xfs_destroy_mount_workqueues(mp);
>>   	xfs_close_devices(mp);
>>   
>> -	sb->s_fs_info = NULL;
>> -	xfs_mount_free(mp);
>> +	xfs_free_names(mp);
>> +}
>> +
>> +static void
>> +xfs_fs_destroy_super(
>> +	struct super_block	*sb)
>> +{
>> +	if (sb->s_fs_info) {
>> +		struct xfs_mount	*mp = XFS_M(sb);
>> +
>> +		kmem_free(mp);
>> +		sb->s_fs_info = NULL;
>> +	}
>>   }
>>   
>>   static long
>> @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
>>   	.dirty_inode		= xfs_fs_dirty_inode,
>>   	.drop_inode		= xfs_fs_drop_inode,
>>   	.put_super		= xfs_fs_put_super,
>> +	.destroy_super		= xfs_fs_destroy_super,
>>   	.sync_fs		= xfs_fs_sync_fs,
>>   	.freeze_fs		= xfs_fs_freeze,
>>   	.unfreeze_fs		= xfs_fs_unfreeze,
> 
> I don't really like this ->destroy_super() callback, especially as
> it's completely undocumented as to why it exists. This is purely a
> work-around for handling extended filesystem superblock shrinker
> functionality, yet there's nothing that tells the reader this.
> 
> It also seems to imply that the superblock shrinker can continue to
> run after the existing unregister_shrinker() call before ->kill_sb()
> is called. This violates the assumption made in filesystems that the
> superblock shrinkers have been stopped and will never run again
> before ->kill_sb() is called. Hence ->kill_sb() implementations
> assume there is nothing else accessing filesystem owned structures
> and it can tear down internal structures safely.
> 
> Realistically, the days of XFS using this superblock shrinker
> extension are numbered. We've got a lot of the infrastructure we
> need in place to get rid of the background inode reclaim
> infrastructure that requires this shrinker extension, and it's on my
> list of things that need to be addressed in the near future.
> 
> In fact, now that I look at it, I think the shmem usage of this
> superblock shrinker interface is broken - it returns SHRINK_STOP to
> ->free_cached_objects(), but the only valid return value is the
> number of objects freed (i.e. 0 is nothing freed). These special
> superblock extension interfaces do not work like a normal
> shrinker....
> 
> Hence I think the shmem usage should be replaced with a separate
> internal shmem shrinker that is managed by the filesystem itself
> (similar to how XFS has multiple internal shrinkers).
> 
> At this point, then the only user of this interface is (again) XFS.
> Given this, adding new VFS methods for a single filesystem
> for functionality that is planned to be removed is probably not the
> best approach to solving the problem.

Thanks for such a detailed analysis. Kirill Tkhai just proposed a
new method[1]; I cc'd you on the email.

[1]. 
https://lore.kernel.org/lkml/bab60fe4-964c-43a6-ecce-4cbd4981d875@ya.ru/

Thanks,
Qi

> 
> Cheers,
> 
> Dave.



* Re: [PATCH 0/8] make unregistration of super_block shrinker more faster
       [not found] ` <20230531114054.bf077db642aa9c58c0831687@linux-foundation.org>
@ 2023-06-01  8:46   ` Qi Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2023-06-01  8:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tkhai, roman.gushchin, vbabka, viro, brauner, djwong, hughd,
	paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng



On 2023/6/1 02:40, Andrew Morton wrote:
> On Wed, 31 May 2023 09:57:34 +0000 Qi Zheng <qi.zheng@linux.dev> wrote:
> 
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> Hi all,
>>
>> This patch series aims to make unregistration of the super_block shrinker
>> faster.
>>
>> 1. Background
>> =============
>>
>> The kernel test robot noticed a -88.8% regression of stress-ng.ramfs.ops_per_sec
>> on commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). More
>> details can be seen from the link[1] below.
>>
>> [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
>>
>> We can just use the following command to reproduce the result:
>>
>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>
>> 1) before commit f95bdb700bc6b:
>>
>> stress-ng: info:  [11023] dispatching hogs: 9 ramfs
>> stress-ng: info:  [11023] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
>> stress-ng: info:  [11023]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
>> stress-ng: info:  [11023] ramfs            774966     60.00     10.18    169.45     12915.89        4314.26
>> stress-ng: info:  [11023] for a 60.00s run time:
>> stress-ng: info:  [11023]    1920.11s available CPU time
>> stress-ng: info:  [11023]      10.18s user time   (  0.53%)
>> stress-ng: info:  [11023]     169.44s system time (  8.82%)
>> stress-ng: info:  [11023]     179.62s total time  (  9.35%)
>> stress-ng: info:  [11023] load average: 8.99 2.69 0.93
>> stress-ng: info:  [11023] successful run completed in 60.00s (1 min, 0.00 secs)
>>
>> 2) after commit f95bdb700bc6b:
>>
>> stress-ng: info:  [37676] dispatching hogs: 9 ramfs
>> stress-ng: info:  [37676] stressor       bogo ops real time  usrtime  sys time   bogo ops/s     bogo ops/s
>> stress-ng: info:  [37676]                           (secs)    (secs)   (secs)   (real time) (usr+sys time)
>> stress-ng: info:  [37676] ramfs            168673     60.00     1.61    39.66      2811.08        4087.47
>> stress-ng: info:  [37676] for a 60.10s run time:
>> stress-ng: info:  [37676]    1923.36s available CPU time
>> stress-ng: info:  [37676]       1.60s user time   (  0.08%)
>> stress-ng: info:  [37676]      39.66s system time (  2.06%)
>> stress-ng: info:  [37676]      41.26s total time  (  2.15%)
>> stress-ng: info:  [37676] load average: 7.69 3.63 2.36
>> stress-ng: info:  [37676] successful run completed in 60.10s (1 min, 0.10 secs)
> 
> Is this comparison reversed?  It appears to demonstrate that
> f95bdb700bc6b made the operation faster.

Maybe not. IIUC, for bogo ops/s (real time), bigger is better.
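[For what it's worth, the regression implied by the two quoted runs can be checked directly from the real-time bogo ops/s figures above; a quick sketch (the robot's -88.8% figure came from its own, different configuration):]

```c
#include <assert.h>

/*
 * Percentage drop from "before" to "after": with the quoted values
 * 12915.89 -> 2811.08 this works out to roughly a 78% regression in
 * this local reproduction.
 */
static double regression_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}
```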

Thanks,
Qi

> 



* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-06-01  8:43     ` Qi Zheng
@ 2023-06-01  9:58       ` Christian Brauner
  2023-06-01 23:06       ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Christian Brauner @ 2023-06-01  9:58 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Dave Chinner, akpm, tkhai, roman.gushchin, vbabka, viro, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> Hi Dave,
> 
> On 2023/6/1 07:48, Dave Chinner wrote:
> > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > From: Kirill Tkhai <tkhai@ya.ru>
> > > 
> > > xfs_fs_nr_cached_objects() touches sb->s_fs_info,
> > > and this patch makes it be destroyed later.
> > > 
> > > After this patch xfs_fs_nr_cached_objects() is safe
> > > for splitting unregister_shrinker(): mp->m_perag_tree
> > > is stable till destroy_super_work(), while iteration
> > > over it is already RCU-protected by internal XFS
> > > business.
> > > 
> > > Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > ---
> > >   fs/xfs/xfs_super.c | 25 ++++++++++++++++++++++---
> > >   1 file changed, 22 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 7e706255f165..694616524c76 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -743,11 +743,18 @@ xfs_fs_drop_inode(
> > >   }
> > >   static void
> > > -xfs_mount_free(
> > > +xfs_free_names(
> > >   	struct xfs_mount	*mp)
> > >   {
> > >   	kfree(mp->m_rtname);
> > >   	kfree(mp->m_logname);
> > > +}
> > > +
> > > +static void
> > > +xfs_mount_free(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	xfs_free_names(mp);
> > >   	kmem_free(mp);
> > >   }
> > > @@ -1136,8 +1143,19 @@ xfs_fs_put_super(
> > >   	xfs_destroy_mount_workqueues(mp);
> > >   	xfs_close_devices(mp);
> > > -	sb->s_fs_info = NULL;
> > > -	xfs_mount_free(mp);
> > > +	xfs_free_names(mp);
> > > +}
> > > +
> > > +static void
> > > +xfs_fs_destroy_super(
> > > +	struct super_block	*sb)
> > > +{
> > > +	if (sb->s_fs_info) {
> > > +		struct xfs_mount	*mp = XFS_M(sb);
> > > +
> > > +		kmem_free(mp);
> > > +		sb->s_fs_info = NULL;
> > > +	}
> > >   }
> > >   static long
> > > @@ -1165,6 +1183,7 @@ static const struct super_operations xfs_super_operations = {
> > >   	.dirty_inode		= xfs_fs_dirty_inode,
> > >   	.drop_inode		= xfs_fs_drop_inode,
> > >   	.put_super		= xfs_fs_put_super,
> > > +	.destroy_super		= xfs_fs_destroy_super,
> > >   	.sync_fs		= xfs_fs_sync_fs,
> > >   	.freeze_fs		= xfs_fs_freeze,
> > >   	.unfreeze_fs		= xfs_fs_unfreeze,
> > 
> > I don't really like this ->destroy_super() callback, especially as
> > it's completely undocumented as to why it exists. This is purely a
> > work-around for handling extended filesystem superblock shrinker
> > functionality, yet there's nothing that tells the reader this.
> > 
> > It also seems to imply that the superblock shrinker can continue to
> > run after the existing unregister_shrinker() call before ->kill_sb()
> > is called. This violates the assumption made in filesystems that the
> > superblock shrinkers have been stopped and will never run again
> > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > assume there is nothing else accessing filesystem owned structures
> > and it can tear down internal structures safely.
> > 
> > Realistically, the days of XFS using this superblock shrinker
> > extension are numbered. We've got a lot of the infrastructure we
> > need in place to get rid of the background inode reclaim
> > infrastructure that requires this shrinker extension, and it's on my
> > list of things that need to be addressed in the near future.
> > 
> > In fact, now that I look at it, I think the shmem usage of this
> > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > ->free_cached_objects(), but the only valid return value is the
> > number of objects freed (i.e. 0 is nothing freed). These special
> > superblock extension interfaces do not work like a normal
> > shrinker....
> > 
> > Hence I think the shmem usage should be replaced with a separate
> > internal shmem shrinker that is managed by the filesystem itself
> > (similar to how XFS has multiple internal shrinkers).
> > 
> > At this point, then the only user of this interface is (again) XFS.
> > Given this, adding new VFS methods for a single filesystem
> > for functionality that is planned to be removed is probably not the
> > best approach to solving the problem.
> 
> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> new method[1]; I cc'd you on the email.
> 
> [1].
> https://lore.kernel.org/lkml/bab60fe4-964c-43a6-ecce-4cbd4981d875@ya.ru/

As long as we agree that we're not adding a new super operation that
sounds like a better way forward.


* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-06-01  8:43     ` Qi Zheng
  2023-06-01  9:58       ` Christian Brauner
@ 2023-06-01 23:06       ` Dave Chinner
  2023-06-02  3:13         ` Qi Zheng
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2023-06-01 23:06 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> Hi Dave,
> On 2023/6/1 07:48, Dave Chinner wrote:
> > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > From: Kirill Tkhai <tkhai@ya.ru>
> > I don't really like this ->destroy_super() callback, especially as
> > it's completely undocumented as to why it exists. This is purely a
> > work-around for handling extended filesystem superblock shrinker
> > functionality, yet there's nothing that tells the reader this.
> > 
> > It also seems to imply that the superblock shrinker can continue to
> > run after the existing unregister_shrinker() call before ->kill_sb()
> > is called. This violates the assumption made in filesystems that the
> > superblock shrinkers have been stopped and will never run again
> > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > assume there is nothing else accessing filesystem owned structures
> > and it can tear down internal structures safely.
> > 
> > Realistically, the days of XFS using this superblock shrinker
> > extension are numbered. We've got a lot of the infrastructure we
> > need in place to get rid of the background inode reclaim
> > infrastructure that requires this shrinker extension, and it's on my
> > list of things that need to be addressed in the near future.
> > 
> > In fact, now that I look at it, I think the shmem usage of this
> > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > ->free_cached_objects(), but the only valid return value is the
> > number of objects freed (i.e. 0 is nothing freed). These special
> > superblock extension interfaces do not work like a normal
> > shrinker....
> > 
> > Hence I think the shmem usage should be replaced with a separate
> > internal shmem shrinker that is managed by the filesystem itself
> > (similar to how XFS has multiple internal shrinkers).
> > 
> > At this point, then the only user of this interface is (again) XFS.
> > Given this, adding new VFS methods for a single filesystem
> > for functionality that is planned to be removed is probably not the
> > best approach to solving the problem.
> 
> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> new method[1]; I cc'd you on the email.

I've just read through that thread, and I've looked at the original
patch that caused the regression.

I'm a bit annoyed right now. Nobody cc'd me on the original patches
nor were any of the subsystems that use shrinkers cc'd on the
patches that changed shrinker behaviour. I only find out about this
because someone tries to fix something they broke by *breaking more
stuff* and not even realising how broken what they are proposing is.

The previous code was not broken and it provided specific guarantees
to subsystems via unregister_shrinker(). From the above discussion,
it appears that the original authors of these changes either did not
know about or did not understand them, so that casts doubt in my
mind about the attempted solution and all the proposed fixes for it.

I don't have the time right now to unravel this mess and fully
understand the original problem, changes or the band-aids that are
being thrown around. We are also getting quite late in the cycle to
be doing major surgery to critical infrastructure, especially as it
gives so little time to review and regression-test whatever new solution
is proposed.

Given this appears to be a change introduced in 6.4-rc1, I think the
right thing to do is to revert the change rather than make things
worse by trying to shove some "quick fix" into the kernel to address
it.

Andrew, could you please sort out a series to revert this shrinker
infrastructure change and all the dependent hacks that have been
added to try to fix it so far?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-06-01 23:06       ` Dave Chinner
@ 2023-06-02  3:13         ` Qi Zheng
  2023-06-02 15:15           ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Qi Zheng @ 2023-06-02  3:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: akpm, tkhai, roman.gushchin, vbabka, viro, brauner, djwong,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

Hi Dave,

On 2023/6/2 07:06, Dave Chinner wrote:
> On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
>> Hi Dave,
>> On 2023/6/1 07:48, Dave Chinner wrote:
>>> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>> I don't really like this ->destroy_super() callback, especially as
>>> it's completely undocumented as to why it exists. This is purely a
>>> work-around for handling extended filesystem superblock shrinker
>>> functionality, yet there's nothing that tells the reader this.
>>>
>>> It also seems to imply that the superblock shrinker can continue to
>>> run after the existing unregister_shrinker() call before ->kill_sb()
>>> is called. This violates the assumption made in filesystems that the
>>> superblock shrinkers have been stopped and will never run again
>>> before ->kill_sb() is called. Hence ->kill_sb() implementations
>>> assume there is nothing else accessing filesystem owned structures
>>> and it can tear down internal structures safely.
>>>
>>> Realistically, the days of XFS using this superblock shrinker
>>> extension are numbered. We've got a lot of the infrastructure we
>>> need in place to get rid of the background inode reclaim
>>> infrastructure that requires this shrinker extension, and it's on my
>>> list of things that need to be addressed in the near future.
>>>
>>> In fact, now that I look at it, I think the shmem usage of this
>>> superblock shrinker interface is broken - it returns SHRINK_STOP to
>>> ->free_cached_objects(), but the only valid return value is the
>>> number of objects freed (i.e. 0 is nothing freed). These special
>>> superblock extension interfaces do not work like a normal
>>> shrinker....
>>>
>>> Hence I think the shmem usage should be replaced with a separate
>>> internal shmem shrinker that is managed by the filesystem itself
>>> (similar to how XFS has multiple internal shrinkers).
>>>
>>> At this point, then the only user of this interface is (again) XFS.
>>> Given this, adding new VFS methods for a single filesystem
>>> for functionality that is planned to be removed is probably not the
>>> best approach to solving the problem.
>>
>> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
>> new method[1]; I cc'd you on the email.
> 
> I've just read through that thread, and I've looked at the original
> patch that caused the regression.
> 
> I'm a bit annoyed right now. Nobody cc'd me on the original patches
> nor were any of the subsystems that use shrinkers cc'd on the
> patches that changed shrinker behaviour. I only find out about this

Sorry about that, my mistake. I just followed the output of
scripts/get_maintainer.pl.

> because someone tries to fix something they broke by *breaking more
> stuff* and not even realising how broken what they are proposing is.

Yes, this slows down the speed of umount. But the benefit is that
slab shrink becomes lockless, the mount operation and slab shrink no
longer affect each other, and the IPC no longer drops significantly,
etc.

And I used bpftrace to measure the time consumption of
unregister_shrinker():

```
And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
8260 CPU @ 2.40GHz) and the results are as follows:

1) use synchronize_srcu():

@ns[umount]:
[8K, 16K)             83 |@@@@@@@       |
[16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)            78 |@@@@@@@       |
[64K, 128K)            6 |       |
[128K, 256K)           7 |       |
[256K, 512K)          29 |@@       |
[512K, 1M)            51 |@@@@      |
[1M, 2M)              90 |@@@@@@@@       |
[2M, 4M)              70 |@@@@@@      |
[4M, 8M)               8 |      |

2) use synchronize_srcu_expedited():

@ns[umount]:
[8K, 16K)             31 |@@       |
[16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)           158 |@@@@@@@@@@       |
[64K, 128K)            4 |       |
[128K, 256K)           2 |       |
[256K, 512K)           2 |       |
```

With synchronize_srcu(), most of the time consumption is between 16us
and 32us, with the worst case between 4ms and 8ms. Is this totally
unacceptable?

This performance regression report comes from a stress test. Will the
umount action be executed so frequently under real workloads?

If these numbers are really unacceptable, then after applying the newly
proposed method, umount will be as fast as before (or even faster).

Thanks,
Qi

> 
> The previous code was not broken and it provided specific guarantees
> to subsystems via unregister_shrinker(). From the above discussion,
> it appears that the original authors of these changes either did not
> know about or did not understand them, so that casts doubt in my
> mind about the attempted solution and all the proposed fixes for it.
> 
> I don't have the time right now to unravel this mess and fully
> understand the original problem, changes or the band-aids that are
> being thrown around. We are also getting quite late in the cycle to
> be doing major surgery to critical infrastructure, especially as it
> gives so little time to review and regression-test whatever new solution
> is proposed.
> 
> Given this appears to be a change introduced in 6.4-rc1, I think the
> right thing to do is to revert the change rather than make things
> worse by trying to shove some "quick fix" into the kernel to address
> it.
> 
> Andrew, could you please sort out a series to revert this shrinker
> infrastructure change and all the dependent hacks that have been
> added to try to fix it so far?
> 
> -Dave.

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-06-02  3:13         ` Qi Zheng
@ 2023-06-02 15:15           ` Darrick J. Wong
  2023-06-05 11:50             ` Christian Brauner
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2023-06-02 15:15 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Dave Chinner, akpm, tkhai, roman.gushchin, vbabka, viro, brauner,
	hughd, paulmck, muchun.song, linux-mm, linux-fsdevel, linux-xfs,
	linux-kernel, Qi Zheng

On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
> Hi Dave,
> 
> On 2023/6/2 07:06, Dave Chinner wrote:
> > On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> > > Hi Dave,
> > > On 2023/6/1 07:48, Dave Chinner wrote:
> > > > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > > > From: Kirill Tkhai <tkhai@ya.ru>
> > > > I don't really like this ->destroy_super() callback, especially as
> > > > it's completely undocumented as to why it exists. This is purely a
> > > > work-around for handling extended filesystem superblock shrinker
> > > > functionality, yet there's nothing that tells the reader this.
> > > > 
> > > > It also seems to imply that the superblock shrinker can continue to
> > > > run after the existing unregister_shrinker() call before ->kill_sb()
> > > > is called. This violates the assumption made in filesystems that the
> > > > superblock shrinkers have been stopped and will never run again
> > > > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > > > assume there is nothing else accessing filesystem owned structures
> > > > and it can tear down internal structures safely.
> > > > 
> > > > Realistically, the days of XFS using this superblock shrinker
> > > > extension are numbered. We've got a lot of the infrastructure we
> > > > need in place to get rid of the background inode reclaim
> > > > infrastructure that requires this shrinker extension, and it's on my
> > > > list of things that need to be addressed in the near future.
> > > > 
> > > > In fact, now that I look at it, I think the shmem usage of this
> > > > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > > > ->free_cached_objects(), but the only valid return value is the
> > > > number of objects freed (i.e. 0 is nothing freed). These special
> > > > superblock extension interfaces do not work like a normal
> > > > shrinker....
> > > > 
> > > > Hence I think the shmem usage should be replaced with a separate
> > > > internal shmem shrinker that is managed by the filesystem itself
> > > > (similar to how XFS has multiple internal shrinkers).
> > > > 
> > > > At this point, then the only user of this interface is (again) XFS.
> > > > Given this, adding new VFS methods for a single filesystem
> > > > for functionality that is planned to be removed is probably not the
> > > > best approach to solving the problem.
> > > 
> > > Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> > > new method[1]; I cc'd you on the email.
> > 
> > I've just read through that thread, and I've looked at the original
> > patch that caused the regression.
> > 
> > I'm a bit annoyed right now. Nobody cc'd me on the original patches
> > nor were any of the subsystems that use shrinkers cc'd on the
> > patches that changed shrinker behaviour. I only find out about this
> 
> Sorry about that, my mistake. I just followed the output of
> scripts/get_maintainer.pl.

Sometimes I wonder if people who contribute a lot to a subsystem should
be more aggressive about listing themselves explicitly in MAINTAINERS
but then I look at the ~600 emails that came in while I was on vacation
for 6 days over a long weekend and ... shut up. :P

> > because someone tries to fix something they broke by *breaking more
> > stuff* and not even realising how broken what they are proposing is.
> 
> Yes, this slows down the speed of umount. But the benefit is that
> slab shrink becomes lockless, the mount operation and slab shrink no
> longer affect each other, and the IPC no longer drops significantly,
> etc.

The lockless shrink seems like a good thing to have, but ... is it
really true that the superblock shrinker can still be running after
->kill_sb?  /That/ is surprising to me.

--D

> And I used bpftrace to measure the time consumption of
> unregister_shrinker():
> 
> ```
> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
> 8260 CPU @ 2.40GHz) and the results are as follows:
> 
> 1) use synchronize_srcu():
> 
> @ns[umount]:
> [8K, 16K)             83 |@@@@@@@       |
> [16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)            78 |@@@@@@@       |
> [64K, 128K)            6 |       |
> [128K, 256K)           7 |       |
> [256K, 512K)          29 |@@       |
> [512K, 1M)            51 |@@@@      |
> [1M, 2M)              90 |@@@@@@@@       |
> [2M, 4M)              70 |@@@@@@      |
> [4M, 8M)               8 |      |
> 
> 2) use synchronize_srcu_expedited():
> 
> @ns[umount]:
> [8K, 16K)             31 |@@       |
> [16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)           158 |@@@@@@@@@@       |
> [64K, 128K)            4 |       |
> [128K, 256K)           2 |       |
> [256K, 512K)           2 |       |
> ```
> 
> With synchronize_srcu(), most of the time consumption is between 16us and
> 32us, with the worst case between 4ms and 8ms. Is this totally
> unacceptable?
> 
> This performance regression report comes from a stress test. Will the
> umount action be executed so frequently under real workloads?
> 
> If these numbers are really unacceptable, then after applying the newly
> proposed method, umount will be as fast as before (or even faster).
> 
> Thanks,
> Qi
> 
> > 
> > The previous code was not broken and it provided specific guarantees
> > to subsystems via unregister_shrinker(). From the above discussion,
> > it appears that the original authors of these changes either did not
> > know about or did not understand them, so that casts doubt in my
> > mind about the attempted solution and all the proposed fixes for it.
> > 
> > I don't have the time right now to unravel this mess and fully
> > understand the original problem, changes or the band-aids that are
> > being thrown around. We are also getting quite late in the cycle to
> > be doing major surgery to critical infrastructure, especially as it
> > gives so little time to review and regression-test whatever new solution
> > is proposed.
> > 
> > Given this appears to be a change introduced in 6.4-rc1, I think the
> > right thing to do is to revert the change rather than make things
> > worse by trying to shove some "quick fix" into the kernel to address
> > it.
> > 
> > Andrew, could you please sort out a series to revert this shrinker
> > infrastructure change and all the dependent hacks that have been
> > added to try to fix it so far?
> > 
> > -Dave.
> 
> -- 
> Thanks,
> Qi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-06-02 15:15           ` Darrick J. Wong
@ 2023-06-05 11:50             ` Christian Brauner
  2023-06-05 12:16               ` Qi Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Brauner @ 2023-06-05 11:50 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Qi Zheng, Dave Chinner, akpm, tkhai, roman.gushchin, vbabka,
	viro, hughd, paulmck, muchun.song, linux-mm, linux-fsdevel,
	linux-xfs, linux-kernel, Qi Zheng

On Fri, Jun 02, 2023 at 08:15:32AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
> > Hi Dave,
> > 
> > On 2023/6/2 07:06, Dave Chinner wrote:
> > > On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
> > > > Hi Dave,
> > > > On 2023/6/1 07:48, Dave Chinner wrote:
> > > > > On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
> > > > > > From: Kirill Tkhai <tkhai@ya.ru>
> > > > > I don't really like this ->destroy_super() callback, especially as
> > > > > it's completely undocumented as to why it exists. This is purely a
> > > > > work-around for handling extended filesystem superblock shrinker
> > > > > functionality, yet there's nothing that tells the reader this.
> > > > > 
> > > > > It also seems to imply that the superblock shrinker can continue to
> > > > > run after the existing unregister_shrinker() call before ->kill_sb()
> > > > > is called. This violates the assumption made in filesystems that the
> > > > > superblock shrinkers have been stopped and will never run again
> > > > > before ->kill_sb() is called. Hence ->kill_sb() implementations
> > > > > assume there is nothing else accessing filesystem owned structures
> > > > > and it can tear down internal structures safely.
> > > > > 
> > > > > Realistically, the days of XFS using this superblock shrinker
> > > > > extension are numbered. We've got a lot of the infrastructure we
> > > > > need in place to get rid of the background inode reclaim
> > > > > infrastructure that requires this shrinker extension, and it's on my
> > > > > list of things that need to be addressed in the near future.
> > > > > 
> > > > > In fact, now that I look at it, I think the shmem usage of this
> > > > > superblock shrinker interface is broken - it returns SHRINK_STOP to
> > > > > ->free_cached_objects(), but the only valid return value is the
> > > > > number of objects freed (i.e. 0 is nothing freed). These special
> > > > > superblock extension interfaces do not work like a normal
> > > > > shrinker....
> > > > > 
> > > > > Hence I think the shmem usage should be replaced with a separate
> > > > > internal shmem shrinker that is managed by the filesystem itself
> > > > > (similar to how XFS has multiple internal shrinkers).
> > > > > 
> > > > > At this point, then the only user of this interface is (again) XFS.
> > > > > Given this, adding new VFS methods for a single filesystem
> > > > > for functionality that is planned to be removed is probably not the
> > > > > best approach to solving the problem.
> > > > 
> > > > Thanks for such a detailed analysis. Kirill Tkhai just proposed a
> > > > new method[1]; I cc'd you on the email.
> > > 
> > > I've just read through that thread, and I've looked at the original
> > > patch that caused the regression.
> > > 
> > > I'm a bit annoyed right now. Nobody cc'd me on the original patches
> > > nor were any of the subsystems that use shrinkers cc'd on the
> > > patches that changed shrinker behaviour. I only find out about this
> > 
> > Sorry about that, my mistake. I just followed the output of
> > scripts/get_maintainer.pl.
> 
> Sometimes I wonder if people who contribute a lot to a subsystem should
> be more aggressive about listing themselves explicitly in MAINTAINERS
> but then I look at the ~600 emails that came in while I was on vacation
> for 6 days over a long weekend and ... shut up. :P
> 
> > > because someone tries to fix something they broke by *breaking more
> > > stuff* and not even realising how broken what they are proposing is.
> > 
> > Yes, this slows down the speed of umount. But the benefit is that
> > slab shrink becomes lockless, the mount operation and slab shrink no
> > longer affect each other, and the IPC no longer drops significantly,
> > etc.
> 
> The lockless shrink seems like a good thing to have, but ... is it
> really true that the superblock shrinker can still be running after
> ->kill_sb?  /That/ is surprising to me.

So what's the plan here? If this causes issues for filesystems that rely
on specific guarantees that are broken by the patch, then it needs either
a clean fix or a revert.

> 
> --D
> 
> > And I used bpftrace to measure the time consumption of
> > unregister_shrinker():
> > 
> > ```
> > And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
> > 8260 CPU @ 2.40GHz) and the results are as follows:
> > 
> > 1) use synchronize_srcu():
> > 
> > @ns[umount]:
> > [8K, 16K)             83 |@@@@@@@       |
> > [16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)            78 |@@@@@@@       |
> > [64K, 128K)            6 |       |
> > [128K, 256K)           7 |       |
> > [256K, 512K)          29 |@@       |
> > [512K, 1M)            51 |@@@@      |
> > [1M, 2M)              90 |@@@@@@@@       |
> > [2M, 4M)              70 |@@@@@@      |
> > [4M, 8M)               8 |      |
> > 
> > 2) use synchronize_srcu_expedited():
> > 
> > @ns[umount]:
> > [8K, 16K)             31 |@@       |
> > [16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K)           158 |@@@@@@@@@@       |
> > [64K, 128K)            4 |       |
> > [128K, 256K)           2 |       |
> > [256K, 512K)           2 |       |
> > ```
> > 
> > With synchronize_srcu(), most of the time consumption is between 16us and
> > 32us, with the worst case between 4ms and 8ms. Is this totally
> > unacceptable?
> > 
> > This performance regression report comes from a stress test. Will the
> > umount action be executed so frequently under real workloads?
> > 
> > If these numbers are really unacceptable, then after applying the newly
> > proposed method, umount will be as fast as before (or even faster).
> > 
> > Thanks,
> > Qi
> > 
> > > 
> > > The previous code was not broken and it provided specific guarantees
> > > to subsystems via unregister_shrinker(). From the above discussion,
> > > it appears that the original authors of these changes either did not
> > > know about or did not understand them, so that casts doubt in my
> > > mind about the attempted solution and all the proposed fixes for it.
> > > 
> > > I don't have the time right now to unravel this mess and fully
> > > understand the original problem, changes or the band-aids that are
> > > being thrown around. We are also getting quite late in the cycle to
> > > be doing major surgery to critical infrastructure, especially as it
> > > gives so little time to review and regression-test whatever new solution
> > > is proposed.
> > > 
> > > Given this appears to be a change introduced in 6.4-rc1, I think the
> > > right thing to do is to revert the change rather than make things
> > > worse by trying to shove some "quick fix" into the kernel to address
> > > it.
> > > 
> > > Andrew, could you please sort out a series to revert this shrinker
> > > infrastructure change and all the dependent hacks that have been
> > > added to try to fix it so far?
> > > 
> > > -Dave.
> > 
> > -- 
> > Thanks,
> > Qi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/8] xfs: introduce xfs_fs_destroy_super()
  2023-06-05 11:50             ` Christian Brauner
@ 2023-06-05 12:16               ` Qi Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Qi Zheng @ 2023-06-05 12:16 UTC (permalink / raw)
  To: Christian Brauner, Darrick J. Wong, tkhai
  Cc: Dave Chinner, akpm, roman.gushchin, vbabka, viro, hughd, paulmck,
	muchun.song, linux-mm, linux-fsdevel, linux-xfs, linux-kernel,
	Qi Zheng



On 2023/6/5 19:50, Christian Brauner wrote:
> On Fri, Jun 02, 2023 at 08:15:32AM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 02, 2023 at 11:13:09AM +0800, Qi Zheng wrote:
>>> Hi Dave,
>>>
>>> On 2023/6/2 07:06, Dave Chinner wrote:
>>>> On Thu, Jun 01, 2023 at 04:43:32PM +0800, Qi Zheng wrote:
>>>>> Hi Dave,
>>>>> On 2023/6/1 07:48, Dave Chinner wrote:
>>>>>> On Wed, May 31, 2023 at 09:57:40AM +0000, Qi Zheng wrote:
>>>>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>>> I don't really like this ->destroy_super() callback, especially as
>>>>>> it's completely undocumented as to why it exists. This is purely a
>>>>>> work-around for handling extended filesystem superblock shrinker
>>>>>> functionality, yet there's nothing that tells the reader this.
>>>>>>
>>>>>> It also seems to imply that the superblock shrinker can continue to
>>>>>> run after the existing unregister_shrinker() call before ->kill_sb()
>>>>>> is called. This violates the assumption made in filesystems that the
>>>>>> superblock shrinkers have been stopped and will never run again
>>>>>> before ->kill_sb() is called. Hence ->kill_sb() implementations
>>>>>> assume there is nothing else accessing filesystem owned structures
>>>>>> and it can tear down internal structures safely.
>>>>>>
>>>>>> Realistically, the days of XFS using this superblock shrinker
>>>>>> extension are numbered. We've got a lot of the infrastructure we
>>>>>> need in place to get rid of the background inode reclaim
>>>>>> infrastructure that requires this shrinker extension, and it's on my
>>>>>> list of things that need to be addressed in the near future.
>>>>>>
>>>>>> In fact, now that I look at it, I think the shmem usage of this
>>>>>> superblock shrinker interface is broken - it returns SHRINK_STOP to
>>>>>> ->free_cached_objects(), but the only valid return value is the
>>>>>> number of objects freed (i.e. 0 is nothing freed). These special
>>>>>> superblock extension interfaces do not work like a normal
>>>>>> shrinker....
>>>>>>
>>>>>> Hence I think the shmem usage should be replaced with a separate
>>>>>> internal shmem shrinker that is managed by the filesystem itself
>>>>>> (similar to how XFS has multiple internal shrinkers).
>>>>>>
>>>>>> At this point, then the only user of this interface is (again) XFS.
>>>>>> Given this, adding new VFS methods for a single filesystem
>>>>>> for functionality that is planned to be removed is probably not the
>>>>>> best approach to solving the problem.
>>>>>
>>>>> Thanks for such a detailed analysis. Kirill Tkhai just proposed a
>>>>> new method[1]; I cc'd you on the email.
>>>>
>>>> I've just read through that thread, and I've looked at the original
>>>> patch that caused the regression.
>>>>
>>>> I'm a bit annoyed right now. Nobody cc'd me on the original patches
>>>> nor were any of the subsystems that use shrinkers cc'd on the
>>>> patches that changed shrinker behaviour. I only find out about this
>>>
>>> Sorry about that, my mistake. I just followed the output of
>>> scripts/get_maintainer.pl.
>>
>> Sometimes I wonder if people who contribute a lot to a subsystem should
>> be more aggressive about listing themselves explicitly in MAINTAINERS
>> but then I look at the ~600 emails that came in while I was on vacation
>> for 6 days over a long weekend and ... shut up. :P
>>
>>>> because someone tries to fix something they broke by *breaking more
>>>> stuff* and not even realising how broken what they are proposing is.
>>>
>>> Yes, this slows down the speed of umount. But the benefit is that
>>> slab shrink becomes lockless, the mount operation and slab shrink no
>>> longer affect each other, and the IPC no longer drops significantly,
>>> etc.
>>
>> The lockless shrink seems like a good thing to have, but ... is it
>> really true that the superblock shrinker can still be running after
>> ->kill_sb?  /That/ is surprising to me.
> 
> So what's the plan here? If this causes issues for filesystems that rely
> on specific guarantees that are broken by the patch, then it needs either
> a clean fix or a revert.

If the reduction in umount speed is really unacceptable, I think we can
try the patch[1] from Kirill Tkhai. At least the granularity of the
shrinker rwsem lock is reduced, and the file system code does not need
to be modified.

And IIUC, after clearing SHRINKER_REGISTERED under the write lock of
shrinker->rwsem, we can guarantee that the shrinker won't be running, so
synchronize_srcu() no longer needs to be called in unregister_shrinker(),
and we don't need to split unregister_shrinker().

[1]. 
https://lore.kernel.org/lkml/bab60fe4-964c-43a6-ecce-4cbd4981d875@ya.ru/

> 
>>
>> --D
>>
>>> And I used bpftrace to measure the time consumption of
>>> unregister_shrinker():
>>>
>>> ```
>>> And I just tested it on a physical machine (Intel(R) Xeon(R) Platinum
>>> 8260 CPU @ 2.40GHz) and the results are as follows:
>>>
>>> 1) use synchronize_srcu():
>>>
>>> @ns[umount]:
>>> [8K, 16K)             83 |@@@@@@@       |
>>> [16K, 32K)           578 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K)            78 |@@@@@@@       |
>>> [64K, 128K)            6 |       |
>>> [128K, 256K)           7 |       |
>>> [256K, 512K)          29 |@@       |
>>> [512K, 1M)            51 |@@@@      |
>>> [1M, 2M)              90 |@@@@@@@@       |
>>> [2M, 4M)              70 |@@@@@@      |
>>> [4M, 8M)               8 |      |
>>>
>>> 2) use synchronize_srcu_expedited():
>>>
>>> @ns[umount]:
>>> [8K, 16K)             31 |@@       |
>>> [16K, 32K)           803 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K)           158 |@@@@@@@@@@       |
>>> [64K, 128K)            4 |       |
>>> [128K, 256K)           2 |       |
>>> [256K, 512K)           2 |       |
>>> ```
>>>
>>> With synchronize_srcu(), most of the time consumption is between 16us and
>>> 32us, with the worst case between 4ms and 8ms. Is this totally
>>> unacceptable?
>>>
>>> This performance regression report comes from a stress test. Will the
>>> umount action be executed so frequently under real workloads?
>>>
>>> If these numbers are really unacceptable, then after applying the newly
>>> proposed method, umount will be as fast as before (or even faster).
>>>
>>> Thanks,
>>> Qi
>>>
>>>>
>>>> The previous code was not broken and it provided specific guarantees
>>>> to subsystems via unregister_shrinker(). From the above discussion,
>>>> it appears that the original authors of these changes either did not
>>>> know about or did not understand them, so that casts doubt in my
>>>> mind about the attempted solution and all the proposed fixes for it.
>>>>
>>>> I don't have the time right now to unravel this mess and fully
>>>> understand the original problem, changes or the band-aids that are
>>>> being thrown around. We are also getting quite late in the cycle to
>>>> be doing major surgery to critical infrastructure, especially as it
>>>> gives so little time to review and regression-test whatever new solution
>>>> is proposed.
>>>>
>>>> Given this appears to be a change introduced in 6.4-rc1, I think the
>>>> right thing to do is to revert the change rather than make things
>>>> worse by trying to shove some "quick fix" into the kernel to address
>>>> it.
>>>>
>>>> Andrew, could you please sort out a series to revert this shrinker
>>>> infrastructure change and all the dependent hacks that have been
>>>> added to try to fix it so far?
>>>>
>>>> -Dave.
>>>
>>> -- 
>>> Thanks,
>>> Qi

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-06-05 12:17 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-31  9:57 [PATCH 0/8] make unregistration of super_block shrinker more faster Qi Zheng
2023-05-31  9:57 ` [PATCH 1/8] mm: vmscan: move shrinker_debugfs_remove() before synchronize_srcu() Qi Zheng
2023-05-31 10:49   ` Christian Brauner
2023-05-31 12:52     ` Qi Zheng
2023-05-31  9:57 ` [PATCH 2/8] mm: vmscan: split unregister_shrinker() Qi Zheng
2023-05-31 22:57   ` Dave Chinner
2023-05-31  9:57 ` [PATCH 3/8] fs: move list_lru_destroy() to destroy_super_work() Qi Zheng
2023-05-31 23:00   ` Dave Chinner
2023-05-31  9:57 ` [PATCH 4/8] fs: shrink only (SB_ACTIVE|SB_BORN) superblocks in super_cache_scan() Qi Zheng
2023-05-31 23:12   ` Dave Chinner
2023-05-31  9:57 ` [PATCH 5/8] fs: introduce struct super_operations::destroy_super() callback Qi Zheng
2023-05-31 11:19   ` Christian Brauner
2023-05-31 12:54     ` Qi Zheng
2023-05-31  9:57 ` [PATCH 6/8] xfs: introduce xfs_fs_destroy_super() Qi Zheng
2023-05-31 23:48   ` Dave Chinner
2023-06-01  8:43     ` Qi Zheng
2023-06-01  9:58       ` Christian Brauner
2023-06-01 23:06       ` Dave Chinner
2023-06-02  3:13         ` Qi Zheng
2023-06-02 15:15           ` Darrick J. Wong
2023-06-05 11:50             ` Christian Brauner
2023-06-05 12:16               ` Qi Zheng
2023-05-31  9:57 ` [PATCH 7/8] shmem: implement shmem_destroy_super() Qi Zheng
2023-05-31  9:57 ` [PATCH 8/8] fs: use unregister_shrinker_delayed_{initiate, finalize} for super_block shrinker Qi Zheng
     [not found] ` <20230531114054.bf077db642aa9c58c0831687@linux-foundation.org>
2023-06-01  8:46   ` [PATCH 0/8] make unregistration of super_block shrinker more faster Qi Zheng
