All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/1] fs: remove retry loop
@ 2020-06-17 10:40 John Ogness
  2020-06-17 10:40 ` [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop John Ogness
  0 siblings, 1 reply; 5+ messages in thread
From: John Ogness @ 2020-06-17 10:40 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Sebastian Andrzej Siewior, Peter Zijlstra, Thomas Gleixner,
	linux-fsdevel, linux-kernel

Hello,

This patch removes the last retry loop in VFS. It is partially
reverting

   commit d3ef3d7351cc ("fs: mnt_want_write speedup")

by re-introducing per-cpu spinlocks for each mount. The patch
includes benchmark results in the diffstat section to show that the
previous optimization work is not undone.

I would have liked to use a percpu_rw_semaphore per mount instead
of the many spinlocks. However, percpu_rw_semaphore can sleep, which
is a problem for sb_prepare_remount_readonly() since it needs to
take the spinlock in @mount_lock in order to iterate @sb->s_mounts.
Perhaps using a mutex to sychronize @sb->s_mounts is an option. I am
not sure. That is why this is an RFC.

I am suggesting this partial revert because it removes the retry
loop and does not show any obvious negative benchmark effects.

John Ogness (1):
  fs/namespace.c: use spinlock instead of busy loop

 fs/mount.h     |   7 +++
 fs/namespace.c | 118 +++++++++++++++++++++++++++++++++----------------
 2 files changed, 86 insertions(+), 39 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop
  2020-06-17 10:40 [RFC PATCH 0/1] fs: remove retry loop John Ogness
@ 2020-06-17 10:40 ` John Ogness
  2020-06-17 11:26   ` Peter Zijlstra
  2020-06-17 11:27   ` Peter Zijlstra
  0 siblings, 2 replies; 5+ messages in thread
From: John Ogness @ 2020-06-17 10:40 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Sebastian Andrzej Siewior, Peter Zijlstra, Thomas Gleixner,
	linux-fsdevel, linux-kernel

The MNT_WRITE_HOLD flag is used to manually implement a per-cpu
optimized rwsem using busy waiting. This implementation is a problem
on PREEMPT_RT because write_seqlock() on @mount_lock (i.e. taking a
spinlock) does not disable preemption. This allows a writer to
preempt a task that has set MNT_WRITE_HOLD and thus that writer will
live lock in __mnt_want_write() due to busy looping with preemption
disabled.

Reimplement the same semantics using per-cpu spinlocks. This
provides lockdep coverage and makes the code RT ready.

Since this reverts some of the optimization work of
  commit d3ef3d7351cc ("fs: mnt_want_write speedup")
lmbench lat_mmap tests were performed to verify that there is no
obvious performance degradation.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 Here is the detailed test information...

 TEST COMMAND
 lat_mmap -P $pval -W 32 -N 50 64m file

 OUTPUT FORMAT
 pval: avg std

 RESULTS (32 CPUs)
 No Forced Preemption
           BEFORE         AFTER
  1:   275.60  1.82   274.40  0.55
  2:   296.20  3.83   286.80  1.92
  4:   310.20  4.44   304.40  2.51
  8:   359.20  2.28   357.80  2.95
 16:   417.40  2.51   412.20  3.90
 32:   625.60  2.07   622.00  3.08
 64:  1202.60 15.87  1202.20  6.14

 No Forced Preemption, no PTI
           BEFORE         AFTER
  1:   278.00  2.12   274.40  1.52
  2:   301.00  3.67   289.80  6.06
  4:   333.40  7.73   303.80  2.39
  8:   389.80  3.56   351.80  3.42
 16:   425.00  3.46   408.20  4.87
 32:   606.00  1.22   605.60  1.82
 64:  1193.60  7.09  1184.80  4.27

 Voluntary Kernel Preemption
           BEFORE         AFTER
  1:   277.80  1.30   278.20  1.10
  2:   291.20  0.84   286.60  2.30
  4:   310.00  1.87   304.80  1.30
  8:   360.00  2.55   354.60  1.14
 16:   414.00  1.58   414.00  2.35
 32:   619.60  5.50   607.00  3.74
 64:  1224.00  8.40  1219.60  6.19

 Voluntary Kernel Preemption, no PTI
           BEFORE         AFTER
  1:   277.80  4.66   276.40  0.89
  2:   291.40  6.54   286.40  3.58
  4:   310.00  1.22   315.40  1.14
  8:   357.20  0.84   361.40  2.61
 16:   405.60  2.88   407.60  2.51
 32:   615.40  2.30   611.60  5.55
 64:  1219.80  9.91  1207.40 10.88

 Preemptible Kernel
           BEFORE         AFTER
  1:   283.80  0.45   286.80  0.84
  2:   293.40  2.51   294.40  3.51
  4:   318.20  1.30   315.60  1.95
  8:   367.00  0.71   363.00  1.22
 16:   416.20  1.64   413.20  4.87
 32:   628.80  2.28   617.40  2.97
 64:  1277.20  9.88  1254.20  4.87

 Preemptible Kernel, no PTI
           BEFORE         AFTER
  1:   283.00  1.73   288.40  1.67
  2:   305.80  2.05   297.00  3.24
  4:   321.40  4.34   318.60  2.79
  8:   370.20  2.39   366.40  2.70
 16:   413.20  3.11   412.40  2.41
 32:   616.40  2.61   620.20  2.05
 64:  1266.00  6.48  1255.80  3.90

 fs/mount.h     |   7 +++
 fs/namespace.c | 118 +++++++++++++++++++++++++++++++++----------------
 2 files changed, 86 insertions(+), 39 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index c7abb7b394d8..627e635cd7d1 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -28,6 +28,12 @@ struct mnt_namespace {
 struct mnt_pcp {
 	int mnt_count;
 	int mnt_writers;
+	/*
+	 * If holding multiple instances of this lock, they
+	 * must be ordered by cpu number.
+	 */
+	spinlock_t mnt_writers_lock;
+	struct lock_class_key mnt_writers_lock_key;
 };
 
 struct mountpoint {
@@ -51,6 +57,7 @@ struct mount {
 #else
 	int mnt_count;
 	int mnt_writers;
+	spinlock_t mnt_writers_lock;
 #endif
 	struct list_head mnt_mounts;	/* list of children, anchored here */
 	struct list_head mnt_child;	/* and going through their mnt_child */
diff --git a/fs/namespace.c b/fs/namespace.c
index f30ed401cc6d..e292c91f966d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -177,6 +177,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 	struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
 	if (mnt) {
 		int err;
+		int cpu;
 
 		err = mnt_alloc_id(mnt);
 		if (err)
@@ -194,9 +195,21 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_devname;
 
 		this_cpu_add(mnt->mnt_pcp->mnt_count, 1);
+
+		for_each_possible_cpu(cpu) {
+			struct mnt_pcp *writer =
+					per_cpu_ptr(mnt->mnt_pcp, cpu);
+
+			lockdep_register_key(&writer->mnt_writers_lock_key);
+			spin_lock_init(&writer->mnt_writers_lock);
+			lockdep_set_class(&writer->mnt_writers_lock,
+					  &writer->mnt_writers_lock_key);
+		}
 #else
+		(void)cpu;
 		mnt->mnt_count = 1;
 		mnt->mnt_writers = 0;
+		spin_lock_init(&mnt->mnt_writers_lock);
 #endif
 
 		INIT_HLIST_NODE(&mnt->mnt_hash);
@@ -311,29 +324,26 @@ static int mnt_is_readonly(struct vfsmount *mnt)
 int __mnt_want_write(struct vfsmount *m)
 {
 	struct mount *mnt = real_mount(m);
+	spinlock_t *lock;
 	int ret = 0;
 
-	preempt_disable();
-	mnt_inc_writers(mnt);
-	/*
-	 * The store to mnt_inc_writers must be visible before we pass
-	 * MNT_WRITE_HOLD loop below, so that the slowpath can see our
-	 * incremented count after it has set MNT_WRITE_HOLD.
-	 */
-	smp_mb();
-	while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD)
-		cpu_relax();
-	/*
-	 * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
-	 * be set to match its requirements. So we must not load that until
-	 * MNT_WRITE_HOLD is cleared.
-	 */
-	smp_rmb();
-	if (mnt_is_readonly(m)) {
-		mnt_dec_writers(mnt);
+#ifdef CONFIG_SMP
+	migrate_disable();
+	lock = &this_cpu_ptr(mnt->mnt_pcp)->mnt_writers_lock;
+#else
+	lock = &mnt->mnt_writers_lock;
+#endif
+
+	spin_lock(lock);
+	if (mnt_is_readonly(m))
 		ret = -EROFS;
-	}
-	preempt_enable();
+	else
+		mnt_inc_writers(mnt);
+	spin_unlock(lock);
+
+#ifdef CONFIG_SMP
+	migrate_enable();
+#endif
 
 	return ret;
 }
@@ -459,17 +469,39 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
+static void mnt_lock_writers(struct mount *mnt)
+{
+#ifdef CONFIG_SMP
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		spin_lock(&per_cpu_ptr(mnt->mnt_pcp,
+				       cpu)->mnt_writers_lock);
+	}
+#else
+	spin_lock(&mnt->mnt_writers_lock);
+#endif
+}
+
+static void mnt_unlock_writers(struct mount *mnt)
+{
+#ifdef CONFIG_SMP
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		spin_unlock(&per_cpu_ptr(mnt->mnt_pcp,
+					 cpu)->mnt_writers_lock);
+	}
+#else
+	spin_unlock(&mnt->mnt_writers_lock);
+#endif
+}
+
 static int mnt_make_readonly(struct mount *mnt)
 {
 	int ret = 0;
 
 	lock_mount_hash();
-	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
-	/*
-	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
-	 * should be visible before we do.
-	 */
-	smp_mb();
 
 	/*
 	 * With writers on hold, if this value is zero, then there are
@@ -482,21 +514,17 @@ static int mnt_make_readonly(struct mount *mnt)
 	 * sum up each counter, if we read a counter before it is incremented,
 	 * but then read another CPU's count which it has been subsequently
 	 * decremented from -- we would see more decrements than we should.
-	 * MNT_WRITE_HOLD protects against this scenario, because
-	 * mnt_want_write first increments count, then smp_mb, then spins on
-	 * MNT_WRITE_HOLD, so it can't be decremented by another CPU while
-	 * we're counting up here.
+	 * mnt_writers_lock protects against this scenario, because all CPUs
+	 * are prevented from incrementing the counter until the summation of
+	 * all CPU counters is complete.
 	 */
+	mnt_lock_writers(mnt);
 	if (mnt_get_writers(mnt) > 0)
 		ret = -EBUSY;
 	else
 		mnt->mnt.mnt_flags |= MNT_READONLY;
-	/*
-	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
-	 * that become unheld will see MNT_READONLY.
-	 */
-	smp_wmb();
-	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+	mnt_unlock_writers(mnt);
+
 	unlock_mount_hash();
 	return ret;
 }
@@ -514,15 +542,15 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 	struct mount *mnt;
 	int err = 0;
 
-	/* Racy optimization.  Recheck the counter under MNT_WRITE_HOLD */
+	/* Racy optimization.  Recheck the counter under mnt_writers_lock. */
 	if (atomic_long_read(&sb->s_remove_count))
 		return -EBUSY;
 
 	lock_mount_hash();
 	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
 		if (!(mnt->mnt.mnt_flags & MNT_READONLY)) {
+			mnt_lock_writers(mnt);
 			mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
-			smp_mb();
 			if (mnt_get_writers(mnt) > 0) {
 				err = -EBUSY;
 				break;
@@ -537,8 +565,10 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 		smp_wmb();
 	}
 	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
-		if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD)
+		if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD) {
 			mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+			mnt_unlock_writers(mnt);
+		}
 	}
 	unlock_mount_hash();
 
@@ -1099,6 +1129,7 @@ static void cleanup_mnt(struct mount *mnt)
 {
 	struct hlist_node *p;
 	struct mount *m;
+	int cpu;
 	/*
 	 * The warning here probably indicates that somebody messed
 	 * up a mnt_want/drop_write() pair.  If this happens, the
@@ -1117,6 +1148,15 @@ static void cleanup_mnt(struct mount *mnt)
 	dput(mnt->mnt.mnt_root);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
+#ifdef CONFIG_SMP
+	for_each_possible_cpu(cpu) {
+		struct mnt_pcp *writer = per_cpu_ptr(mnt->mnt_pcp, cpu);
+
+		lockdep_unregister_key(&writer->mnt_writers_lock_key);
+	}
+#else
+	(void)cpu;
+#endif
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop
  2020-06-17 10:40 ` [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop John Ogness
@ 2020-06-17 11:26   ` Peter Zijlstra
  2020-06-17 11:27   ` Peter Zijlstra
  1 sibling, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2020-06-17 11:26 UTC (permalink / raw)
  To: John Ogness
  Cc: Alexander Viro, Sebastian Andrzej Siewior, Thomas Gleixner,
	linux-fsdevel, linux-kernel

On Wed, Jun 17, 2020 at 12:46:58PM +0206, John Ogness wrote:
> @@ -459,17 +469,39 @@ void mnt_drop_write_file(struct file *file)
>  }
>  EXPORT_SYMBOL(mnt_drop_write_file);
>  
> +static void mnt_lock_writers(struct mount *mnt)
> +{
> +#ifdef CONFIG_SMP
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		spin_lock(&per_cpu_ptr(mnt->mnt_pcp,
> +				       cpu)->mnt_writers_lock);
> +	}
> +#else
> +	spin_lock(&mnt->mnt_writers_lock);
> +#endif
> +}
> +
> +static void mnt_unlock_writers(struct mount *mnt)
> +{
> +#ifdef CONFIG_SMP
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		spin_unlock(&per_cpu_ptr(mnt->mnt_pcp,
> +					 cpu)->mnt_writers_lock);
> +	}
> +#else
> +	spin_unlock(&mnt->mnt_writers_lock);
> +#endif
> +}

*groan*.. this is brlock reincarnate :-/ Also broken.



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop
  2020-06-17 10:40 ` [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop John Ogness
  2020-06-17 11:26   ` Peter Zijlstra
@ 2020-06-17 11:27   ` Peter Zijlstra
  2020-06-17 12:09     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2020-06-17 11:27 UTC (permalink / raw)
  To: John Ogness
  Cc: Alexander Viro, Sebastian Andrzej Siewior, Thomas Gleixner,
	linux-fsdevel, linux-kernel

On Wed, Jun 17, 2020 at 12:46:58PM +0206, John Ogness wrote:
> The MNT_WRITE_HOLD flag is used to manually implement a per-cpu
> optimized rwsem using busy waiting. This implementation is a problem
> on PREEMPT_RT because write_seqlock() on @mount_lock (i.e. taking a
> spinlock) does not disable preemption. This allows a writer to
> preempt a task that has set MNT_WRITE_HOLD and thus that writer will
> live lock in __mnt_want_write() due to busy looping with preemption
> disabled.
> 

Why can't you use a regular percpu-rwsem for this?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop
  2020-06-17 11:27   ` Peter Zijlstra
@ 2020-06-17 12:09     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-06-17 12:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Ogness, Alexander Viro, Thomas Gleixner, linux-fsdevel,
	linux-kernel

On 2020-06-17 13:27:19 [+0200], Peter Zijlstra wrote:
> Why can't you use a regular percpu-rwsem for this?

This is the percpu-rwsem version. The lock is per struct super_block not
per mount point. This is due to the restriction in
sb_prepare_remount_readonly().

diff --git a/fs/mount.h b/fs/mount.h
index 711a4093e475..344fae9931b5 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -4,6 +4,7 @@
 #include <linux/poll.h>
 #include <linux/ns_common.h>
 #include <linux/fs_pin.h>
+#include <linux/percpu-rwsem.h>
 
 struct mnt_namespace {
 	atomic_t		count;
@@ -40,6 +41,7 @@ struct mount {
 		struct rcu_head mnt_rcu;
 		struct llist_node mnt_llist;
 	};
+	struct percpu_rw_semaphore mnt_writers_rws;
 #ifdef CONFIG_SMP
 	struct mnt_pcp __percpu *mnt_pcp;
 #else
diff --git a/fs/namespace.c b/fs/namespace.c
index a28e4db075ed..cc05c922d7d6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -311,30 +311,17 @@ static int mnt_is_readonly(struct vfsmount *mnt)
 int __mnt_want_write(struct vfsmount *m)
 {
 	struct mount *mnt = real_mount(m);
+	struct super_block *sb = m->mnt_sb;
 	int ret = 0;
 
-	preempt_disable();
-	mnt_inc_writers(mnt);
-	/*
-	 * The store to mnt_inc_writers must be visible before we pass
-	 * MNT_WRITE_HOLD loop below, so that the slowpath can see our
-	 * incremented count after it has set MNT_WRITE_HOLD.
-	 */
-	smp_mb();
-	while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD)
-		cpu_relax();
-	/*
-	 * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
-	 * be set to match its requirements. So we must not load that until
-	 * MNT_WRITE_HOLD is cleared.
-	 */
-	smp_rmb();
-	if (mnt_is_readonly(m)) {
-		mnt_dec_writers(mnt);
+	percpu_down_read(&sb->mnt_writers_rws);
+
+	if (mnt_is_readonly(m))
 		ret = -EROFS;
-	}
-	preempt_enable();
+	else
+		mnt_inc_writers(mnt);
 
+	percpu_up_read(&sb->mnt_writers_rws);
 	return ret;
 }
 
@@ -459,53 +446,30 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
-static int mnt_make_readonly(struct mount *mnt)
+static int mnt_make_readonly(struct super_block *sb, struct mount *mnt)
 {
 	int ret = 0;
 
+	percpu_down_write(&sb->mnt_writers_rws);
 	lock_mount_hash();
-	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
-	/*
-	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
-	 * should be visible before we do.
-	 */
-	smp_mb();
 
-	/*
-	 * With writers on hold, if this value is zero, then there are
-	 * definitely no active writers (although held writers may subsequently
-	 * increment the count, they'll have to wait, and decrement it after
-	 * seeing MNT_READONLY).
-	 *
-	 * It is OK to have counter incremented on one CPU and decremented on
-	 * another: the sum will add up correctly. The danger would be when we
-	 * sum up each counter, if we read a counter before it is incremented,
-	 * but then read another CPU's count which it has been subsequently
-	 * decremented from -- we would see more decrements than we should.
-	 * MNT_WRITE_HOLD protects against this scenario, because
-	 * mnt_want_write first increments count, then smp_mb, then spins on
-	 * MNT_WRITE_HOLD, so it can't be decremented by another CPU while
-	 * we're counting up here.
-	 */
 	if (mnt_get_writers(mnt) > 0)
 		ret = -EBUSY;
 	else
 		mnt->mnt.mnt_flags |= MNT_READONLY;
-	/*
-	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
-	 * that become unheld will see MNT_READONLY.
-	 */
-	smp_wmb();
-	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+
 	unlock_mount_hash();
+	percpu_up_write(&sb->mnt_writers_rws);
 	return ret;
 }
 
-static int __mnt_unmake_readonly(struct mount *mnt)
+static int __mnt_unmake_readonly(struct super_block *sb, struct mount *mnt)
 {
+	percpu_down_write(&sb->mnt_writers_rws);
 	lock_mount_hash();
 	mnt->mnt.mnt_flags &= ~MNT_READONLY;
 	unlock_mount_hash();
+	percpu_up_write(&sb->mnt_writers_rws);
 	return 0;
 }
 
@@ -514,21 +478,22 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 	struct mount *mnt;
 	int err = 0;
 
-	/* Racy optimization.  Recheck the counter under MNT_WRITE_HOLD */
+	/* Racy optimization.  Recheck the counter under mnt_writers_rws. */
 	if (atomic_long_read(&sb->s_remove_count))
 		return -EBUSY;
 
+	percpu_down_write(&sb->mnt_writers_rws);
 	lock_mount_hash();
+
 	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
 		if (!(mnt->mnt.mnt_flags & MNT_READONLY)) {
-			mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
-			smp_mb();
 			if (mnt_get_writers(mnt) > 0) {
 				err = -EBUSY;
 				break;
 			}
 		}
 	}
+
 	if (!err && atomic_long_read(&sb->s_remove_count))
 		err = -EBUSY;
 
@@ -536,11 +501,9 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 		sb->s_readonly_remount = 1;
 		smp_wmb();
 	}
-	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
-		if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD)
-			mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
-	}
+
 	unlock_mount_hash();
+	percpu_up_write(&sb->mnt_writers_rws);
 
 	return err;
 }
@@ -1036,7 +999,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	}
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
-	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
+	mnt->mnt.mnt_flags &= ~(MNT_MARKED|MNT_INTERNAL);
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
@@ -2440,7 +2403,8 @@ static bool can_change_locked_flags(struct mount *mnt, unsigned int mnt_flags)
 	return true;
 }
 
-static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
+static int change_mount_ro_state(struct super_block *sb,
+				 struct mount *mnt, unsigned int mnt_flags)
 {
 	bool readonly_request = (mnt_flags & MNT_READONLY);
 
@@ -2448,9 +2412,9 @@ static int change_mount_ro_state(struct mount *mnt, unsigned int mnt_flags)
 		return 0;
 
 	if (readonly_request)
-		return mnt_make_readonly(mnt);
+		return mnt_make_readonly(sb, mnt);
 
-	return __mnt_unmake_readonly(mnt);
+	return __mnt_unmake_readonly(sb, mnt);
 }
 
 /*
@@ -2509,7 +2473,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
 		return -EPERM;
 
 	down_write(&sb->s_umount);
-	ret = change_mount_ro_state(mnt, mnt_flags);
+	ret = change_mount_ro_state(sb, mnt, mnt_flags);
 	if (ret == 0)
 		set_mount_attributes(mnt, mnt_flags);
 	up_write(&sb->s_umount);
diff --git a/fs/super.c b/fs/super.c
index a288cd60d2ae..b6036e5fbe8d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -210,6 +210,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	INIT_LIST_HEAD(&s->s_mounts);
 	s->s_user_ns = get_user_ns(user_ns);
 	init_rwsem(&s->s_umount);
+	percpu_init_rwsem(&s->mnt_writers_rws);
+
 	lockdep_set_class(&s->s_umount, &type->s_umount_key);
 	/*
 	 * sget() can have s_umount recursion.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45cc10cdf6dd..cc259b7c4ee8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1458,6 +1458,7 @@ struct super_block {
 #endif
 	struct hlist_bl_head	s_roots;	/* alternate root dentries for NFS */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
+	struct percpu_rw_semaphore mnt_writers_rws;
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index bf8cc4108b8f..8daf217108e8 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -32,7 +32,6 @@ struct fs_context;
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
 
 #define MNT_SHRINKABLE	0x100
-#define MNT_WRITE_HOLD	0x200
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
@@ -49,7 +48,7 @@ struct fs_context;
 				 | MNT_READONLY)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
-#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
+#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_INTERNAL | \
 			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
 
 #define MNT_INTERNAL	0x4000
-- 
2.27.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-06-17 12:09 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-17 10:40 [RFC PATCH 0/1] fs: remove retry loop John Ogness
2020-06-17 10:40 ` [RFC PATCH 1/1] fs/namespace.c: use spinlock instead of busy loop John Ogness
2020-06-17 11:26   ` Peter Zijlstra
2020-06-17 11:27   ` Peter Zijlstra
2020-06-17 12:09     ` Sebastian Andrzej Siewior

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.