From: Christian Brauner <christian.brauner@ubuntu.com>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Peter Zijlstra <peterz@infradead.org>,
	John Ogness <john.ogness@linutronix.de>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [RFC PATCH] fs/namespace: use percpu_rw_semaphore for writer holding
Date: Mon, 25 Oct 2021 11:15:04 +0200	[thread overview]
Message-ID: <20211025091504.6k7d57awbfpqmmqs@wittgenstein> (raw)
In-Reply-To: <20211021220102.bm5bvldjtzsabbfn@linutronix.de>

On Fri, Oct 22, 2021 at 12:01:02AM +0200, Sebastian Andrzej Siewior wrote:
> From: John Ogness <john.ogness@linutronix.de>
> 
> The MNT_WRITE_HOLD flag is used to hold back any new writers while the
> mount point is about to be made read-only. __mnt_want_write() then loops
> with disabled preemption until this flag disappears. Callers of
> mnt_hold_writers() (which sets the flag) hold the spinlock_t side of
> mount_lock (a seqlock_t), which disables preemption on !PREEMPT_RT and
> ensures the task is not scheduled away, so the spinning side never
> spins for long.
> 
> On PREEMPT_RT the spinlock_t does not disable preemption, so it is
> possible that the task setting MNT_WRITE_HOLD is preempted by a task
> with higher priority, which then spins forever waiting for
> MNT_WRITE_HOLD to be cleared.
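
For reference, the spinning fast path described here is the hunk removed
from __mnt_want_write() further down in the diff; condensed to a sketch:

	preempt_disable();
	mnt_inc_writers(mnt);
	smp_mb();
	while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD)
		cpu_relax();	/* on PREEMPT_RT a higher-priority task can
				 * loop here forever if it preempted the task
				 * that set MNT_WRITE_HOLD under mount_lock */
	smp_rmb();
	/* ... readonly check ... */
	preempt_enable();
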
> 
> To avoid the spinning, the synchronisation mechanism is replaced by a
> per-CPU rwsem. The rwsem has been added to struct super_block instead
> of struct mount because sb_prepare_remount_readonly() iterates over the
> mount points while holding a spinlock_t, and the per-CPU rwsem must not
> be acquired with preemption disabled. This could be avoided by adding a
> mutex to protect just the list add/del operations. There is also the
> lockdep problem that it complains about locking multiple locks of the
> same class, which it considers a potential deadlock…
> 
> That problem (acquiring multiple locks of the same class) now occurs in
> do_mount_setattr().

Iiuc, this only becomes a problem when switching to the rwsem. I stared
at and tested that locking for a good few weeks last year to make sure
it's sane. It's not as neatly documented as it probably should be.

> As far as I've seen, all add/remove operations on the list behind
> next_mnt() are performed with namespace_lock() acquired, so the rwsem
> can be acquired before lock_mount_hash() with namespace_lock() held.
> Since multiple mount points may belong to different super blocks it is
> necessary to acquire multiple rwsems of the same kind, and this is
> where lockdep complains. I *assume* that it is possible that two different

Afaict, in that case it would be false-positive.
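
I.e., with two mounts from two *different* superblocks you would get
something like this (sb_a/sb_b are made-up names, just to illustrate):

	/* different rwsems, but the same lock class: without a subclass
	 * annotation lockdep flags the second acquisition even though no
	 * deadlock is possible; that's what the patch's
	 * percpu_down_write_nested() subclass counting is meant to express */
	percpu_down_write_nested(&sb_a->mnt_writers_rws, 0);
	percpu_down_write_nested(&sb_b->mnt_writers_rws, 1);
	...
	percpu_up_write(&sb_b->mnt_writers_rws);
	percpu_up_write(&sb_a->mnt_writers_rws);

(Though iirc lockdep only supports a small, fixed number of subclasses,
so an ever-increasing subclass as in the patch probably doesn't scale to
larger propagation trees anyway.)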

> mount points belong to the same super block at which point the code in

Yes, that's possible and it's not a rare case.

> do_mount_setattr() really deadlocks.

Yes, that would deadlock, as you would be reacquiring the same rwsem of
a superblock that has multiple mounts.
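
I.e., spelled out (m1/m2 being two hypothetical mounts that both belong
to the same superblock, as the recursive walk in do_mount_setattr()
would visit them):

	percpu_down_write_nested(&m1->mnt.mnt_sb->mnt_writers_rws, 0);
	percpu_down_write_nested(&m2->mnt.mnt_sb->mnt_writers_rws, 1);
	/* m1->mnt.mnt_sb == m2->mnt.mnt_sb, so this is the same rwsem
	 * again and the writer blocks on itself -> deadlock */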

> 
> So now, having written all this down and reevaluated all options, it
> might be easier to just do lock_mount_hash()+unlock on PREEMPT_RT to
> PI-boost the owner of MNT_WRITE_HOLD and thereby avoid spinning in
> __mnt_want_write().

So the locking would be unchanged on non-rt kernels and only change on
rt kernels? Would it be ok to send that patch for comparison, if it's
not too much work?
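
I would expect that to be a fairly small change to the loop in
__mnt_want_write(), something along these lines (untested, only meant
as a sketch of the idea, not a finished patch):

	while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) {
		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
			cpu_relax();
		} else {
			/* mount_lock's spinlock is an rtmutex on RT, so
			 * blocking on it PI-boosts the MNT_WRITE_HOLD
			 * holder instead of spinning against it */
			preempt_enable();
			lock_mount_hash();
			unlock_mount_hash();
			preempt_disable();
		}
	}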

Christian

> 
> [bigeasy: Forward port the old patch, new description and a gross hack
> 	  to let lockdep not complain about the deadlock while running a
> 	  few tests.]
> 
> Link: https://lore.kernel.org/r/20211021220102.bm5bvldjtzsabbfn@linutronix.de
> Co-developed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: John Ogness <john.ogness@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  fs/namespace.c                | 135 ++++++++++++++--------------------
>  fs/super.c                    |   3 +
>  include/linux/fs.h            |   1 +
>  include/linux/mount.h         |   3 +-
>  include/linux/percpu-rwsem.h  |   7 ++
>  kernel/locking/percpu-rwsem.c |  23 +++++-
>  6 files changed, 88 insertions(+), 84 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 659a8f39c61af..fb9bb813bca3c 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -333,30 +333,17 @@ static int mnt_is_readonly(struct vfsmount *mnt)
>  int __mnt_want_write(struct vfsmount *m)
>  {
>  	struct mount *mnt = real_mount(m);
> +	struct super_block *sb = m->mnt_sb;
>  	int ret = 0;
>  
> -	preempt_disable();
> -	mnt_inc_writers(mnt);
> -	/*
> -	 * The store to mnt_inc_writers must be visible before we pass
> -	 * MNT_WRITE_HOLD loop below, so that the slowpath can see our
> -	 * incremented count after it has set MNT_WRITE_HOLD.
> -	 */
> -	smp_mb();
> -	while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD)
> -		cpu_relax();
> -	/*
> -	 * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will
> -	 * be set to match its requirements. So we must not load that until
> -	 * MNT_WRITE_HOLD is cleared.
> -	 */
> -	smp_rmb();
> -	if (mnt_is_readonly(m)) {
> -		mnt_dec_writers(mnt);
> -		ret = -EROFS;
> -	}
> -	preempt_enable();
> +	percpu_down_read(&sb->mnt_writers_rws);
>  
> +	if (mnt_is_readonly(m))
> +		ret = -EROFS;
> +	else
> +		mnt_inc_writers(mnt);
> +
> +	percpu_up_read(&sb->mnt_writers_rws);
>  	return ret;
>  }
>  
> @@ -435,9 +422,11 @@ EXPORT_SYMBOL_GPL(mnt_want_write_file);
>   */
>  void __mnt_drop_write(struct vfsmount *mnt)
>  {
> -	preempt_disable();
> +	struct super_block *sb = mnt->mnt_sb;
> +
> +	percpu_down_read(&sb->mnt_writers_rws);
>  	mnt_dec_writers(real_mount(mnt));
> -	preempt_enable();
> +	percpu_up_read(&sb->mnt_writers_rws);
>  }
>  
>  /**
> @@ -470,28 +459,15 @@ EXPORT_SYMBOL(mnt_drop_write_file);
>  
>  static inline int mnt_hold_writers(struct mount *mnt)
>  {
> -	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
> -	/*
> -	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
> -	 * should be visible before we do.
> -	 */
> -	smp_mb();
> +	lockdep_assert_held(&mnt->mnt.mnt_sb->mnt_writers_rws);
>  
>  	/*
>  	 * With writers on hold, if this value is zero, then there are
> -	 * definitely no active writers (although held writers may subsequently
> -	 * increment the count, they'll have to wait, and decrement it after
> -	 * seeing MNT_READONLY).
> +	 * definitely no active writers.
>  	 *
>  	 * It is OK to have counter incremented on one CPU and decremented on
> -	 * another: the sum will add up correctly. The danger would be when we
> -	 * sum up each counter, if we read a counter before it is incremented,
> -	 * but then read another CPU's count which it has been subsequently
> -	 * decremented from -- we would see more decrements than we should.
> -	 * MNT_WRITE_HOLD protects against this scenario, because
> -	 * mnt_want_write first increments count, then smp_mb, then spins on
> -	 * MNT_WRITE_HOLD, so it can't be decremented by another CPU while
> -	 * we're counting up here.
> +	 * another: the sum will add up correctly. The rwsem ensures that the
> +	 * counters are not modified once the writer lock is acquired.
>  	 */
>  	if (mnt_get_writers(mnt) > 0)
>  		return -EBUSY;
> @@ -499,16 +475,6 @@ static inline int mnt_hold_writers(struct mount *mnt)
>  	return 0;
>  }
>  
> -static inline void mnt_unhold_writers(struct mount *mnt)
> -{
> -	/*
> -	 * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
> -	 * that become unheld will see MNT_READONLY.
> -	 */
> -	smp_wmb();
> -	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
> -}
> -
>  static int mnt_make_readonly(struct mount *mnt)
>  {
>  	int ret;
> @@ -516,7 +482,6 @@ static int mnt_make_readonly(struct mount *mnt)
>  	ret = mnt_hold_writers(mnt);
>  	if (!ret)
>  		mnt->mnt.mnt_flags |= MNT_READONLY;
> -	mnt_unhold_writers(mnt);
>  	return ret;
>  }
>  
> @@ -525,15 +490,15 @@ int sb_prepare_remount_readonly(struct super_block *sb)
>  	struct mount *mnt;
>  	int err = 0;
>  
> -	/* Racy optimization.  Recheck the counter under MNT_WRITE_HOLD */
> +	/* Racy optimization.  Recheck the counter under mnt_writers_rws. */
>  	if (atomic_long_read(&sb->s_remove_count))
>  		return -EBUSY;
>  
> +	percpu_down_write(&sb->mnt_writers_rws);
>  	lock_mount_hash();
> +
>  	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
>  		if (!(mnt->mnt.mnt_flags & MNT_READONLY)) {
> -			mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
> -			smp_mb();
>  			if (mnt_get_writers(mnt) > 0) {
>  				err = -EBUSY;
>  				break;
> @@ -547,11 +512,9 @@ int sb_prepare_remount_readonly(struct super_block *sb)
>  		sb->s_readonly_remount = 1;
>  		smp_wmb();
>  	}
> -	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> -		if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD)
> -			mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
> -	}
> +
>  	unlock_mount_hash();
> +	percpu_up_write(&sb->mnt_writers_rws);
>  
>  	return err;
>  }
> @@ -1068,7 +1031,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
>  	}
>  
>  	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
> -	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
> +	mnt->mnt.mnt_flags &= ~(MNT_MARKED|MNT_INTERNAL);
>  
>  	atomic_inc(&sb->s_active);
>  	mnt->mnt.mnt_userns = mnt_user_ns(&old->mnt);
> @@ -2585,8 +2548,8 @@ static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *
>   */
>  static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
>  {
> -	struct super_block *sb = path->mnt->mnt_sb;
>  	struct mount *mnt = real_mount(path->mnt);
> +	struct super_block *sb = mnt->mnt.mnt_sb;
>  	int ret;
>  
>  	if (!check_mnt(mnt))
> @@ -2603,11 +2566,13 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
>  	 * changing it, so only take down_read(&sb->s_umount).
>  	 */
>  	down_read(&sb->s_umount);
> +	percpu_down_write(&sb->mnt_writers_rws);
>  	lock_mount_hash();
>  	ret = change_mount_ro_state(mnt, mnt_flags);
>  	if (ret == 0)
>  		set_mount_attributes(mnt, mnt_flags);
>  	unlock_mount_hash();
> +	percpu_up_write(&sb->mnt_writers_rws);
>  	up_read(&sb->s_umount);
>  
>  	mnt_warn_timestamp_expiry(path, &mnt->mnt);
> @@ -4027,15 +3992,6 @@ static void mount_setattr_commit(struct mount_kattr *kattr,
>  			WRITE_ONCE(m->mnt.mnt_flags, flags);
>  		}
>  
> -		/*
> -		 * We either set MNT_READONLY above so make it visible
> -		 * before ~MNT_WRITE_HOLD or we failed to recursively
> -		 * apply mount options.
> -		 */
> -		if ((kattr->attr_set & MNT_READONLY) &&
> -		    (m->mnt.mnt_flags & MNT_WRITE_HOLD))
> -			mnt_unhold_writers(m);
> -
>  		if (!err && kattr->propagation)
>  			change_mnt_propagation(m, kattr->propagation);
>  
> @@ -4054,26 +4010,39 @@ static void mount_setattr_commit(struct mount_kattr *kattr,
>  static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
>  {
>  	struct mount *mnt = real_mount(path->mnt), *last = NULL;
> +	bool ns_lock = false;
>  	int err = 0;
>  
>  	if (path->dentry != mnt->mnt.mnt_root)
>  		return -EINVAL;
>  
> -	if (kattr->propagation) {
> +	if ((kattr->attr_set & MNT_READONLY) || kattr->propagation)
>  		/*
>  		 * Only take namespace_lock() if we're actually changing
> -		 * propagation.
> +		 * propagation or iterate via next_mnt().
>  		 */
> +		ns_lock = true;
> +
> +	if (ns_lock)
>  		namespace_lock();
> -		if (kattr->propagation == MS_SHARED) {
> -			err = invent_group_ids(mnt, kattr->recurse);
> -			if (err) {
> -				namespace_unlock();
> -				return err;
> -			}
> +
> +	if (kattr->propagation == MS_SHARED) {
> +		err = invent_group_ids(mnt, kattr->recurse);
> +		if (err) {
> +			namespace_unlock();
> +			return err;
>  		}
>  	}
>  
> +	if (kattr->attr_set & MNT_READONLY) {
> +		struct mount *m = mnt;
> +		int num = 0;
> +
> +		do {
> +			percpu_down_write_nested(&m->mnt.mnt_sb->mnt_writers_rws, num++);
> +		} while (kattr->recurse && (m = next_mnt(m, mnt)));
> +
> +	}
>  	lock_mount_hash();
>  
>  	/*
> @@ -4086,8 +4055,18 @@ static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
>  
>  	unlock_mount_hash();
>  
> -	if (kattr->propagation) {
> +	if (kattr->attr_set & MNT_READONLY) {
> +		struct mount *m = mnt;
> +
> +		do {
> +			percpu_up_write(&m->mnt.mnt_sb->mnt_writers_rws);
> +		} while (kattr->recurse && (m = next_mnt(m, mnt)));
> +
> +	}
> +	if (ns_lock)
>  		namespace_unlock();
> +
> +	if (kattr->propagation) {
>  		if (err)
>  			cleanup_group_ids(mnt, NULL);
>  	}
> diff --git a/fs/super.c b/fs/super.c
> index bcef3a6f4c4b5..9c828a05e145f 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -162,6 +162,7 @@ static void destroy_super_work(struct work_struct *work)
>  
>  	for (i = 0; i < SB_FREEZE_LEVELS; i++)
>  		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> +	percpu_free_rwsem(&s->mnt_writers_rws);
>  	kfree(s);
>  }
>  
> @@ -210,6 +211,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>  	INIT_LIST_HEAD(&s->s_mounts);
>  	s->s_user_ns = get_user_ns(user_ns);
>  	init_rwsem(&s->s_umount);
> +	percpu_init_rwsem(&s->mnt_writers_rws);
> +
>  	lockdep_set_class(&s->s_umount, &type->s_umount_key);
>  	/*
>  	 * sget() can have s_umount recursion.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e7a633353fd20..2c7b67dfe4303 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1497,6 +1497,7 @@ struct super_block {
>  #endif
>  	struct hlist_bl_head	s_roots;	/* alternate root dentries for NFS */
>  	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
> +	struct percpu_rw_semaphore mnt_writers_rws;
>  	struct block_device	*s_bdev;
>  	struct backing_dev_info *s_bdi;
>  	struct mtd_info		*s_mtd;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 5d92a7e1a742d..d51f5b5f7cfea 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -33,7 +33,6 @@ struct fs_context;
>  #define MNT_NOSYMFOLLOW	0x80
>  
>  #define MNT_SHRINKABLE	0x100
> -#define MNT_WRITE_HOLD	0x200
>  
>  #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
>  #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
> @@ -50,7 +49,7 @@ struct fs_context;
>  				 | MNT_READONLY | MNT_NOSYMFOLLOW)
>  #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
>  
> -#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
> +#define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_INTERNAL | \
>  			    MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED | \
>  			    MNT_CURSOR)
>  
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 5fda40f97fe91..db357f306ad6a 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -121,6 +121,13 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
>  	preempt_enable();
>  }
>  
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +extern void percpu_down_write_nested(struct percpu_rw_semaphore *, int subclass);
> +#else
> +# define percpu_down_write_nested(lock, subclass)		\
> +	percpu_down_write(((void)(subclass), (lock)))
> +#endif
> +
>  extern void percpu_down_write(struct percpu_rw_semaphore *);
>  extern void percpu_up_write(struct percpu_rw_semaphore *);
>  
> diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> index 70a32a576f3f2..2065073b7d101 100644
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -211,11 +211,8 @@ static bool readers_active_check(struct percpu_rw_semaphore *sem)
>  	return true;
>  }
>  
> -void percpu_down_write(struct percpu_rw_semaphore *sem)
> +static void _percpu_down_write(struct percpu_rw_semaphore *sem)
>  {
> -	might_sleep();
> -	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
> -
>  	/* Notify readers to take the slow path. */
>  	rcu_sync_enter(&sem->rss);
>  
> @@ -237,8 +234,26 @@ void percpu_down_write(struct percpu_rw_semaphore *sem)
>  	/* Wait for all active readers to complete. */
>  	rcuwait_wait_event(&sem->writer, readers_active_check(sem), TASK_UNINTERRUPTIBLE);
>  }
> +
> +void percpu_down_write(struct percpu_rw_semaphore *sem)
> +{
> +	might_sleep();
> +	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
> +	_percpu_down_write(sem);
> +}
>  EXPORT_SYMBOL_GPL(percpu_down_write);
>  
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +void percpu_down_write_nested(struct percpu_rw_semaphore *sem, int subclass)
> +{
> +	might_sleep();
> +	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
> +
> +	_percpu_down_write(sem);
> +}
> +EXPORT_SYMBOL_GPL(percpu_down_write_nested);
> +#endif
> +
>  void percpu_up_write(struct percpu_rw_semaphore *sem)
>  {
>  	rwsem_release(&sem->dep_map, _RET_IP_);
> -- 
> 2.33.0

Thread overview: 6+ messages
2021-10-21 22:01 [RFC PATCH] fs/namespace: use percpu_rw_semaphore for writer holding Sebastian Andrzej Siewior
2021-10-25  9:15 ` Christian Brauner [this message]
2021-10-25  9:20   ` Sebastian Andrzej Siewior
2021-10-25 15:22   ` [PATCH] fs/namespace: Boost the mount_lock.lock owner instead of spinning on PREEMPT_RT Sebastian Andrzej Siewior
2021-10-26 21:06     ` Christian Brauner
2021-10-27  6:42       ` Sebastian Andrzej Siewior
