[RFC][PATCH] burying long-dead rudiments in mntput_no_expire()

From: Al Viro <viro@zeniv.linux.org.uk>
To: linux-fsdevel@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [RFC][PATCH] burying long-dead rudiments in mntput_no_expire()
Date: Wed, 6 Jul 2022 14:52:18 +0100	[thread overview]
Message-ID: <YsWTkmAs53Wjf2nN@ZenIV> (raw)

	A bit of context: the original RCU conversion of struct mount
handling had several bugs, spotted and fixed only 5 years later, in
119e1ef80ecf "fix __legitimize_mnt()/mntput() race".  However, bits and
pieces of old broken approach hadn't been removed.

	legitimize_mnt() takes a non-counting reference the caller had
found after rcu_read_lock() and sampling of mount_lock seqcount and
tries to turn it into counting one if mount_lock seqcount had not been changed.

	The hard part is dealing with the possibility of race with final
mntput() - we want the successful fast path through legitimize_mnt()
to be lockless, so there's a possibility of legitimize_mnt()
incrementing refcount, only to recheck the mount_lock seqcount and
notice that it had been disturbed by what would've been the final
mntput().  In that case legitimize_mnt() needs to drop the reference
it has acquired and report failure.

	Original approach had been to have the final mntput() to
mark mount as doomed when it commits to killing it off, with
legitimize_mnt() doing mntput() if it noticed mount_lock disturbed
while it had been grabbing a reference and mntput checking if
it has already marked the mount doomed and leaving its destruction
to the thread that had done the marking.

	That had been racy - since this mntput() from legitimize_mnt()
might end up doing real work (if what would've been the final mntput()
had observed attempted increment by legitimize_mnt()) it can't be done
under rcu_read_lock(), but in case if the race went the other way round
and final mntput() had *not* seen an attempted increment, rcu_read_lock()
we are holding might be the only thing preventing freeing of struct
mount in question.

	Fortunately, legitimize_mnt() can tell one case from another by
grabbing mount_lock and checking whether the mount had been marked doomed
- if it had been marked we known that the final mntput() has already done
mnt_get_count() and not observed our increment, so we can just decrement
it quietly.  If it hadn't been marked, we know that we need to do full
mntput(), but we also know that it's safe to drop rcu_read_lock() -
mount won't get freed under us.

	That's what the commit in question had done - in effect,
it had taken the "is it already marked doomed?"  check from mntput()
to legitimize_mnt().  Which is the right thing to do, but we should've
removed that check from mntput() itself.  While we are at it, there's
no reason for mntput() to hold onto rcu_read_lock() past the handling of
"still mounted, we know it's not the final drop" case.

	Patch below takes the remnants out.  It should've been done
as part of the original fix; as it is, the magical mystery shite had
been left behind.

	Folks, could you give it some beating?  I realize that original
reproducers might be long gone, but...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

diff --git a/fs/namespace.c b/fs/namespace.c
index 68789f896f08..ad94f9e228ae 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1224,6 +1224,7 @@ static void mntput_no_expire(struct mount *mnt)
 		rcu_read_unlock();
 		return;
 	}
+	rcu_read_unlock();
 	lock_mount_hash();
 	/*
 	 * make sure that if __legitimize_mnt() has not seen us grab
@@ -1234,17 +1235,10 @@ static void mntput_no_expire(struct mount *mnt)
 	count = mnt_get_count(mnt);
 	if (count != 0) {
 		WARN_ON(count < 0);
-		rcu_read_unlock();
-		unlock_mount_hash();
-		return;
-	}
-	if (unlikely(mnt->mnt.mnt_flags & MNT_DOOMED)) {
-		rcu_read_unlock();
 		unlock_mount_hash();
 		return;
 	}
 	mnt->mnt.mnt_flags |= MNT_DOOMED;
-	rcu_read_unlock();
 
 	list_del(&mnt->mnt_instance);