Re: nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock()

From: Chuck Lever <chuck.lever@oracle.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>, Neil Brown <neilb@suse.de>,
	Olga Kornievskaia <kolga@netapp.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	Ingo Molnar <mingo@redhat.com>,
	linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock()
Date: Wed, 25 Oct 2023 13:00:10 -0400
Message-ID: <ZTlJmuDpGE+U3pEF@tissot.1015granger.net>
In-Reply-To: <20231025163006.GA8279@redhat.com>

On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> Hello,
> 
> The usage of writeverf_lock is wrong and misleading no matter what and
> I can not understand the intent.

The structure of the seqlock was introduced in commit 27c438f53e79
("nfsd: Support the server resetting the boot verifier").

The NFS write verifier is an 8-byte cookie that is supposed to
indicate the boot epoch of the server -- simply put, when the server
restarts, the epoch (and this verifier) changes.

NFSv3 and later have a two-phase write scheme where the client
sends data to the server (known as an UNSTABLE WRITE), then later
asks the server to commit that data (a COMMIT). Before the COMMIT,
that data is not durable and the client must hold onto it until
the server's COMMIT Reply indicates it's safe for the client to
discard that data and move on.

When an UNSTABLE WRITE is done, the server reports its current
epoch as part of each WRITE Reply. If this verifier cookie changes,
the client knows that the server might have lost previously
written but uncommitted data, so it must send the WRITEs
again in that (rare) case.
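
To make the client side of that concrete, here is a minimal sketch
in C. The names (struct pending_write, note_write_reply,
commit_reply_ok) are made up for illustration and are not taken from
any real NFS client. It shows only the verifier comparison, not the
RPC handling around it.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VERF_SIZE 8	/* the write verifier is an 8-byte cookie */

/* Hypothetical client-side record for data sent via UNSTABLE WRITE. */
struct pending_write {
	uint8_t	verf[VERF_SIZE];	/* verifier from the WRITE reply */
	bool	must_resend;		/* set when the server epoch changed */
};

/* Remember the verifier the server returned in its WRITE reply. */
static void note_write_reply(struct pending_write *pw,
			     const uint8_t verf[VERF_SIZE])
{
	memcpy(pw->verf, verf, VERF_SIZE);
	pw->must_resend = false;
}

/*
 * Compare the verifier in the COMMIT reply with the one seen at WRITE
 * time. Returns true if the client may discard its copy of the data;
 * returns false and flags a resend if the verifier changed.
 */
static bool commit_reply_ok(struct pending_write *pw,
			    const uint8_t commit_verf[VERF_SIZE])
{
	if (memcmp(pw->verf, commit_verf, VERF_SIZE) != 0) {
		pw->must_resend = true;
		return false;
	}
	return true;
}
```

The verifier is treated as an opaque cookie here: the client only
tests it for equality and never interprets its contents.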

NFSD abuses this slightly by changing the write verifier whenever
there is an underlying local write error that might have occurred in
the background (i.e., there was no WRITE or COMMIT operation at the
time that the server could use to convey the error back to the
client). This is supposed to trigger clients to send UNSTABLE WRITEs
again to ensure that data is properly committed to durable storage.

The point of the seqlock is to ensure that

a) a write verifier update does not tear the verifier
b) a write verifier read does not see a torn verifier

This is a hot path, so we don't want a full spinlock to achieve
a) and b).

Way back when, the verifier was updated by two separate 32-bit
stores; hence the risk of tearing.
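
For anyone following along who has not looked at seqlocks before,
here is a rough userspace sketch of the idea using C11 atomics. It
is not the kernel's seqlock_t implementation; struct verf_seq,
verf_write and verf_read are hypothetical names, and the sketch
assumes writers are already serialized by the caller (the kernel's
seqlock_t pairs the counter with a spinlock for that purpose).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical container: a sequence counter guarding two 32-bit halves. */
struct verf_seq {
	atomic_uint		seq;	 /* even: stable, odd: update in progress */
	_Atomic uint32_t	verf[2]; /* the verifier, stored as two words */
};

/* Writer: callers must serialize calls to this function themselves. */
static void verf_write(struct verf_seq *v, uint32_t w0, uint32_t w1)
{
	atomic_fetch_add(&v->seq, 1);	/* counter becomes odd */
	atomic_store(&v->verf[0], w0);
	atomic_store(&v->verf[1], w1);
	atomic_fetch_add(&v->seq, 1);	/* counter becomes even again */
}

/* Reader: takes no lock; retries if an update overlapped the copy. */
static void verf_read(struct verf_seq *v, uint32_t out[2])
{
	unsigned int start;

	do {
		start = atomic_load(&v->seq);
		out[0] = atomic_load(&v->verf[0]);
		out[1] = atomic_load(&v->verf[1]);
		/* Retry if a write was in flight or completed meanwhile. */
	} while ((start & 1) || atomic_load(&v->seq) != start);
}
```

The sketch uses the default sequentially consistent atomics to keep
the ordering argument simple; a tuned version would use weaker
orderings plus explicit fences. The reader never blocks the writer,
and it only returns a copy taken while the counter was even and
unchanged, so it cannot return one half from each of two updates.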

> nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> "seq" is always even, so read_seqbegin_or_lock() can never take the
> lock for writing. We need to make the counter odd for the 2nd round:
> 
> 	--- a/fs/nfsd/nfssvc.c
> 	+++ b/fs/nfsd/nfssvc.c
> 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> 	  */
> 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> 	 {
> 	-	int seq = 0;
> 	+	int seq, nextseq = 0;
> 	 
> 		do {
> 	+		seq = nextseq;
> 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> 	+		/* If lockless access failed, take the lock. */
> 	+		nextseq = 1;
> 		} while (need_seqretry(&nn->writeverf_lock, seq));
> 		done_seqretry(&nn->writeverf_lock, seq);
> 	 }
> 
> OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> the (untested) patch below makes more sense? Please note that it should not
> change the current behaviour, it just makes the code look correct (and more
> optimal but this is minor).
> 
> Another question is why we can't simply turn nn->writeverf into seqcount_t.
> I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> serialise with itself, right?

"reset" is supposed to be a very rare operation. Using a lock in that
case is probably quite acceptable, as long as reading the verifier
is wait-free and guaranteed to be untorn.

But a seqcount_t is only 32 bits.

> Oleg.
> ---
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index c7af1095f6b5..094b765c5397 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
>   */
>  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
>  {
> -	int seq = 0;
> +	unsigned seq;
>  
>  	do {
> -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> +		seq = read_seqbegin(&nn->writeverf_lock);
>  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> -	} while (need_seqretry(&nn->writeverf_lock, seq));
> -	done_seqretry(&nn->writeverf_lock, seq);
> +	} while (read_seqretry(&nn->writeverf_lock, seq));
>  }
>  
>  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> 

-- 
Chuck Lever