All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chuck Lever <chuck.lever@oracle.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>, Neil Brown <neilb@suse.de>,
	Olga Kornievskaia <kolga@netapp.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	Ingo Molnar <mingo@redhat.com>,
	linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock()
Date: Wed, 25 Oct 2023 13:00:10 -0400	[thread overview]
Message-ID: <ZTlJmuDpGE+U3pEF@tissot.1015granger.net> (raw)
In-Reply-To: <20231025163006.GA8279@redhat.com>

On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> Hello,
> 
> The usage of writeverf_lock is wrong and misleading no matter what and
> I can not understand the intent.

The structure of the seqlock was introduced in commit 27c438f53e79
("nfsd: Support the server resetting the boot verifier").

The NFS write verifier is an 8-byte cookie that is supposed to
indicate the boot epoch of the server -- simply put, when the server
restarts, the epoch (and this verifier) changes.

NFSv3 and later have a two-phase write scheme where the client
sends data to the server (known as an UNSTABLE WRITE), then later
asks the server to commit that data (a COMMIT). Before the COMMIT,
that data is not durable and the client must hold onto it until
the server's COMMIT Reply indicates it's safe for the client to
discard that data and move on.

When an UNSTABLE WRITE is done, the server reports its current
epoch as part of each WRITE Reply. If this verifier cookie changes,
the client knows that the server might have lost previously
written written-but-uncommitted data, so it must send the WRITEs
again in that (rare) case.

NFSD abuses this slightly by changing the write verifier whenever
there is an underlying local write error that might have occurred in
the background (ie, there was no WRITE or COMMIT operation at the
time that the server could use to convey the error back to the
client). This is supposed to trigger clients to send UNSTABLE WRITEs
again to ensure that data is properly committed to durable storage.

The point of the seqlock is to ensure that

a) a write verifier update does not tear the verifier
b) a write verifier read does not see a torn verifier

This is a hot path, so we don't want a full spinlock to achieve
a) and b).

Way back when, the verifier was updated by two separate 32-bit
stores; hence the risk of tearing.


> nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> "seq" is always even, so read_seqbegin_or_lock() can never take the
> lock for writing. We need to make the counter odd for the 2nd round:
> 
> 	--- a/fs/nfsd/nfssvc.c
> 	+++ b/fs/nfsd/nfssvc.c
> 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> 	  */
> 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> 	 {
> 	-	int seq = 0;
> 	+	int seq, nextseq = 0;
> 	 
> 		do {
> 	+		seq = nextseq;
> 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> 	+		/* If lockless access failed, take the lock. */
> 	+		nextseq = 1;
> 		} while (need_seqretry(&nn->writeverf_lock, seq));
> 		done_seqretry(&nn->writeverf_lock, seq);
> 	 }
> 
> OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> the (untested) patch below makes more sense? Please note that it should not
> change the current behaviour, it just makes the code look correct (and more
> optimal but this is minor).
> 
> Another question is why we can't simply turn nn->writeverf into seqcount_t.
> I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> serialise with itself, right?

"reset" is supposed to be very rare operation. Using a lock in that
case is probably quite acceptable, as long as reading the verifier
is wait-free and guaranteed to be untorn.

But a seqcount_t is only 32 bits.


> Oleg.
> ---
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index c7af1095f6b5..094b765c5397 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
>   */
>  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
>  {
> -	int seq = 0;
> +	unsigned seq;
>  
>  	do {
> -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> +		seq = read_seqbegin(&nn->writeverf_lock);
>  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> -	} while (need_seqretry(&nn->writeverf_lock, seq));
> -	done_seqretry(&nn->writeverf_lock, seq);
> +	} while (read_seqretry(&nn->writeverf_lock, seq));
>  }
>  
>  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> 

-- 
Chuck Lever

  reply	other threads:[~2023-10-25 17:00 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-25 16:30 nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock() Oleg Nesterov
2023-10-25 17:00 ` Chuck Lever [this message]
2023-10-25 17:39   ` Oleg Nesterov
2023-10-25 17:47     ` Oleg Nesterov
2023-10-25 17:57     ` Chuck Lever
2023-10-25 18:10       ` Oleg Nesterov
2023-10-25 17:54   ` Oleg Nesterov
2023-10-25 18:07     ` Chuck Lever
2023-10-25 18:19       ` Oleg Nesterov
2023-10-26 14:50 ` [PATCH] nfsd_copy_write_verifier: use read_seqbegin() rather than read_seqbegin_or_lock() Oleg Nesterov
     [not found]   ` <ZTvc0Z6DJEYXI/TL@tissot.1015granger.net>
2023-10-27 19:34     ` Oleg Nesterov
2023-10-27 19:40       ` Chuck Lever III
2023-10-27 20:28   ` Jeff Layton
2023-10-27 22:52   ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZTlJmuDpGE+U3pEF@tissot.1015granger.net \
    --to=chuck.lever@oracle.com \
    --cc=Dai.Ngo@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=kolga@netapp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=neilb@suse.de \
    --cc=oleg@redhat.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.