From: Chuck Lever <chuck.lever@oracle.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>, Neil Brown <neilb@suse.de>,
	Olga Kornievskaia <kolga@netapp.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
	Ingo Molnar <mingo@redhat.com>,
	linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: nfsd_copy_write_verifier: wrong usage of read_seqbegin_or_lock()
Date: Wed, 25 Oct 2023 13:57:35 -0400
Message-ID: <ZTlXD/hQAVQMKfaE@tissot.1015granger.net>
In-Reply-To: <20231025173931.GA29779@redhat.com>

On Wed, Oct 25, 2023 at 07:39:31PM +0200, Oleg Nesterov wrote:
> Hi Chuck,
> 
> Thanks for your reply. But I am already sleeping and I can't understand it.

I was responding to "I can not understand the intent." But also I
was hoping that explanation would help you provide a correct
replacement for the existing code.


> So let me ask a couple of questions.
> 
> 1. Do you agree that the current nfsd_copy_write_verifier() code makes no sense?

Probably.


>    I mean, the usage of read_seqbegin_or_lock() suggests that if the lockless
>    pass fails it should take writeverf_lock for writing. But this can't happen,
>    and thus this code doesn't look right no matter what. None of the
>    read_seqbegin_or_lock/need_seqretry/done_seqretry helpers make any sense
>    because "seq" is always even.
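
A minimal userspace sketch of the parity argument quoted above, for
illustration only. The toy_* names and the single-threaded "interfering
writer" are assumptions made up for this example, not kernel code; only
the even/odd dispatch is meant to mirror read_seqbegin_or_lock() and
need_seqretry().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model: only the parity logic is reproduced. */
struct toy_seqlock {
	unsigned sequence;	/* even means no writer is mid-update */
	int excl_acquired;	/* how many times the locked path was taken */
};

/* Even *seq: lockless pass, snapshot the counter. Odd *seq: locked pass. */
static void toy_read_seqbegin_or_lock(struct toy_seqlock *sl, int *seq)
{
	if (!(*seq & 1)) {
		assert(!(sl->sequence & 1));	/* toy writers always finish */
		*seq = (int)sl->sequence;
	} else {
		sl->excl_acquired++;
	}
}

/* A retry is only ever requested after a lockless (even) pass. */
static bool toy_need_seqretry(const struct toy_seqlock *sl, int seq)
{
	return !(seq & 1) && sl->sequence != (unsigned)seq;
}

/*
 * Same loop shape as nfsd_copy_write_verifier(). interfering_writes
 * simulates writers that complete during lockless passes. With
 * make_odd == 0 the value handed to the helper is even on every pass,
 * as in the current code; with make_odd != 0 the second pass is odd.
 * Returns the number of passes the loop needed.
 */
static int toy_copy_loop(struct toy_seqlock *sl, int interfering_writes,
			 int make_odd)
{
	int seq, nextseq = 0, passes = 0;

	do {
		seq = nextseq;
		toy_read_seqbegin_or_lock(sl, &seq);
		passes++;
		/* Writers can only interfere with a lockless pass. */
		if (!(seq & 1) && interfering_writes > 0) {
			sl->sequence += 2;	/* one full write: begin + end */
			interfering_writes--;
		}
		if (make_odd)
			nextseq = 1;
	} while (toy_need_seqretry(sl, seq));

	return passes;
}
```

With make_odd == 0 the toy loop keeps retrying locklessly for as long as
writers interfere and excl_acquired stays at zero. With make_odd != 0 the
second pass takes the locked path and the loop ends there.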

> 2. If yes, which change do you prefer? I'd prefer the patch at the end.

Based on my limited understanding of read_seqbegin(), the patch at
the end seems cleanest and is on-point. Please post an official
version of that to linux-nfs@ with a full patch description, and
I'll see that it gets into v6.8-rc with proper tags, review, and
testing.
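
For readers unfamiliar with the read_seqbegin()/read_seqretry() pattern
the patch switches to, here is a rough userspace analogue built on C11
atomics. It is a sketch under stated assumptions, not the kernel
implementation: the toy_* names are hypothetical, a single writer is
assumed (the kernel's seqlock_t also takes a spinlock on the write side),
and the cookie is held as two atomic 32-bit words so that the example has
no data race in the C11 sense.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct toy_verf {
	atomic_uint seq;		/* odd while a writer is mid-update */
	_Atomic uint32_t word[2];	/* the 8-byte cookie, as two halves */
};

/* Wait until no writer is active, then return the (even) counter. */
static unsigned toy_read_begin(struct toy_verf *v)
{
	unsigned s;

	while ((s = atomic_load_explicit(&v->seq, memory_order_acquire)) & 1)
		;	/* writer in progress */
	return s;
}

/* Nonzero if a writer ran since toy_read_begin() returned start. */
static int toy_read_retry(struct toy_verf *v, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&v->seq, memory_order_relaxed) != start;
}

/* Reader: never blocks writers, retries until it gets an untorn copy. */
static void toy_copy_verf(struct toy_verf *v, uint32_t out[2])
{
	unsigned seq;

	do {
		seq = toy_read_begin(v);
		out[0] = atomic_load_explicit(&v->word[0], memory_order_relaxed);
		out[1] = atomic_load_explicit(&v->word[1], memory_order_relaxed);
	} while (toy_read_retry(v, seq));
}

/* Writer: assumes it is the only writer (assumption of this sketch). */
static void toy_reset_verf(struct toy_verf *v, uint32_t a, uint32_t b)
{
	unsigned s = atomic_load_explicit(&v->seq, memory_order_relaxed);

	assert(!(s & 1));	/* no other writer may be active */
	atomic_store_explicit(&v->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&v->word[0], a, memory_order_relaxed);
	atomic_store_explicit(&v->word[1], b, memory_order_relaxed);
	atomic_store_explicit(&v->seq, s + 2, memory_order_release);
}
```

The reader loop has the same shape as the patched function: snapshot the
counter, copy the two halves, and go around again only if the counter
moved in the meantime.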


> Oleg.
> 
> On 10/25, Chuck Lever wrote:
> >
> > On Wed, Oct 25, 2023 at 06:30:06PM +0200, Oleg Nesterov wrote:
> > > Hello,
> > >
> > > The usage of writeverf_lock is wrong and misleading no matter what and
> > > I can not understand the intent.
> >
> > The structure of the seqlock was introduced in commit 27c438f53e79
> > ("nfsd: Support the server resetting the boot verifier").
> >
> > The NFS write verifier is an 8-byte cookie that is supposed to
> > indicate the boot epoch of the server -- simply put, when the server
> > restarts, the epoch (and this verifier) changes.
> >
> > NFSv3 and later have a two-phase write scheme where the client
> > sends data to the server (known as an UNSTABLE WRITE), then later
> > asks the server to commit that data (a COMMIT). Before the COMMIT,
> > that data is not durable and the client must hold onto it until
> > the server's COMMIT Reply indicates it's safe for the client to
> > discard that data and move on.
> >
> > When an UNSTABLE WRITE is done, the server reports its current
> > epoch as part of each WRITE Reply. If this verifier cookie changes,
> > the client knows that the server might have lost previously
> > written-but-uncommitted data, so it must send the WRITEs
> > again in that (rare) case.
> >
> > NFSD abuses this slightly by changing the write verifier whenever
> > there is an underlying local write error that might have occurred in
> > the background (ie, there was no WRITE or COMMIT operation at the
> > time that the server could use to convey the error back to the
> > client). This is supposed to trigger clients to send UNSTABLE WRITEs
> > again to ensure that data is properly committed to durable storage.
> >
> > The point of the seqlock is to ensure that
> >
> > a) a write verifier update does not tear the verifier
> > b) a write verifier read does not see a torn verifier
> >
> > This is a hot path, so we don't want a full spinlock to achieve
> > a) and b).
> >
> > Way back when, the verifier was updated by two separate 32-bit
> > stores; hence the risk of tearing.
> >
> >
> > > nfsd_copy_write_verifier() uses read_seqbegin_or_lock() incorrectly.
> > > "seq" is always even, so read_seqbegin_or_lock() can never take the
> > > lock for writing. We need to make the counter odd for the 2nd round:
> > >
> > > 	--- a/fs/nfsd/nfssvc.c
> > > 	+++ b/fs/nfsd/nfssvc.c
> > > 	@@ -359,11 +359,14 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > > 	  */
> > > 	 void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > > 	 {
> > > 	-	int seq = 0;
> > > 	+	int seq, nextseq = 0;
> > >
> > > 		do {
> > > 	+		seq = nextseq;
> > > 			read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > > 			memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > 	+		/* If lockless access failed, take the lock. */
> > > 	+		nextseq = 1;
> > > 		} while (need_seqretry(&nn->writeverf_lock, seq));
> > > 		done_seqretry(&nn->writeverf_lock, seq);
> > > 	 }
> > >
> > > OTOH. This function just copies 8 bytes, this makes me think that it doesn't
> > > need the conditional locking and read_seqbegin_or_lock() at all. So perhaps
> > > the (untested) patch below makes more sense? Please note that it should not
> > > change the current behaviour, it just makes the code look correct (and more
> > > optimal but this is minor).
> > >
> > > Another question is why we can't simply turn nn->writeverf into seqcount_t.
> > > I guess we can't because nfsd_reset_write_verifier() needs spin_lock() to
> > > serialise with itself, right?
> >
> > "reset" is supposed to be a very rare operation. Using a lock in that
> > case is probably quite acceptable, as long as reading the verifier
> > is wait-free and guaranteed to be untorn.
> >
> > But a seqcount_t is only 32 bits.
> >
> >
> > > Oleg.
> > > ---
> > >
> > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > index c7af1095f6b5..094b765c5397 100644
> > > --- a/fs/nfsd/nfssvc.c
> > > +++ b/fs/nfsd/nfssvc.c
> > > @@ -359,13 +359,12 @@ static bool nfsd_needs_lockd(struct nfsd_net *nn)
> > >   */
> > >  void nfsd_copy_write_verifier(__be32 verf[2], struct nfsd_net *nn)
> > >  {
> > > -	int seq = 0;
> > > +	unsigned seq;
> > >
> > >  	do {
> > > -		read_seqbegin_or_lock(&nn->writeverf_lock, &seq);
> > > +		seq = read_seqbegin(&nn->writeverf_lock);
> > >  		memcpy(verf, nn->writeverf, sizeof(nn->writeverf));
> > > -	} while (need_seqretry(&nn->writeverf_lock, seq));
> > > -	done_seqretry(&nn->writeverf_lock, seq);
> > > +	} while (read_seqretry(&nn->writeverf_lock, seq));
> > >  }
> > >
> > >  static void nfsd_reset_write_verifier_locked(struct nfsd_net *nn)
> > >
> >
> > --
> > Chuck Lever
> >
> 
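
The client side of the write-verifier scheme described in the quoted
explanation above can be sketched roughly as follows. This is a
simplified illustration, not code from any NFS client: the toy_* names
and the struct layout are hypothetical, and a real client tracks
verifiers per request rather than keeping one cookie.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_VERF_SIZE 8		/* the write verifier is an 8-byte cookie */

struct toy_client {
	uint8_t write_verf[TOY_VERF_SIZE];	/* from the WRITE replies */
	bool have_uncommitted;			/* data not yet COMMITted */
};

/* Remember the verifier the server reported for an UNSTABLE WRITE. */
static void toy_on_unstable_write_reply(struct toy_client *c,
					const uint8_t verf[TOY_VERF_SIZE])
{
	memcpy(c->write_verf, verf, TOY_VERF_SIZE);
	c->have_uncommitted = true;
}

/*
 * On a COMMIT reply: a changed verifier means the server may have lost
 * the uncommitted data, so the client has to send the WRITEs again.
 */
static bool toy_commit_needs_resend(const struct toy_client *c,
				    const uint8_t commit_verf[TOY_VERF_SIZE])
{
	return c->have_uncommitted &&
	       memcmp(c->write_verf, commit_verf, TOY_VERF_SIZE) != 0;
}
```

A torn verifier on the server side would look to this comparison like a
changed verifier and force a needless resend, which is why the read path
must return an untorn copy.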

-- 
Chuck Lever
