All of lore.kernel.org
* [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
@ 2022-02-07  5:03 xiubli
  2022-02-07 15:12 ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: xiubli @ 2022-02-07  5:03 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

If the MDS returns ESTALE, it means the MDS has already iterated
over all the possible active MDSes, including the auth MDS, or the
inode is under purging. There is no need to retry with the auth MDS;
just return ESTALE directly.

Otherwise this can cause an infinite loop retrying the request.

URL: https://tracker.ceph.com/issues/53504
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/mds_client.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 93e5e3c4ba64..c918d2ac8272 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 
 	result = le32_to_cpu(head->result);
 
-	/*
-	 * Handle an ESTALE
-	 * if we're not talking to the authority, send to them
-	 * if the authority has changed while we weren't looking,
-	 * send to new authority
-	 * Otherwise we just have to return an ESTALE
-	 */
-	if (result == -ESTALE) {
-		dout("got ESTALE on request %llu\n", req->r_tid);
-		req->r_resend_mds = -1;
-		if (req->r_direct_mode != USE_AUTH_MDS) {
-			dout("not using auth, setting for that now\n");
-			req->r_direct_mode = USE_AUTH_MDS;
-			__do_request(mdsc, req);
-			mutex_unlock(&mdsc->mutex);
-			goto out;
-		} else  {
-			int mds = __choose_mds(mdsc, req, NULL);
-			if (mds >= 0 && mds != req->r_session->s_mds) {
-				dout("but auth changed, so resending\n");
-				__do_request(mdsc, req);
-				mutex_unlock(&mdsc->mutex);
-				goto out;
-			}
-		}
-		dout("have to return ESTALE on request %llu\n", req->r_tid);
-	}
-
-
 	if (head->safe) {
 		set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
 		__unregister_request(mdsc, req);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07  5:03 [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE xiubli
@ 2022-02-07 15:12 ` Jeff Layton
  2022-02-07 15:40   ` Gregory Farnum
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jeff Layton @ 2022-02-07 15:12 UTC (permalink / raw)
  To: xiubli; +Cc: idryomov, vshankar, ceph-devel, Sage Weil, Gregory Farnum, ukernel

On Mon, 2022-02-07 at 13:03 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> If MDS return ESTALE, that means the MDS has already iterated all
> the possible active MDSes including the auth MDS or the inode is
> under purging. No need to retry in auth MDS and will just return
> ESTALE directly.
> 

When you say "purging" here, do you mean that it's effectively being
cleaned up after being unlinked? Or is it just being purged from the
MDS's cache?

> Or it will cause definite loop for retrying it.
> 
> URL: https://tracker.ceph.com/issues/53504
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/mds_client.c | 29 -----------------------------
>  1 file changed, 29 deletions(-)
> 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 93e5e3c4ba64..c918d2ac8272 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>  
>  	result = le32_to_cpu(head->result);
>  
> -	/*
> -	 * Handle an ESTALE
> -	 * if we're not talking to the authority, send to them
> -	 * if the authority has changed while we weren't looking,
> -	 * send to new authority
> -	 * Otherwise we just have to return an ESTALE
> -	 */
> -	if (result == -ESTALE) {
> -		dout("got ESTALE on request %llu\n", req->r_tid);
> -		req->r_resend_mds = -1;
> -		if (req->r_direct_mode != USE_AUTH_MDS) {
> -			dout("not using auth, setting for that now\n");
> -			req->r_direct_mode = USE_AUTH_MDS;
> -			__do_request(mdsc, req);
> -			mutex_unlock(&mdsc->mutex);
> -			goto out;
> -		} else  {
> -			int mds = __choose_mds(mdsc, req, NULL);
> -			if (mds >= 0 && mds != req->r_session->s_mds) {
> -				dout("but auth changed, so resending\n");
> -				__do_request(mdsc, req);
> -				mutex_unlock(&mdsc->mutex);
> -				goto out;
> -			}
> -		}
> -		dout("have to return ESTALE on request %llu\n", req->r_tid);
> -	}
> -
> -
>  	if (head->safe) {
>  		set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
>  		__unregister_request(mdsc, req);


(cc'ing Greg, Sage and Zheng)

This patch sort of contradicts the original design, AFAICT, and I'm not
sure what the correct behavior should be. I could use some
clarification.

The original code (from the 2009 merge) would tolerate 2 ESTALEs before
giving up and returning that to userland. Then in 2010, Greg added this
commit:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f

...which would presumably make it retry indefinitely as long as the auth
MDS kept changing. Then, Zheng made this change in 2013:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938

...which seems to try to do the same thing, but detected the auth mds
change in a different way.

Is that where livelock detection was broken? Or was there some
corresponding change to __choose_mds that should prevent infinitely
looping on the same request?

In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
and that the server has forgotten about it. Does it mean the same thing
to the ceph MDS?

Has the behavior of the MDS changed such that these retries are no
longer necessary on an ESTALE? If so, when did this change, and does the
client need to do anything to detect what behavior it should be using?
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07 15:12 ` Jeff Layton
@ 2022-02-07 15:40   ` Gregory Farnum
  2022-02-07 16:13     ` Jeff Layton
  2022-02-08  7:56   ` Xiubo Li
  2022-02-09  1:56   ` Xiubo Li
  2 siblings, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2022-02-07 15:40 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Xiubo Li, Ilya Dryomov, Venky Shankar, ceph-devel, Sage Weil, ukernel

On Mon, Feb 7, 2022 at 7:12 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2022-02-07 at 13:03 +0800, xiubli@redhat.com wrote:
> > From: Xiubo Li <xiubli@redhat.com>
> >
> > If MDS return ESTALE, that means the MDS has already iterated all
> > the possible active MDSes including the auth MDS or the inode is
> > under purging. No need to retry in auth MDS and will just return
> > ESTALE directly.
> >
>
> When you say "purging" here, do you mean that it's effectively being
> cleaned up after being unlinked? Or is it just being purged from the
> MDS's cache?
>
> > Or it will cause definite loop for retrying it.
> >
> > URL: https://tracker.ceph.com/issues/53504
> > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > ---
> >  fs/ceph/mds_client.c | 29 -----------------------------
> >  1 file changed, 29 deletions(-)
> >
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 93e5e3c4ba64..c918d2ac8272 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> >
> >       result = le32_to_cpu(head->result);
> >
> > -     /*
> > -      * Handle an ESTALE
> > -      * if we're not talking to the authority, send to them
> > -      * if the authority has changed while we weren't looking,
> > -      * send to new authority
> > -      * Otherwise we just have to return an ESTALE
> > -      */
> > -     if (result == -ESTALE) {
> > -             dout("got ESTALE on request %llu\n", req->r_tid);
> > -             req->r_resend_mds = -1;
> > -             if (req->r_direct_mode != USE_AUTH_MDS) {
> > -                     dout("not using auth, setting for that now\n");
> > -                     req->r_direct_mode = USE_AUTH_MDS;
> > -                     __do_request(mdsc, req);
> > -                     mutex_unlock(&mdsc->mutex);
> > -                     goto out;
> > -             } else  {
> > -                     int mds = __choose_mds(mdsc, req, NULL);
> > -                     if (mds >= 0 && mds != req->r_session->s_mds) {
> > -                             dout("but auth changed, so resending\n");
> > -                             __do_request(mdsc, req);
> > -                             mutex_unlock(&mdsc->mutex);
> > -                             goto out;
> > -                     }
> > -             }
> > -             dout("have to return ESTALE on request %llu\n", req->r_tid);
> > -     }
> > -
> > -
> >       if (head->safe) {
> >               set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
> >               __unregister_request(mdsc, req);
>
>
> (cc'ing Greg, Sage and Zheng)
>
> This patch sort of contradicts the original design, AFAICT, and I'm not
> sure what the correct behavior should be. I could use some
> clarification.
>
> The original code (from the 2009 merge) would tolerate 2 ESTALEs before
> giving up and returning that to userland. Then in 2010, Greg added this
> commit:
>
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f
>
> ...which would presumably make it retry indefinitely as long as the auth
> MDS kept changing. Then, Zheng made this change in 2013:
>
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938
>
> ...which seems to try to do the same thing, but detected the auth mds
> change in a different way.
>
> Is that where livelock detection was broken? Or was there some
> corresponding change to __choose_mds that should prevent infinitely
> looping on the same request?
>
> In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
> and that the server has forgotten about it. Does it mean the same thing
> to the ceph MDS?

This used to get returned if the MDS couldn't find the inode number in
question. This was not possible in most cases because if the client
has caps on the inode, it's pinned in the MDS cache, but it was
possible when NFS was layered on top (and possibly in some other edge
cases where clients can operate on inode numbers they've saved from a
previous lookup?).

>
> Has the behavior of the MDS changed such that these retries are no
> longer necessary on an ESTALE? If so, when did this change, and does the
> client need to do anything to detect what behavior it should be using?

Well, I see that CEPHFS_ESTALE is still returned sometimes from the
MDS, so somebody will need to audit those cases, but the MDS has
definitely changed. These days we can look up an unknown inode using
the (directory) backtrace we store on its first RADOS object, and the
MDS does so (at least some of the time? But I think everywhere
relevant). That happened when we first added scrub, circa 2014?
Previously, if the inode wasn't in cache, we just had no way of
finding it.
-Greg



* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07 15:40   ` Gregory Farnum
@ 2022-02-07 16:13     ` Jeff Layton
  2022-02-07 16:28       ` Gregory Farnum
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2022-02-07 16:13 UTC (permalink / raw)
  To: Gregory Farnum, Dan van der Ster
  Cc: Xiubo Li, Ilya Dryomov, Venky Shankar, ceph-devel, Sage Weil, ukernel

On Mon, 2022-02-07 at 07:40 -0800, Gregory Farnum wrote:
> On Mon, Feb 7, 2022 at 7:12 AM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Mon, 2022-02-07 at 13:03 +0800, xiubli@redhat.com wrote:
> > > From: Xiubo Li <xiubli@redhat.com>
> > > 
> > > If MDS return ESTALE, that means the MDS has already iterated all
> > > the possible active MDSes including the auth MDS or the inode is
> > > under purging. No need to retry in auth MDS and will just return
> > > ESTALE directly.
> > > 
> > 
> > When you say "purging" here, do you mean that it's effectively being
> > cleaned up after being unlinked? Or is it just being purged from the
> > MDS's cache?
> > 
> > > Or it will cause definite loop for retrying it.
> > > 
> > > URL: https://tracker.ceph.com/issues/53504
> > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > ---
> > >  fs/ceph/mds_client.c | 29 -----------------------------
> > >  1 file changed, 29 deletions(-)
> > > 
> > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > index 93e5e3c4ba64..c918d2ac8272 100644
> > > --- a/fs/ceph/mds_client.c
> > > +++ b/fs/ceph/mds_client.c
> > > @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> > > 
> > >       result = le32_to_cpu(head->result);
> > > 
> > > -     /*
> > > -      * Handle an ESTALE
> > > -      * if we're not talking to the authority, send to them
> > > -      * if the authority has changed while we weren't looking,
> > > -      * send to new authority
> > > -      * Otherwise we just have to return an ESTALE
> > > -      */
> > > -     if (result == -ESTALE) {
> > > -             dout("got ESTALE on request %llu\n", req->r_tid);
> > > -             req->r_resend_mds = -1;
> > > -             if (req->r_direct_mode != USE_AUTH_MDS) {
> > > -                     dout("not using auth, setting for that now\n");
> > > -                     req->r_direct_mode = USE_AUTH_MDS;
> > > -                     __do_request(mdsc, req);
> > > -                     mutex_unlock(&mdsc->mutex);
> > > -                     goto out;
> > > -             } else  {
> > > -                     int mds = __choose_mds(mdsc, req, NULL);
> > > -                     if (mds >= 0 && mds != req->r_session->s_mds) {
> > > -                             dout("but auth changed, so resending\n");
> > > -                             __do_request(mdsc, req);
> > > -                             mutex_unlock(&mdsc->mutex);
> > > -                             goto out;
> > > -                     }
> > > -             }
> > > -             dout("have to return ESTALE on request %llu\n", req->r_tid);
> > > -     }
> > > -
> > > -
> > >       if (head->safe) {
> > >               set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
> > >               __unregister_request(mdsc, req);
> > 
> > 
> > (cc'ing Greg, Sage and Zheng)
> > 
> > This patch sort of contradicts the original design, AFAICT, and I'm not
> > sure what the correct behavior should be. I could use some
> > clarification.
> > 
> > The original code (from the 2009 merge) would tolerate 2 ESTALEs before
> > giving up and returning that to userland. Then in 2010, Greg added this
> > commit:
> > 
> >     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f
> > 
> > ...which would presumably make it retry indefinitely as long as the auth
> > MDS kept changing. Then, Zheng made this change in 2013:
> > 
> >     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938
> > 
> > ...which seems to try to do the same thing, but detected the auth mds
> > change in a different way.
> > 
> > Is that where livelock detection was broken? Or was there some
> > corresponding change to __choose_mds that should prevent infinitely
> > looping on the same request?
> > 
> > In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
> > and that the server has forgotten about it. Does it mean the same thing
> > to the ceph MDS?
> 
> This used to get returned if the MDS couldn't find the inode number in
> question, because . This was not possible in most cases because if the
> client has caps on the inode, it's pinned in MDS cache, but was
> possible when NFS was layered on top (and possibly some other edge
> case APIs where clients can operate on inode numbers they've saved
> from a previous lookup?).
> 

The tracker bug mentions that this occurs after an MDS is restarted.
Could this be the result of clients relying on delete-on-last-close
behavior?

IOW, we have a situation where a file is opened and then unlinked, and
userland is actively doing I/O to it. The thing gets moved into the
strays dir, but isn't unlinked yet because we have open files against
it. Everything works fine at this point...

Then, the MDS restarts and the inode gets purged altogether. Client
reconnects and tries to reclaim his open, and gets ESTALE.

NFS clients prevent this by doing something called a silly-rename,
though client crashes can leave silly-renamed files sitting around (you
may have seen the .nfsXXXXXXXXXXXXXX files in some NFS-exported
filesystems).

> > 
> > Has the behavior of the MDS changed such that these retries are no
> > longer necessary on an ESTALE? If so, when did this change, and does the
> > client need to do anything to detect what behavior it should be using?
> 
> Well, I see that CEPHFS_ESTALE is still returned sometimes from the
> MDS, so somebody will need to audit those, but the MDS has definitely
> changed. These days, we can look up an unknown inode using the
> (directory) backtrace we store on its first RADOS object, and it does
> (at least some of the time? But I think everywhere relevant). But that
> happened when we first added scrub circa 2014ish? Previously if the
> inode wasn't in cache, we just had no way of finding it.


Ok, could you send an Acked-by if you think Xiubo's logic is correct?

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07 16:13     ` Jeff Layton
@ 2022-02-07 16:28       ` Gregory Farnum
  2022-02-07 17:11         ` Jeff Layton
  0 siblings, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2022-02-07 16:28 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dan van der Ster, Xiubo Li, Ilya Dryomov, Venky Shankar,
	ceph-devel, Sage Weil, ukernel

On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@kernel.org> wrote:
> The tracker bug mentions that this occurs after an MDS is restarted.
> Could this be the result of clients relying on delete-on-last-close
> behavior?

Oooh, I didn't actually look at the tracker.

>
> IOW, we have a situation where a file is opened and then unlinked, and
> userland is actively doing I/O to it. The thing gets moved into the
> strays dir, but isn't unlinked yet because we have open files against
> it. Everything works fine at this point...
>
> Then, the MDS restarts and the inode gets purged altogether. Client
> reconnects and tries to reclaim his open, and gets ESTALE.

Uh, okay. So I didn't do a proper audit before I sent my previous
reply, but one of the cases I did see was that the MDS returns ESTALE
if you try to do a name lookup on an inode in the stray directory. I
don't know if that's what is happening here or not? But perhaps that's
the root of the problem in this case.

Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
directly so it must indeed be coming out of MDCache::path_traverse.

The MDS shouldn't move an inode into the purge queue on restart unless
there were no clients with caps on it (that state is persisted to disk
so it knows). Maybe if the clients don't make the reconnect window
it's dropping them all and *then* moves it into purge queue? I think
we need to identify what's happening there before we issue kernel
client changes, Xiubo?
-Greg



* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07 16:28       ` Gregory Farnum
@ 2022-02-07 17:11         ` Jeff Layton
  2022-02-09  6:00           ` Xiubo Li
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2022-02-07 17:11 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Dan van der Ster, Xiubo Li, Ilya Dryomov, Venky Shankar,
	ceph-devel, Sage Weil, ukernel

On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
> On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@kernel.org> wrote:
> > The tracker bug mentions that this occurs after an MDS is restarted.
> > Could this be the result of clients relying on delete-on-last-close
> > behavior?
> 
> Oooh, I didn't actually look at the tracker.
> 
> > 
> > IOW, we have a situation where a file is opened and then unlinked, and
> > userland is actively doing I/O to it. The thing gets moved into the
> > strays dir, but isn't unlinked yet because we have open files against
> > it. Everything works fine at this point...
> > 
> > Then, the MDS restarts and the inode gets purged altogether. Client
> > reconnects and tries to reclaim his open, and gets ESTALE.
> 
> Uh, okay. So I didn't do a proper audit before I sent my previous
> reply, but one of the cases I did see was that the MDS returns ESTALE
> if you try to do a name lookup on an inode in the stray directory. I
> don't know if that's what is happening here or not? But perhaps that's
> the root of the problem in this case.
> 
> Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
> directly so it must indeed be coming out of MDCache::path_traverse.
> 
> The MDS shouldn't move an inode into the purge queue on restart unless
> there were no clients with caps on it (that state is persisted to disk
> so it knows). Maybe if the clients don't make the reconnect window
> it's dropping them all and *then* moves it into purge queue? I think
> we need to identify what's happening there before we issue kernel
> client changes, Xiubo?


Agreed. I think we need to understand why he's seeing ESTALE errors in
the first place, but it sounds like retrying on an ESTALE error isn't
likely to be helpful.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07 15:12 ` Jeff Layton
  2022-02-07 15:40   ` Gregory Farnum
@ 2022-02-08  7:56   ` Xiubo Li
  2022-02-08 15:16     ` Gregory Farnum
  2022-02-09  1:56   ` Xiubo Li
  2 siblings, 1 reply; 13+ messages in thread
From: Xiubo Li @ 2022-02-08  7:56 UTC (permalink / raw)
  To: Jeff Layton
  Cc: idryomov, vshankar, ceph-devel, Sage Weil, Gregory Farnum, ukernel


On 2/7/22 11:12 PM, Jeff Layton wrote:
> On Mon, 2022-02-07 at 13:03 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> If MDS return ESTALE, that means the MDS has already iterated all
>> the possible active MDSes including the auth MDS or the inode is
>> under purging. No need to retry in auth MDS and will just return
>> ESTALE directly.
>>
> When you say "purging" here, do you mean that it's effectively being
> cleaned up after being unlinked? Or is it just being purged from the
> MDS's cache?

There is one case: when the client removes a file, or the file is
forcibly overwritten via a rename, the related dentry in the MDS is
put into the stray dir and the inode is marked as under purging. So
if the client then tries a getattr, for example, or retries a pending
getattr request, the MDS can in theory return ESTALE.

Locally I still haven't reproduced it yet.

>> Or it will cause definite loop for retrying it.
>>
>> URL: https://tracker.ceph.com/issues/53504
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/mds_client.c | 29 -----------------------------
>>   1 file changed, 29 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 93e5e3c4ba64..c918d2ac8272 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>>   
>>   	result = le32_to_cpu(head->result);
>>   
>> -	/*
>> -	 * Handle an ESTALE
>> -	 * if we're not talking to the authority, send to them
>> -	 * if the authority has changed while we weren't looking,
>> -	 * send to new authority
>> -	 * Otherwise we just have to return an ESTALE
>> -	 */
>> -	if (result == -ESTALE) {
>> -		dout("got ESTALE on request %llu\n", req->r_tid);
>> -		req->r_resend_mds = -1;
>> -		if (req->r_direct_mode != USE_AUTH_MDS) {
>> -			dout("not using auth, setting for that now\n");
>> -			req->r_direct_mode = USE_AUTH_MDS;
>> -			__do_request(mdsc, req);
>> -			mutex_unlock(&mdsc->mutex);
>> -			goto out;
>> -		} else  {
>> -			int mds = __choose_mds(mdsc, req, NULL);
>> -			if (mds >= 0 && mds != req->r_session->s_mds) {
>> -				dout("but auth changed, so resending\n");
>> -				__do_request(mdsc, req);
>> -				mutex_unlock(&mdsc->mutex);
>> -				goto out;
>> -			}
>> -		}
>> -		dout("have to return ESTALE on request %llu\n", req->r_tid);
>> -	}
>> -
>> -
>>   	if (head->safe) {
>>   		set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
>>   		__unregister_request(mdsc, req);
>
> (cc'ing Greg, Sage and Zheng)
>
> This patch sort of contradicts the original design, AFAICT, and I'm not
> sure what the correct behavior should be. I could use some
> clarification.
>
> The original code (from the 2009 merge) would tolerate 2 ESTALEs before
> giving up and returning that to userland. Then in 2010, Greg added this
> commit:
>
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f
>
> ...which would presumably make it retry indefinitely as long as the auth
> MDS kept changing. Then, Zheng made this change in 2013:
>
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938
>
> ...which seems to try to do the same thing, but detected the auth mds
> change in a different way.
>
> Is that where livelock detection was broken? Or was there some
> corresponding change to __choose_mds that should prevent infinitely
> looping on the same request?
>
> In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
> and that the server has forgotten about it. Does it mean the same thing
> to the ceph MDS?
>
> Has the behavior of the MDS changed such that these retries are no
> longer necessary on an ESTALE? If so, when did this change, and does the
> client need to do anything to detect what behavior it should be using?



* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-08  7:56   ` Xiubo Li
@ 2022-02-08 15:16     ` Gregory Farnum
  2022-02-09  1:35       ` Xiubo Li
  0 siblings, 1 reply; 13+ messages in thread
From: Gregory Farnum @ 2022-02-08 15:16 UTC (permalink / raw)
  To: Xiubo Li
  Cc: Jeff Layton, Ilya Dryomov, Venky Shankar, ceph-devel, Sage Weil, ukernel

On Mon, Feb 7, 2022 at 11:56 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 2/7/22 11:12 PM, Jeff Layton wrote:
> > On Mon, 2022-02-07 at 13:03 +0800, xiubli@redhat.com wrote:
> >> From: Xiubo Li <xiubli@redhat.com>
> >>
> >> If MDS return ESTALE, that means the MDS has already iterated all
> >> the possible active MDSes including the auth MDS or the inode is
> >> under purging. No need to retry in auth MDS and will just return
> >> ESTALE directly.
> >>
> > When you say "purging" here, do you mean that it's effectively being
> > cleaned up after being unlinked? Or is it just being purged from the
> > MDS's cache?
>
> There is one case when the client just removes the file or the file is
> force overrode via renaming, the related dentry in MDS will be put to
> stray dir and the Inode will be marked as under purging.

I'm confused by this statement. Renaming inode B over inode A should
still just be a normal unlink for A, and not trigger an immediate
purge. Is there something in the MDS that force-starts that and breaks
clients who still have the inode held open?
-Greg

> So in case when
> the client try to call getattr, for example, after that, or retries the
> pending getattr request, which can cause the MDS returns ESTALE errno in
> theory.
>
> Locally I still haven't reproduced it yet.
>
> >> Or it will cause definite loop for retrying it.
> >>
> >> URL: https://tracker.ceph.com/issues/53504
> >> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> >> ---
> >>   fs/ceph/mds_client.c | 29 -----------------------------
> >>   1 file changed, 29 deletions(-)
> >>
> >> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >> index 93e5e3c4ba64..c918d2ac8272 100644
> >> --- a/fs/ceph/mds_client.c
> >> +++ b/fs/ceph/mds_client.c
> >> @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> >>
> >>      result = le32_to_cpu(head->result);
> >>
> >> -    /*
> >> -     * Handle an ESTALE
> >> -     * if we're not talking to the authority, send to them
> >> -     * if the authority has changed while we weren't looking,
> >> -     * send to new authority
> >> -     * Otherwise we just have to return an ESTALE
> >> -     */
> >> -    if (result == -ESTALE) {
> >> -            dout("got ESTALE on request %llu\n", req->r_tid);
> >> -            req->r_resend_mds = -1;
> >> -            if (req->r_direct_mode != USE_AUTH_MDS) {
> >> -                    dout("not using auth, setting for that now\n");
> >> -                    req->r_direct_mode = USE_AUTH_MDS;
> >> -                    __do_request(mdsc, req);
> >> -                    mutex_unlock(&mdsc->mutex);
> >> -                    goto out;
> >> -            } else  {
> >> -                    int mds = __choose_mds(mdsc, req, NULL);
> >> -                    if (mds >= 0 && mds != req->r_session->s_mds) {
> >> -                            dout("but auth changed, so resending\n");
> >> -                            __do_request(mdsc, req);
> >> -                            mutex_unlock(&mdsc->mutex);
> >> -                            goto out;
> >> -                    }
> >> -            }
> >> -            dout("have to return ESTALE on request %llu\n", req->r_tid);
> >> -    }
> >> -
> >> -
> >>      if (head->safe) {
> >>              set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
> >>              __unregister_request(mdsc, req);
> >
> > (cc'ing Greg, Sage and Zheng)
> >
> > This patch sort of contradicts the original design, AFAICT, and I'm not
> > sure what the correct behavior should be. I could use some
> > clarification.
> >
> > The original code (from the 2009 merge) would tolerate 2 ESTALEs before
> > giving up and returning that to userland. Then in 2010, Greg added this
> > commit:
> >
> >      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f
> >
> > ...which would presumably make it retry indefinitely as long as the auth
> > MDS kept changing. Then, Zheng made this change in 2013:
> >
> >      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938
> >
> > ...which seems to try to do the same thing, but detected the auth mds
> > change in a different way.
> >
> > Is that where livelock detection was broken? Or was there some
> > corresponding change to __choose_mds that should prevent infinitely
> > looping on the same request?
> >
> > In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
> > and that the server has forgotten about it. Does it mean the same thing
> > to the ceph MDS?
> >
> > Has the behavior of the MDS changed such that these retries are no
> > longer necessary on an ESTALE? If so, when did this change, and does the
> > client need to do anything to detect what behavior it should be using?
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-08 15:16     ` Gregory Farnum
@ 2022-02-09  1:35       ` Xiubo Li
  0 siblings, 0 replies; 13+ messages in thread
From: Xiubo Li @ 2022-02-09  1:35 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Jeff Layton, Ilya Dryomov, Venky Shankar, ceph-devel, Sage Weil, ukernel


On 2/8/22 11:16 PM, Gregory Farnum wrote:
> On Mon, Feb 7, 2022 at 11:56 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 2/7/22 11:12 PM, Jeff Layton wrote:
>>> On Mon, 2022-02-07 at 13:03 +0800, xiubli@redhat.com wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> If MDS return ESTALE, that means the MDS has already iterated all
>>>> the possible active MDSes including the auth MDS or the inode is
>>>> under purging. No need to retry in auth MDS and will just return
>>>> ESTALE directly.
>>>>
>>> When you say "purging" here, do you mean that it's effectively being
>>> cleaned up after being unlinked? Or is it just being purged from the
>>> MDS's cache?
>> There is one case: when the client just removes the file, or the file 
>> is forcibly overwritten via renaming, the related dentry in the MDS 
>> will be put into the stray dir and the inode will be marked as under 
>> purging.
> I'm confused by this statement. Renaming inode B over inode A should
> still just be a normal unlink for A, and not trigger an immediate
> purge. Is there something in the MDS that force-starts that and breaks
> clients who still have the inode held open?

Yeah, you are right. I went through the MDS-related code again: if any 
caps are still held by clients, queuing the inode to the purge queue 
will be deferred.

I will go through the MDS code today and try to figure out whether 
there are any bugs when replaying the journals and reconnecting the 
clients while the MDS is starting.

Thanks.

> -Greg
>
>> So in that case, when the client tries to call getattr after that, 
>> for example, or retries the pending getattr request, the MDS can in 
>> theory return the ESTALE errno.
>>
>> Locally I still haven't reproduced it yet.
>>
>>>> Or it will cause definite loop for retrying it.
>>>>
>>>> URL: https://tracker.ceph.com/issues/53504
>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>> ---
>>>>    fs/ceph/mds_client.c | 29 -----------------------------
>>>>    1 file changed, 29 deletions(-)
>>>>
>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>> index 93e5e3c4ba64..c918d2ac8272 100644
>>>> --- a/fs/ceph/mds_client.c
>>>> +++ b/fs/ceph/mds_client.c
>>>> @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>>>>
>>>>       result = le32_to_cpu(head->result);
>>>>
>>>> -    /*
>>>> -     * Handle an ESTALE
>>>> -     * if we're not talking to the authority, send to them
>>>> -     * if the authority has changed while we weren't looking,
>>>> -     * send to new authority
>>>> -     * Otherwise we just have to return an ESTALE
>>>> -     */
>>>> -    if (result == -ESTALE) {
>>>> -            dout("got ESTALE on request %llu\n", req->r_tid);
>>>> -            req->r_resend_mds = -1;
>>>> -            if (req->r_direct_mode != USE_AUTH_MDS) {
>>>> -                    dout("not using auth, setting for that now\n");
>>>> -                    req->r_direct_mode = USE_AUTH_MDS;
>>>> -                    __do_request(mdsc, req);
>>>> -                    mutex_unlock(&mdsc->mutex);
>>>> -                    goto out;
>>>> -            } else  {
>>>> -                    int mds = __choose_mds(mdsc, req, NULL);
>>>> -                    if (mds >= 0 && mds != req->r_session->s_mds) {
>>>> -                            dout("but auth changed, so resending\n");
>>>> -                            __do_request(mdsc, req);
>>>> -                            mutex_unlock(&mdsc->mutex);
>>>> -                            goto out;
>>>> -                    }
>>>> -            }
>>>> -            dout("have to return ESTALE on request %llu\n", req->r_tid);
>>>> -    }
>>>> -
>>>> -
>>>>       if (head->safe) {
>>>>               set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
>>>>               __unregister_request(mdsc, req);
>>> (cc'ing Greg, Sage and Zheng)
>>>
>>> This patch sort of contradicts the original design, AFAICT, and I'm not
>>> sure what the correct behavior should be. I could use some
>>> clarification.
>>>
>>> The original code (from the 2009 merge) would tolerate 2 ESTALEs before
>>> giving up and returning that to userland. Then in 2010, Greg added this
>>> commit:
>>>
>>>       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f
>>>
>>> ...which would presumably make it retry indefinitely as long as the auth
>>> MDS kept changing. Then, Zheng made this change in 2013:
>>>
>>>       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938
>>>
>>> ...which seems to try to do the same thing, but detected the auth mds
>>> change in a different way.
>>>
>>> Is that where livelock detection was broken? Or was there some
>>> corresponding change to __choose_mds that should prevent infinitely
>>> looping on the same request?
>>>
>>> In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
>>> and that the server has forgotten about it. Does it mean the same thing
>>> to the ceph MDS?
>>>
>>> Has the behavior of the MDS changed such that these retries are no
>>> longer necessary on an ESTALE? If so, when did this change, and does the
>>> client need to do anything to detect what behavior it should be using?



* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07 15:12 ` Jeff Layton
  2022-02-07 15:40   ` Gregory Farnum
  2022-02-08  7:56   ` Xiubo Li
@ 2022-02-09  1:56   ` Xiubo Li
  2 siblings, 0 replies; 13+ messages in thread
From: Xiubo Li @ 2022-02-09  1:56 UTC (permalink / raw)
  To: Jeff Layton
  Cc: idryomov, vshankar, ceph-devel, Sage Weil, Gregory Farnum, ukernel


On 2/7/22 11:12 PM, Jeff Layton wrote:
> On Mon, 2022-02-07 at 13:03 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> If MDS return ESTALE, that means the MDS has already iterated all
>> the possible active MDSes including the auth MDS or the inode is
>> under purging. No need to retry in auth MDS and will just return
>> ESTALE directly.
>>
> When you say "purging" here, do you mean that it's effectively being
> cleaned up after being unlinked? Or is it just being purged from the
> MDS's cache?
>
>> Or it will cause definite loop for retrying it.
>>
>> URL: https://tracker.ceph.com/issues/53504
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/mds_client.c | 29 -----------------------------
>>   1 file changed, 29 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 93e5e3c4ba64..c918d2ac8272 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3368,35 +3368,6 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>>   
>>   	result = le32_to_cpu(head->result);
>>   
>> -	/*
>> -	 * Handle an ESTALE
>> -	 * if we're not talking to the authority, send to them
>> -	 * if the authority has changed while we weren't looking,
>> -	 * send to new authority
>> -	 * Otherwise we just have to return an ESTALE
>> -	 */
>> -	if (result == -ESTALE) {
>> -		dout("got ESTALE on request %llu\n", req->r_tid);
>> -		req->r_resend_mds = -1;
>> -		if (req->r_direct_mode != USE_AUTH_MDS) {
>> -			dout("not using auth, setting for that now\n");
>> -			req->r_direct_mode = USE_AUTH_MDS;
>> -			__do_request(mdsc, req);
>> -			mutex_unlock(&mdsc->mutex);
>> -			goto out;
>> -		} else  {
>> -			int mds = __choose_mds(mdsc, req, NULL);
>> -			if (mds >= 0 && mds != req->r_session->s_mds) {
>> -				dout("but auth changed, so resending\n");
>> -				__do_request(mdsc, req);
>> -				mutex_unlock(&mdsc->mutex);
>> -				goto out;
>> -			}
>> -		}
>> -		dout("have to return ESTALE on request %llu\n", req->r_tid);
>> -	}
>> -
>> -
>>   	if (head->safe) {
>>   		set_bit(CEPH_MDS_R_GOT_SAFE, &req->r_req_flags);
>>   		__unregister_request(mdsc, req);
>
> (cc'ing Greg, Sage and Zheng)
>
> This patch sort of contradicts the original design, AFAICT, and I'm not
> sure what the correct behavior should be. I could use some
> clarification.
>
> The original code (from the 2009 merge) would tolerate 2 ESTALEs before
> giving up and returning that to userland. Then in 2010, Greg added this
> commit:
>
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e55b71f802fd448a79275ba7b263fe1a8639be5f

find_ino_peer support was added after the original code:

c2c333d8cb0 mds: find_ino_peer

Date:   Thu Mar 31 15:55:10 2011 -0700

From the code, find_ino_peer should have already checked all the 
possible MDSes, even if the auth had changed.


- Xiubo

> ...which would presumably make it retry indefinitely as long as the auth
> MDS kept changing. Then, Zheng made this change in 2013:
>
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca18bede048e95a749d13410ce1da4ad0ffa7938
>
> ...which seems to try to do the same thing, but detected the auth mds
> change in a different way.
>
> Is that where livelock detection was broken? Or was there some
> corresponding change to __choose_mds that should prevent infinitely
> looping on the same request?
>
> In NFS, ESTALE errors mean that the filehandle (inode) no longer exists
> and that the server has forgotten about it. Does it mean the same thing
> to the ceph MDS?
>
> Has the behavior of the MDS changed such that these retries are no
> longer necessary on an ESTALE? If so, when did this change, and does the
> client need to do anything to detect what behavior it should be using?



* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-07 17:11         ` Jeff Layton
@ 2022-02-09  6:00           ` Xiubo Li
  2022-02-09 11:34             ` Jeff Layton
  2022-02-17  3:18             ` Gregory Farnum
  0 siblings, 2 replies; 13+ messages in thread
From: Xiubo Li @ 2022-02-09  6:00 UTC (permalink / raw)
  To: Jeff Layton, Gregory Farnum
  Cc: Dan van der Ster, Ilya Dryomov, Venky Shankar, ceph-devel,
	Sage Weil, ukernel


On 2/8/22 1:11 AM, Jeff Layton wrote:
> On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
>> On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@kernel.org> wrote:
>>> The tracker bug mentions that this occurs after an MDS is restarted.
>>> Could this be the result of clients relying on delete-on-last-close
>>> behavior?
>> Oooh, I didn't actually look at the tracker.
>>
>>> IOW, we have a situation where a file is opened and then unlinked, and
>>> userland is actively doing I/O to it. The thing gets moved into the
>>> strays dir, but isn't unlinked yet because we have open files against
>>> it. Everything works fine at this point...
>>>
>>> Then, the MDS restarts and the inode gets purged altogether. Client
>>> reconnects and tries to reclaim his open, and gets ESTALE.
>> Uh, okay. So I didn't do a proper audit before I sent my previous
>> reply, but one of the cases I did see was that the MDS returns ESTALE
>> if you try to do a name lookup on an inode in the stray directory. I
>> don't know if that's what is happening here or not? But perhaps that's
>> the root of the problem in this case.
>>
>> Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
>> directly so it must indeed be coming out of MDCache::path_traverse.
>>
>> The MDS shouldn't move an inode into the purge queue on restart unless
>> there were no clients with caps on it (that state is persisted to disk
>> so it knows). Maybe if the clients don't make the reconnect window
>> it's dropping them all and *then* moves it into purge queue? I think
>> we need to identify what's happening there before we issue kernel
>> client changes, Xiubo?
>
> Agreed. I think we need to understand why he's seeing ESTALE errors in
> the first place, but it sounds like retrying on an ESTALE error isn't
> likely to be helpful.

There is one case that could cause the inode to be put into the purge 
queue:

1, A file is unlinked, and just after the unlink journal log is 
flushed, the MDS is restarted or replaced by a standby MDS. The unlink 
journal log will contain a straydn, and the straydn will link to the 
related CInode.

2, The new starting MDS will replay this unlink journal log in the 
up:standby_replay state.

3, The MDCache::upkeep_main() thread will try to trim the MDCache, and 
it will possibly trim the straydn. Since the clients haven't 
reconnected their sessions yet, the CInode won't have any client caps, 
so when trimming the straydn and CInode, the CInode will be put into 
the purge queue.

4, After up:reconnect, when the client retries the getattr requests, 
the MDS will return ESTALE.

This should have been fixed recently by 
https://github.com/ceph/ceph/pull/41667, which only enables trim() in 
the up:active state.

I also went through the ESTALE-related code in the MDS; this patch 
still makes sense, and retrying the request after getting an ESTALE 
errno makes no sense.

BRs

Xiubo




* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-09  6:00           ` Xiubo Li
@ 2022-02-09 11:34             ` Jeff Layton
  2022-02-17  3:18             ` Gregory Farnum
  1 sibling, 0 replies; 13+ messages in thread
From: Jeff Layton @ 2022-02-09 11:34 UTC (permalink / raw)
  To: Xiubo Li, Gregory Farnum
  Cc: Dan van der Ster, Ilya Dryomov, Venky Shankar, ceph-devel,
	Sage Weil, ukernel

On Wed, 2022-02-09 at 14:00 +0800, Xiubo Li wrote:
> On 2/8/22 1:11 AM, Jeff Layton wrote:
> > On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
> > > On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > The tracker bug mentions that this occurs after an MDS is restarted.
> > > > Could this be the result of clients relying on delete-on-last-close
> > > > behavior?
> > > Oooh, I didn't actually look at the tracker.
> > > 
> > > > IOW, we have a situation where a file is opened and then unlinked, and
> > > > userland is actively doing I/O to it. The thing gets moved into the
> > > > strays dir, but isn't unlinked yet because we have open files against
> > > > it. Everything works fine at this point...
> > > > 
> > > > Then, the MDS restarts and the inode gets purged altogether. Client
> > > > reconnects and tries to reclaim his open, and gets ESTALE.
> > > Uh, okay. So I didn't do a proper audit before I sent my previous
> > > reply, but one of the cases I did see was that the MDS returns ESTALE
> > > if you try to do a name lookup on an inode in the stray directory. I
> > > don't know if that's what is happening here or not? But perhaps that's
> > > the root of the problem in this case.
> > > 
> > > Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
> > > directly so it must indeed be coming out of MDCache::path_traverse.
> > > 
> > > The MDS shouldn't move an inode into the purge queue on restart unless
> > > there were no clients with caps on it (that state is persisted to disk
> > > so it knows). Maybe if the clients don't make the reconnect window
> > > it's dropping them all and *then* moves it into purge queue? I think
> > > we need to identify what's happening there before we issue kernel
> > > client changes, Xiubo?
> > 
> > Agreed. I think we need to understand why he's seeing ESTALE errors in
> > the first place, but it sounds like retrying on an ESTALE error isn't
> > likely to be helpful.
> 
> There is one case that could cause the inode to be put into the purge 
> queue:
> 
> 1, A file is unlinked, and just after the unlink journal log is 
> flushed, the MDS is restarted or replaced by a standby MDS. The unlink 
> journal log will contain a straydn, and the straydn will link to the 
> related CInode.
> 
> 2, The new starting MDS will replay this unlink journal log in the 
> up:standby_replay state.
> 
> 3, The MDCache::upkeep_main() thread will try to trim the MDCache, and 
> it will possibly trim the straydn. Since the clients haven't 
> reconnected their sessions yet, the CInode won't have any client caps, 
> so when trimming the straydn and CInode, the CInode will be put into 
> the purge queue.
> 
> 4, After up:reconnect, when the client retries the getattr requests, 
> the MDS will return ESTALE.
> 
> This should have been fixed recently by 
> https://github.com/ceph/ceph/pull/41667, which only enables trim() in 
> the up:active state.
> 
> I also went through the ESTALE-related code in the MDS; this patch 
> still makes sense, and retrying the request after getting an ESTALE 
> errno makes no sense.
> 

Agreed. I think retrying an operation directly on an ESTALE makes no
sense, and it probably prevents the ESTALE handling in the vfs layer
from working properly.

Usually, when we get back an ESTALE error on NFS with a path-based
operation, it means that it did a lookup (often out of the client's
cache) and found an inode, but by the time we got around to doing the
operation, the filehandle had vanished from the server. The kernel will
then go and set LOOKUP_REVAL and do the pathwalk again (since that's a
pretty good indicator that the dcache was wrong).

It's a bit different in ceph since it's a (semi-)path-based protocol,
and the lookup cache coherency is tighter, but I think the same sort of
races are probably possible. Allowing ESTALE to bubble back up to the
VFS layer would allow it to retry the lookups and do it again.

I'm planning to take this patch into the testing branch and aim for it
to go into v5.18.

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE
  2022-02-09  6:00           ` Xiubo Li
  2022-02-09 11:34             ` Jeff Layton
@ 2022-02-17  3:18             ` Gregory Farnum
  1 sibling, 0 replies; 13+ messages in thread
From: Gregory Farnum @ 2022-02-17  3:18 UTC (permalink / raw)
  To: Xiubo Li
  Cc: Jeff Layton, Dan van der Ster, Ilya Dryomov, Venky Shankar,
	ceph-devel, Sage Weil, ukernel

On Tue, Feb 8, 2022 at 10:00 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 2/8/22 1:11 AM, Jeff Layton wrote:
> > On Mon, 2022-02-07 at 08:28 -0800, Gregory Farnum wrote:
> >> On Mon, Feb 7, 2022 at 8:13 AM Jeff Layton <jlayton@kernel.org> wrote:
> >>> The tracker bug mentions that this occurs after an MDS is restarted.
> >>> Could this be the result of clients relying on delete-on-last-close
> >>> behavior?
> >> Oooh, I didn't actually look at the tracker.
> >>
> >>> IOW, we have a situation where a file is opened and then unlinked, and
> >>> userland is actively doing I/O to it. The thing gets moved into the
> >>> strays dir, but isn't unlinked yet because we have open files against
> >>> it. Everything works fine at this point...
> >>>
> >>> Then, the MDS restarts and the inode gets purged altogether. Client
> >>> reconnects and tries to reclaim his open, and gets ESTALE.
> >> Uh, okay. So I didn't do a proper audit before I sent my previous
> >> reply, but one of the cases I did see was that the MDS returns ESTALE
> >> if you try to do a name lookup on an inode in the stray directory. I
> >> don't know if that's what is happening here or not? But perhaps that's
> >> the root of the problem in this case.
> >>
> >> Oh, nope, I see it's issuing getattr requests. That doesn't do ESTALE
> >> directly so it must indeed be coming out of MDCache::path_traverse.
> >>
> >> The MDS shouldn't move an inode into the purge queue on restart unless
> >> there were no clients with caps on it (that state is persisted to disk
> >> so it knows). Maybe if the clients don't make the reconnect window
> >> it's dropping them all and *then* moves it into purge queue? I think
> >> we need to identify what's happening there before we issue kernel
> >> client changes, Xiubo?
> >
> > Agreed. I think we need to understand why he's seeing ESTALE errors in
> > the first place, but it sounds like retrying on an ESTALE error isn't
> > likely to be helpful.
>
> There is one case that could cause the inode to be put into the purge
> queue:
>
> 1, A file is unlinked, and just after the unlink journal log is
> flushed, the MDS is restarted or replaced by a standby MDS. The unlink
> journal log will contain a straydn, and the straydn will link to the
> related CInode.
>
> 2, The new starting MDS will replay this unlink journal log in the
> up:standby_replay state.
>
> 3, The MDCache::upkeep_main() thread will try to trim the MDCache, and
> it will possibly trim the straydn. Since the clients haven't
> reconnected their sessions yet, the CInode won't have any client caps,
> so when trimming the straydn and CInode, the CInode will be put into
> the purge queue.
>
> 4, After up:reconnect, when the client retries the getattr requests,
> the MDS will return ESTALE.
>
> This should have been fixed recently by
> https://github.com/ceph/ceph/pull/41667, which only enables trim() in
> the up:active state.
>
> I also went through the ESTALE-related code in the MDS; this patch
> still makes sense, and retrying the request after getting an ESTALE
> errno makes no sense.

Thanks for checking; this sounds good to me.

Acked-by: Greg Farnum <gfarnum@redhat.com>

>
> BRs
>
> Xiubo
>
>



end of thread, other threads:[~2022-02-17  3:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-07  5:03 [PATCH] ceph: fail the request directly if handle_reply gets an ESTALE xiubli
2022-02-07 15:12 ` Jeff Layton
2022-02-07 15:40   ` Gregory Farnum
2022-02-07 16:13     ` Jeff Layton
2022-02-07 16:28       ` Gregory Farnum
2022-02-07 17:11         ` Jeff Layton
2022-02-09  6:00           ` Xiubo Li
2022-02-09 11:34             ` Jeff Layton
2022-02-17  3:18             ` Gregory Farnum
2022-02-08  7:56   ` Xiubo Li
2022-02-08 15:16     ` Gregory Farnum
2022-02-09  1:35       ` Xiubo Li
2022-02-09  1:56   ` Xiubo Li
