* [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
@ 2020-09-21 11:04 Benjamin Coddington
  2020-09-21 11:04 ` [PATCH 2/3] NFSv4: Refactor nfs_need_update_open_stateid() Benjamin Coddington
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-21 11:04 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker; +Cc: linux-nfs

Since commit 0e0cb35b417f ("NFSv4: Handle NFS4ERR_OLD_STATEID in
CLOSE/OPEN_DOWNGRADE") the following livelock may occur if a CLOSE races
with the update of the nfs_state:

Process 1         Process 2        Server
=========         =========        ========
 OPEN file
                  OPEN file
                                   Reply OPEN (1)
                                   Reply OPEN (2)
 Update state (1)
 CLOSE file (1)
                                   Reply OLD_STATEID (1)
 CLOSE file (2)
                                   Reply CLOSE (-1)
                  Update state (2)
                  wait for state change
 OPEN file
                  wake
 CLOSE file
 OPEN file
                  wake
 CLOSE file
 ...
                  ...

As long as the first process continues updating state, the second process
will fail to exit the loop in nfs_set_open_stateid_locked().  This livelock
has been observed in generic/168.

Fix this by detecting the case in nfs_need_update_open_stateid() and
exiting the loop if:
 - the state is NFS_OPEN_STATE, and
 - the stateid sequence is > 1, and
 - the stateid doesn't match the current open stateid
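As a sketch only (not the kernel code), the bailout condition above can be
modeled in userspace C; the struct layout, the host-order seqid, and the
should_drop_update() helper are illustrative inventions, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified userspace model of an nfs4_stateid -- not the kernel type. */
struct stateid {
	uint32_t seqid;      /* host order here; the kernel stores big-endian */
	char     other[12];  /* identifies which open this stateid belongs to */
};

/*
 * Model of the new bailout: the state is already open, the incoming
 * seqid is > 1, and the "other" field does not match the current open
 * stateid, so the update is dropped instead of waited for.
 */
static bool should_drop_update(bool open_state_set,
			       const struct stateid *incoming,
			       const struct stateid *current_open)
{
	bool same_other = memcmp(incoming->other, current_open->other,
				 sizeof(incoming->other)) == 0;

	return open_state_set && incoming->seqid > 1 && !same_other;
}
```

With all three conditions true the update is dropped rather than queued
behind NFS_STATE_CHANGE_WAIT, which is what breaks the livelock.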

Fixes: 0e0cb35b417f ("NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/nfs4proc.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 45e0585e0667..9ced7a62c05e 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1570,10 +1570,14 @@ static bool nfs_need_update_open_stateid(struct nfs4_state *state,
 {
 	if (test_bit(NFS_OPEN_STATE, &state->flags) == 0 ||
 	    !nfs4_stateid_match_other(stateid, &state->open_stateid)) {
-		if (stateid->seqid == cpu_to_be32(1))
+		if (stateid->seqid == cpu_to_be32(1)) {
 			nfs_state_log_update_open_stateid(state);
-		else
-			set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
+		} else {
+			if (!nfs4_stateid_match_other(stateid, &state->open_stateid))
+				return false;
+			else
+				set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
+		}
 		return true;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] NFSv4: Refactor nfs_need_update_open_stateid()
  2020-09-21 11:04 [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence Benjamin Coddington
@ 2020-09-21 11:04 ` Benjamin Coddington
  2020-09-21 11:17   ` Benjamin Coddington
  2020-09-22 10:14   ` [PATCH 2/3 v2] " Benjamin Coddington
  2020-09-21 11:04 ` [PATCH 3/3] NFSv4: cleanup unused zero_stateid copy Benjamin Coddington
  2020-09-22 14:03 ` [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence Anna Schumaker
  2 siblings, 2 replies; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-21 11:04 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker; +Cc: linux-nfs

The logic was becoming difficult to follow.  Add some comments and local
variables to clarify the behavior.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/nfs4proc.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 9ced7a62c05e..499f978d48aa 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1568,23 +1568,34 @@ static void nfs_test_and_clear_all_open_stateid(struct nfs4_state *state)
 static bool nfs_need_update_open_stateid(struct nfs4_state *state,
 		const nfs4_stateid *stateid)
 {
-	if (test_bit(NFS_OPEN_STATE, &state->flags) == 0 ||
-	    !nfs4_stateid_match_other(stateid, &state->open_stateid)) {
-		if (stateid->seqid == cpu_to_be32(1)) {
+	bool state_matches_other = nfs4_stateid_match_other(stateid, &state->open_stateid);
+	bool seqid_one = stateid->seqid == cpu_to_be32(1);
+
+	if (test_bit(NFS_OPEN_STATE, &state->flags)) {
+		/* The common case - we're updating to a new sequence number */
+		if (state_matches_other && nfs4_stateid_is_newer(stateid, &state->open_stateid)) {
+			nfs_state_log_out_of_order_open_stateid(state, stateid);
+			return true;
+		}
+		/* We lost a race with a self-bumping close, do recovery */
+		if (!state_matches_other) {
+			trace_printk("lost race to self-bump close\n");
+			return false;
+		}
+	} else {
+		/* The common case, this is the first OPEN */
+		if (!state_matches_other && seqid_one) {
 			nfs_state_log_update_open_stateid(state);
-		} else {
-			if (!nfs4_stateid_match_other(stateid, &state->open_stateid))
-				return false;
-			else
-				set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
+			return true;
+		}
+		/* We lost a race either with a self-bumping close, OR with the first OPEN */
+		if (!state_matches_other && !seqid_one) {
+			trace_printk("lost race to self-bump close OR first OPEN\n");
+			set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
+			return true;
 		}
-		return true;
-	}
-
-	if (nfs4_stateid_is_newer(stateid, &state->open_stateid)) {
-		nfs_state_log_out_of_order_open_stateid(state, stateid);
-		return true;
 	}
+	/* Should be impossible to reach: */
 	return false;
 }
 
-- 
2.20.1



* [PATCH 3/3] NFSv4: cleanup unused zero_stateid copy
  2020-09-21 11:04 [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence Benjamin Coddington
  2020-09-21 11:04 ` [PATCH 2/3] NFSv4: Refactor nfs_need_update_open_stateid() Benjamin Coddington
@ 2020-09-21 11:04 ` Benjamin Coddington
  2020-09-22 14:03 ` [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence Anna Schumaker
  2 siblings, 0 replies; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-21 11:04 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker; +Cc: linux-nfs

Since commit d9aba2b40de6 ("NFSv4: Don't use the zero stateid with
layoutget") the zero stateid will never be used.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/nfs4state.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index a8dc25ce48bb..fe2d11ce3fa4 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1018,18 +1018,14 @@ static int nfs4_copy_lock_stateid(nfs4_stateid *dst,
 bool nfs4_copy_open_stateid(nfs4_stateid *dst, struct nfs4_state *state)
 {
 	bool ret;
-	const nfs4_stateid *src;
 	int seq;
 
 	do {
 		ret = false;
-		src = &zero_stateid;
 		seq = read_seqbegin(&state->seqlock);
-		if (test_bit(NFS_OPEN_STATE, &state->flags)) {
-			src = &state->open_stateid;
+		if (test_bit(NFS_OPEN_STATE, &state->flags))
 			ret = true;
-		}
-		nfs4_stateid_copy(dst, src);
+		nfs4_stateid_copy(dst, &state->open_stateid);
 	} while (read_seqretry(&state->seqlock, seq));
 	return ret;
 }
-- 
2.20.1
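The read_seqbegin()/read_seqretry() loop in this hunk follows the usual
seqcount retry pattern, which can be sketched in userspace (an illustrative
single-threaded model only; the kernel's seqlock_t additionally carries a
spinlock and memory barriers):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Minimal model of a seqcount: writers bump the counter to odd on entry
 * and back to even on exit; readers retry if the counter was odd or
 * changed across the read section.
 */
struct seqcount { unsigned seq; };

static unsigned read_begin(const struct seqcount *s)
{
	return s->seq & ~1u;  /* the kernel would spin until even */
}

static bool read_retry(const struct seqcount *s, unsigned start)
{
	return (s->seq & 1) || s->seq != start;
}

static void write_begin(struct seqcount *s) { s->seq++; }  /* now odd */
static void write_end(struct seqcount *s)   { s->seq++; }  /* even again */
```

A reader repeats its copy of the protected data in a do/while loop until
read_retry() reports a stable snapshot, which is the shape of the loop in
nfs4_copy_open_stateid() above.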



* Re: [PATCH 2/3] NFSv4: Refactor nfs_need_update_open_stateid()
  2020-09-21 11:04 ` [PATCH 2/3] NFSv4: Refactor nfs_need_update_open_stateid() Benjamin Coddington
@ 2020-09-21 11:17   ` Benjamin Coddington
  2020-09-22 10:14   ` [PATCH 2/3 v2] " Benjamin Coddington
  1 sibling, 0 replies; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-21 11:17 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker; +Cc: linux-nfs

NACK.  Please drop this patch.  This is a junk version that I was using
for debugging and sent by mistake.  I'll send a cleaned-up version
separately.

Ben

On 21 Sep 2020, at 7:04, Benjamin Coddington wrote:

> The logic was becoming difficult to follow.  Add some comments and 
> local
> variables to clarify the behavior.
>
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  fs/nfs/nfs4proc.c | 39 +++++++++++++++++++++++++--------------
>  1 file changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 9ced7a62c05e..499f978d48aa 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -1568,23 +1568,34 @@ static void 
> nfs_test_and_clear_all_open_stateid(struct nfs4_state *state)
>  static bool nfs_need_update_open_stateid(struct nfs4_state *state,
>  		const nfs4_stateid *stateid)
>  {
> -	if (test_bit(NFS_OPEN_STATE, &state->flags) == 0 ||
> -	    !nfs4_stateid_match_other(stateid, &state->open_stateid)) {
> -		if (stateid->seqid == cpu_to_be32(1)) {
> +	bool state_matches_other = nfs4_stateid_match_other(stateid, 
> &state->open_stateid);
> +	bool seqid_one = stateid->seqid == cpu_to_be32(1);
> +
> +	if (test_bit(NFS_OPEN_STATE, &state->flags)) {
> +		/* The common case - we're updating to a new sequence number */
> +		if (state_matches_other && nfs4_stateid_is_newer(stateid, 
> &state->open_stateid)) {
> +			nfs_state_log_out_of_order_open_stateid(state, stateid);
> +			return true;
> +		}
> +		/* We lost a race with a self-bumping close, do recovery */
> +		if (!state_matches_other) {
> +			trace_printk("lost race to self-bump close\n");
> +			return false;
> +		}
> +	} else {
> +		/* The common case, this is the first OPEN */
> +		if (!state_matches_other && seqid_one) {
>  			nfs_state_log_update_open_stateid(state);
> -		} else {
> -			if (!nfs4_stateid_match_other(stateid, &state->open_stateid))
> -				return false;
> -			else
> -				set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
> +			return true;
> +		}
> +		/* We lost a race either with a self-bumping close, OR with the 
> first OPEN */
> +		if (!state_matches_other && !seqid_one) {
> +			trace_printk("lost race to self-bump close OR first OPEN\n");
> +			set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
> +			return true;
>  		}
> -		return true;
> -	}
> -
> -	if (nfs4_stateid_is_newer(stateid, &state->open_stateid)) {
> -		nfs_state_log_out_of_order_open_stateid(state, stateid);
> -		return true;
>  	}
> +	/* Should be impossible to reach: */
>  	return false;
>  }
>
> -- 
> 2.20.1



* [PATCH 2/3 v2] NFSv4: Refactor nfs_need_update_open_stateid()
  2020-09-21 11:04 ` [PATCH 2/3] NFSv4: Refactor nfs_need_update_open_stateid() Benjamin Coddington
  2020-09-21 11:17   ` Benjamin Coddington
@ 2020-09-22 10:14   ` Benjamin Coddington
  1 sibling, 0 replies; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-22 10:14 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker; +Cc: linux-nfs

The logic was becoming difficult to follow.  Add some comments and local
variables to clarify the behavior.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/nfs4proc.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 9ced7a62c05e..827659ee1fad 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1568,22 +1568,25 @@ static void nfs_test_and_clear_all_open_stateid(struct nfs4_state *state)
 static bool nfs_need_update_open_stateid(struct nfs4_state *state,
 		const nfs4_stateid *stateid)
 {
-	if (test_bit(NFS_OPEN_STATE, &state->flags) == 0 ||
-	    !nfs4_stateid_match_other(stateid, &state->open_stateid)) {
-		if (stateid->seqid == cpu_to_be32(1)) {
+	bool state_matches_other = nfs4_stateid_match_other(stateid, &state->open_stateid);
+	bool seqid_one = stateid->seqid == cpu_to_be32(1);
+
+	if (test_bit(NFS_OPEN_STATE, &state->flags)) {
+		/* The common case - we're updating to a new sequence number */
+		if (state_matches_other && nfs4_stateid_is_newer(stateid, &state->open_stateid)) {
+			nfs_state_log_out_of_order_open_stateid(state, stateid);
+			return true;
+		}
+	} else {
+		/* This is the first OPEN */
+		if (!state_matches_other && seqid_one) {
 			nfs_state_log_update_open_stateid(state);
-		} else {
-			if (!nfs4_stateid_match_other(stateid, &state->open_stateid))
-				return false;
-			else
-				set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
+			return true;
+		}
+		if (!state_matches_other && !seqid_one) {
+			set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
+			return true;
 		}
-		return true;
-	}
-
-	if (nfs4_stateid_is_newer(stateid, &state->open_stateid)) {
-		nfs_state_log_out_of_order_open_stateid(state, stateid);
-		return true;
 	}
 	return false;
 }
-- 
2.20.1
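For illustration, the cases the refactored function distinguishes can be
modeled as a pure userspace decision function (a sketch under simplified
assumptions: host-order seqids and invented names, not the kernel's API;
the wraparound-safe comparison mirrors what nfs4_stateid_is_newer() does):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum action { UPDATE, UPDATE_AND_WAIT, DROP };

/*
 * open_state  - NFS_OPEN_STATE is set (we already hold an open stateid)
 * same_other  - incoming stateid's "other" matches the current one
 * seqid       - incoming sequence number (host order in this model)
 * cur_seqid   - currently held sequence number
 */
static enum action classify(bool open_state, bool same_other,
			    uint32_t seqid, uint32_t cur_seqid)
{
	if (open_state) {
		/* Common case: same open, newer sequence number. */
		if (same_other && (int32_t)(seqid - cur_seqid) > 0)
			return UPDATE;
	} else {
		/* First OPEN reply to arrive. */
		if (!same_other && seqid == 1)
			return UPDATE;
		/* Lost a race with another OPEN: take it, but wait. */
		if (!same_other && seqid > 1)
			return UPDATE_AND_WAIT;
	}
	return DROP;  /* stale, or nothing newer to record */
}
```

The (int32_t) subtraction is the standard serial-number trick for comparing
sequence counters that may wrap around.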



* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
  2020-09-21 11:04 [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence Benjamin Coddington
  2020-09-21 11:04 ` [PATCH 2/3] NFSv4: Refactor nfs_need_update_open_stateid() Benjamin Coddington
  2020-09-21 11:04 ` [PATCH 3/3] NFSv4: cleanup unused zero_stateid copy Benjamin Coddington
@ 2020-09-22 14:03 ` Anna Schumaker
  2020-09-22 14:22   ` Benjamin Coddington
  2 siblings, 1 reply; 13+ messages in thread
From: Anna Schumaker @ 2020-09-22 14:03 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Trond Myklebust, Linux NFS Mailing List

Hi Ben,

On Mon, Sep 21, 2020 at 7:05 AM Benjamin Coddington <bcodding@redhat.com> wrote:
>
> Since commit 0e0cb35b417f ("NFSv4: Handle NFS4ERR_OLD_STATEID in
> CLOSE/OPEN_DOWNGRADE") the following livelock may occur if a CLOSE races
> with the update of the nfs_state:
>
> Process 1         Process 2        Server
> =========         =========        ========
>  OPEN file
>                   OPEN file
>                                    Reply OPEN (1)
>                                    Reply OPEN (2)
>  Update state (1)
>  CLOSE file (1)
>                                    Reply OLD_STATEID (1)
>  CLOSE file (2)
>                                    Reply CLOSE (-1)
>                   Update state (2)
>                   wait for state change
>  OPEN file
>                   wake
>  CLOSE file
>  OPEN file
>                   wake
>  CLOSE file
>  ...
>                   ...
>
> As long as the first process continues updating state, the second process
> will fail to exit the loop in nfs_set_open_stateid_locked().  This livelock
> has been observed in generic/168.

Once I apply this patch I have trouble with generic/478 doing lock reclaim:

[  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
[  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!

And the test just hangs until I kill it.

Just thought you should know!
Anna

>
> Fix this by detecting the case in nfs_need_update_open_stateid() and
> then exit the loop if:
>  - the state is NFS_OPEN_STATE, and
>  - the stateid sequence is > 1, and
>  - the stateid doesn't match the current open stateid
>
> Fixes: 0e0cb35b417f ("NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE")
> Cc: stable@vger.kernel.org # v5.4+
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
>  fs/nfs/nfs4proc.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 45e0585e0667..9ced7a62c05e 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -1570,10 +1570,14 @@ static bool nfs_need_update_open_stateid(struct nfs4_state *state,
>  {
>         if (test_bit(NFS_OPEN_STATE, &state->flags) == 0 ||
>             !nfs4_stateid_match_other(stateid, &state->open_stateid)) {
> -               if (stateid->seqid == cpu_to_be32(1))
> +               if (stateid->seqid == cpu_to_be32(1)) {
>                         nfs_state_log_update_open_stateid(state);
> -               else
> -                       set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
> +               } else {
> +                       if (!nfs4_stateid_match_other(stateid, &state->open_stateid))
> +                               return false;
> +                       else
> +                               set_bit(NFS_STATE_CHANGE_WAIT, &state->flags);
> +               }
>                 return true;
>         }
>
> --
> 2.20.1
>


* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
  2020-09-22 14:03 ` [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence Anna Schumaker
@ 2020-09-22 14:22   ` Benjamin Coddington
  2020-09-22 14:31     ` Anna Schumaker
  0 siblings, 1 reply; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-22 14:22 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, Linux NFS Mailing List

On 22 Sep 2020, at 10:03, Anna Schumaker wrote:
> Hi Ben,
>
> Once I apply this patch I have trouble with generic/478 doing lock reclaim:
>
> [  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
> [  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
>
> And the test just hangs until I kill it.
>
> Just thought you should know!

Yes, thanks!  I'm not seeing that.  I've tested these based on v5.8.4; I'll
rebase and check again.  I see a wire capture of generic/478 is only 515K on
my system; would you be willing to share a capture of your failing test?

Ben



* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
  2020-09-22 14:22   ` Benjamin Coddington
@ 2020-09-22 14:31     ` Anna Schumaker
       [not found]       ` <CAFX2JfkQSonD=Hnn40Y8A62rfmoQ2d8_ugNvOmg+Ny8zJ6dLAg@mail.gmail.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Anna Schumaker @ 2020-09-22 14:31 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Trond Myklebust, Linux NFS Mailing List

On Tue, Sep 22, 2020 at 10:22 AM Benjamin Coddington
<bcodding@redhat.com> wrote:
>
> On 22 Sep 2020, at 10:03, Anna Schumaker wrote:
> > Hi Ben,
> >
> > Once I apply this patch I have trouble with generic/478 doing lock reclaim:
> >
> > [  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
> > [  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
> >
> > And the test just hangs until I kill it.
> >
> > Just thought you should know!
>
> Yes, thanks!  I'm not seeing that..  I've tested these based on v5.8.4, I'll
> rebase and check again.  I see a wirecap of generic/478 is only 515K on my
> system, would you be willing to share a capture of your test failing?

I have it based on v5.9-rc6 (plus the patches I have queued up for
v5.10), so there definitely could be a difference there! I'm using a
stock kernel on my server, though :)

I can definitely get you a packet trace once I re-apply the patch and
rerun the test.

Anna

>
> Ben
>


* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
       [not found]       ` <CAFX2JfkQSonD=Hnn40Y8A62rfmoQ2d8_ugNvOmg+Ny8zJ6dLAg@mail.gmail.com>
@ 2020-09-22 15:46         ` Benjamin Coddington
  2020-09-22 15:53           ` Anna Schumaker
  0 siblings, 1 reply; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-22 15:46 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, Linux NFS Mailing List

On 22 Sep 2020, at 10:43, Anna Schumaker wrote:

> On Tue, Sep 22, 2020 at 10:31 AM Anna Schumaker
> <anna.schumaker@netapp.com> wrote:
>>
>> On Tue, Sep 22, 2020 at 10:22 AM Benjamin Coddington
>> <bcodding@redhat.com> wrote:
>>>
>>> On 22 Sep 2020, at 10:03, Anna Schumaker wrote:
>>>> Hi Ben,
>>>>
>>>> Once I apply this patch I have trouble with generic/478 doing lock 
>>>> reclaim:
>>>>
>>>> [  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
>>>> [  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
>>>>
>>>> And the test just hangs until I kill it.
>>>>
>>>> Just thought you should know!
>>>
>>> Yes, thanks!  I'm not seeing that..  I've tested these based on 
>>> v5.8.4, I'll
>>> rebase and check again.  I see a wirecap of generic/478 is only 515K 
>>> on my
>>> system, would you be willing to share a capture of your test 
>>> failing?
>>
>> I have it based on v5.9-rc6 (plus the patches I have queued up for
>> v5.10), so there definitely could be a difference there! I'm using a
>> stock kernel on my server, though :)
>>
>> I can definitely get you a packet trace once I re-apply the patch and
>> rerun the test.
>
> Here's the packet trace, I reran the test with just this patch applied
> on top of v5.9-rc6 so it's not interacting with something else in my
> tree. Looks like it's ending up in an NFS4ERR_OLD_STATEID loop.

Thanks very much!

Did you see this failure with all three patches applied, or just with 
the
first patch?

I see the client get two OPEN responses, but then it sends TEST_STATEID
with the first seqid.  It seems like seqid 2 is getting lost.  I wonder if
we're making a bad assumption that NFS_OPEN_STATE can only be toggled under
the so_lock.

Ben



* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
  2020-09-22 15:46         ` Benjamin Coddington
@ 2020-09-22 15:53           ` Anna Schumaker
  2020-09-22 16:11             ` Anna Schumaker
  0 siblings, 1 reply; 13+ messages in thread
From: Anna Schumaker @ 2020-09-22 15:53 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Trond Myklebust, Linux NFS Mailing List

On Tue, Sep 22, 2020 at 11:49 AM Benjamin Coddington
<bcodding@redhat.com> wrote:
>
> On 22 Sep 2020, at 10:43, Anna Schumaker wrote:
>
> > On Tue, Sep 22, 2020 at 10:31 AM Anna Schumaker
> > <anna.schumaker@netapp.com> wrote:
> >>
> >> On Tue, Sep 22, 2020 at 10:22 AM Benjamin Coddington
> >> <bcodding@redhat.com> wrote:
> >>>
> >>> On 22 Sep 2020, at 10:03, Anna Schumaker wrote:
> >>>> Hi Ben,
> >>>>
> >>>> Once I apply this patch I have trouble with generic/478 doing lock
> >>>> reclaim:
> >>>>
> >>>> [  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
> >>>> [  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
> >>>>
> >>>> And the test just hangs until I kill it.
> >>>>
> >>>> Just thought you should know!
> >>>
> >>> Yes, thanks!  I'm not seeing that..  I've tested these based on
> >>> v5.8.4, I'll
> >>> rebase and check again.  I see a wirecap of generic/478 is only 515K
> >>> on my
> >>> system, would you be willing to share a capture of your test
> >>> failing?
> >>
> >> I have it based on v5.9-rc6 (plus the patches I have queued up for
> >> v5.10), so there definitely could be a difference there! I'm using a
> >> stock kernel on my server, though :)
> >>
> >> I can definitely get you a packet trace once I re-apply the patch and
> >> rerun the test.
> >
> > Here's the packet trace, I reran the test with just this patch applied
> > on top of v5.9-rc6 so it's not interacting with something else in my
> > tree. Looks like it's ending up in an NFS4ERR_OLD_STATEID loop.
>
> Thanks very much!
>
> Did you see this failure with all three patches applied, or just with
> the
> first patch?

I saw it with the first patch applied, and with the first and third
applied.  I initially hit it as I was wrapping up for the day yesterday,
but I left out #2 since I saw your retraction.

>
> I see the client get two OPEN responses, but then is sending
> TEST_STATEID
> with the first seqid.  Seems like seqid 2 is getting lost.  I wonder if
> we're making a bad assumption that NFS_OPEN_STATE can only be toggled
> under
> the so_lock.
>
> Ben
>


* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
  2020-09-22 15:53           ` Anna Schumaker
@ 2020-09-22 16:11             ` Anna Schumaker
  2020-09-22 18:47               ` Benjamin Coddington
  0 siblings, 1 reply; 13+ messages in thread
From: Anna Schumaker @ 2020-09-22 16:11 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Trond Myklebust, Linux NFS Mailing List

On Tue, Sep 22, 2020 at 11:53 AM Anna Schumaker
<anna.schumaker@netapp.com> wrote:
>
> On Tue, Sep 22, 2020 at 11:49 AM Benjamin Coddington
> <bcodding@redhat.com> wrote:
> >
> > On 22 Sep 2020, at 10:43, Anna Schumaker wrote:
> >
> > > On Tue, Sep 22, 2020 at 10:31 AM Anna Schumaker
> > > <anna.schumaker@netapp.com> wrote:
> > >>
> > >> On Tue, Sep 22, 2020 at 10:22 AM Benjamin Coddington
> > >> <bcodding@redhat.com> wrote:
> > >>>
> > >>> On 22 Sep 2020, at 10:03, Anna Schumaker wrote:
> > >>>> Hi Ben,
> > >>>>
> > >>>> Once I apply this patch I have trouble with generic/478 doing lock
> > >>>> reclaim:
> > >>>>
> > >>>> [  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
> > >>>> [  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
> > >>>>
> > >>>> And the test just hangs until I kill it.
> > >>>>
> > >>>> Just thought you should know!
> > >>>
> > >>> Yes, thanks!  I'm not seeing that..  I've tested these based on
> > >>> v5.8.4, I'll
> > >>> rebase and check again.  I see a wirecap of generic/478 is only 515K
> > >>> on my
> > >>> system, would you be willing to share a capture of your test
> > >>> failing?
> > >>
> > >> I have it based on v5.9-rc6 (plus the patches I have queued up for
> > >> v5.10), so there definitely could be a difference there! I'm using a
> > >> stock kernel on my server, though :)
> > >>
> > >> I can definitely get you a packet trace once I re-apply the patch and
> > >> rerun the test.
> > >
> > > Here's the packet trace, I reran the test with just this patch applied
> > > on top of v5.9-rc6 so it's not interacting with something else in my
> > > tree. Looks like it's ending up in an NFS4ERR_OLD_STATEID loop.
> >
> > Thanks very much!
> >
> > Did you see this failure with all three patches applied, or just with
> > the
> > first patch?
>
> I saw it with the first patch applied, and with the first and third
> applied. I initially hit it as I was wrapping up for the day
> yesterday, but I left out #2 since I saw your retraction

I reran with all three patches applied, and didn't have the issue. So
something in the refactor patch fixes it.

Anna

>
> >
> > I see the client get two OPEN responses, but then is sending
> > TEST_STATEID
> > with the first seqid.  Seems like seqid 2 is getting lost.  I wonder if
> > we're making a bad assumption that NFS_OPEN_STATE can only be toggled
> > under
> > the so_lock.
> >
> > Ben
> >


* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
  2020-09-22 16:11             ` Anna Schumaker
@ 2020-09-22 18:47               ` Benjamin Coddington
  2020-09-22 18:51                 ` Anna Schumaker
  0 siblings, 1 reply; 13+ messages in thread
From: Benjamin Coddington @ 2020-09-22 18:47 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: Trond Myklebust, Linux NFS Mailing List

On 22 Sep 2020, at 12:11, Anna Schumaker wrote:

> On Tue, Sep 22, 2020 at 11:53 AM Anna Schumaker
> <anna.schumaker@netapp.com> wrote:
>>
>> On Tue, Sep 22, 2020 at 11:49 AM Benjamin Coddington
>> <bcodding@redhat.com> wrote:
>>>
>>> On 22 Sep 2020, at 10:43, Anna Schumaker wrote:
>>>
>>>> On Tue, Sep 22, 2020 at 10:31 AM Anna Schumaker
>>>> <anna.schumaker@netapp.com> wrote:
>>>>>
>>>>> On Tue, Sep 22, 2020 at 10:22 AM Benjamin Coddington
>>>>> <bcodding@redhat.com> wrote:
>>>>>>
>>>>>> On 22 Sep 2020, at 10:03, Anna Schumaker wrote:
>>>>>>> Hi Ben,
>>>>>>>
>>>>>>> Once I apply this patch I have trouble with generic/478 doing lock
>>>>>>> reclaim:
>>>>>>>
>>>>>>> [  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
>>>>>>> [  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
>>>>>>>
>>>>>>> And the test just hangs until I kill it.
>>>>>>>
>>>>>>> Just thought you should know!
>>>>>>
>>>>>> Yes, thanks!  I'm not seeing that..  I've tested these based on
>>>>>> v5.8.4, I'll
>>>>>> rebase and check again.  I see a wirecap of generic/478 is only 515K
>>>>>> on my
>>>>>> system, would you be willing to share a capture of your test
>>>>>> failing?
>>>>>
>>>>> I have it based on v5.9-rc6 (plus the patches I have queued up for
>>>>> v5.10), so there definitely could be a difference there! I'm using a
>>>>> stock kernel on my server, though :)
>>>>>
>>>>> I can definitely get you a packet trace once I re-apply the patch and
>>>>> rerun the test.
>>>>
>>>> Here's the packet trace, I reran the test with just this patch applied
>>>> on top of v5.9-rc6 so it's not interacting with something else in my
>>>> tree. Looks like it's ending up in an NFS4ERR_OLD_STATEID loop.
>>>
>>> Thanks very much!
>>>
>>> Did you see this failure with all three patches applied, or just with
>>> the
>>> first patch?
>>
>> I saw it with the first patch applied, and with the first and third
>> applied. I initially hit it as I was wrapping up for the day
>> yesterday, but I left out #2 since I saw your retraction
>
> I reran with all three patches applied, and didn't have the issue. So
> something in the refactor patch fixes it.

That helped me see that the case we're not handling correctly is when two
OPENs race and the second one tries to update the state first and gets
dropped.  That is fixed by the 2/3 refactor patch, since the refactor is a
bit more explicit.
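A userspace sketch of that difference (hypothetical helper names; each
function elides the branches that don't matter for this race, so this is
only a model of the two behaviors, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The failing case: two OPEN replies race, and the one carrying seqid 2
 * is processed first, before NFS_OPEN_STATE has ever been set.
 */

/* Patch 1/3's logic: seqid != 1 plus a non-matching stateid -> drop. */
static bool patch1_accepts(bool open_state, bool same_other, uint32_t seqid)
{
	if (!open_state || !same_other) {
		if (seqid == 1)
			return true;   /* first OPEN: take it */
		return same_other;     /* non-matching and seqid > 1: dropped */
	}
	return false;                  /* (newer-seqid path elided) */
}

/*
 * Refactored logic: with NFS_OPEN_STATE clear, seqid > 1 means we lost a
 * race with the first OPEN; accept it and wait for the state change.
 */
static bool refactor_accepts(bool open_state, bool same_other, uint32_t seqid)
{
	(void)seqid;                   /* seqid 1 and > 1 are both accepted */
	if (!open_state && !same_other)
		return true;
	return false;                  /* (open-state paths elided) */
}
```

In the race above the model's patch1_accepts() drops the seqid-2 reply
while refactor_accepts() keeps it, which matches what the testing showed.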

That means I'll need to fix those two patches and send them again.  I'm very
glad you caught this!  Thanks very much for helping me find the problem.

Ben



* Re: [PATCH 1/3] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence
  2020-09-22 18:47               ` Benjamin Coddington
@ 2020-09-22 18:51                 ` Anna Schumaker
  0 siblings, 0 replies; 13+ messages in thread
From: Anna Schumaker @ 2020-09-22 18:51 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Trond Myklebust, Linux NFS Mailing List

On Tue, Sep 22, 2020 at 2:47 PM Benjamin Coddington <bcodding@redhat.com> wrote:
>
> On 22 Sep 2020, at 12:11, Anna Schumaker wrote:
>
> > On Tue, Sep 22, 2020 at 11:53 AM Anna Schumaker
> > <anna.schumaker@netapp.com> wrote:
> >>
> >> On Tue, Sep 22, 2020 at 11:49 AM Benjamin Coddington
> >> <bcodding@redhat.com> wrote:
> >>>
> >>> On 22 Sep 2020, at 10:43, Anna Schumaker wrote:
> >>>
> >>>> On Tue, Sep 22, 2020 at 10:31 AM Anna Schumaker
> >>>> <anna.schumaker@netapp.com> wrote:
> >>>>>
> >>>>> On Tue, Sep 22, 2020 at 10:22 AM Benjamin Coddington
> >>>>> <bcodding@redhat.com> wrote:
> >>>>>>
> >>>>>> On 22 Sep 2020, at 10:03, Anna Schumaker wrote:
> >>>>>>> Hi Ben,
> >>>>>>>
> >>>>>>> Once I apply this patch I have trouble with generic/478 doing lock
> >>>>>>> reclaim:
> >>>>>>>
> >>>>>>> [  937.460505] run fstests generic/478 at 2020-09-22 09:59:14
> >>>>>>> [  937.607990] NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
> >>>>>>>
> >>>>>>> And the test just hangs until I kill it.
> >>>>>>>
> >>>>>>> Just thought you should know!
> >>>>>>
> >>>>>> Yes, thanks!  I'm not seeing that..  I've tested these based on
> >>>>>> v5.8.4, I'll
> >>>>>> rebase and check again.  I see a wirecap of generic/478 is only 515K
> >>>>>> on my
> >>>>>> system, would you be willing to share a capture of your test
> >>>>>> failing?
> >>>>>
> >>>>> I have it based on v5.9-rc6 (plus the patches I have queued up for
> >>>>> v5.10), so there definitely could be a difference there! I'm using a
> >>>>> stock kernel on my server, though :)
> >>>>>
> >>>>> I can definitely get you a packet trace once I re-apply the patch and
> >>>>> rerun the test.
> >>>>
> >>>> Here's the packet trace, I reran the test with just this patch applied
> >>>> on top of v5.9-rc6 so it's not interacting with something else in my
> >>>> tree. Looks like it's ending up in an NFS4ERR_OLD_STATEID loop.
> >>>
> >>> Thanks very much!
> >>>
> >>> Did you see this failure with all three patches applied, or just with
> >>> the
> >>> first patch?
> >>
> >> I saw it with the first patch applied, and with the first and third
> >> applied. I initially hit it as I was wrapping up for the day
> >> yesterday, but I left out #2 since I saw your retraction
> >
> > I reran with all three patches applied, and didn't have the issue. So
> > something in the refactor patch fixes it.
>
> That helped me see the case we're not handling correctly is when two OPENs
> race and the second one tries to update the state first and gets dropped.
> That is fixed by the 2/3 refactor patch since the refactor was being a bit
> more explicit.
>
> That means I'll need to fix those two patches and send them again.  I'm very
> glad you caught this!  Thanks very much for helping me find the problem.

You're welcome! I'm looking forward to the next version :)

Anna

>
> Ben
>
