Linux-NFS Archive on lore.kernel.org
* troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
@ 2019-09-12  8:27 Leon Kyneur
  2019-09-17 11:28 ` Benjamin Coddington
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Kyneur @ 2019-09-12  8:27 UTC (permalink / raw)
  To: linux-nfs

Hi

I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot take fcntl
locks on any file on the share. The problem goes away after a full
umount && mount (mount -o remount does not resolve the issue).

Client:
EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64

Server:
EL 7.4 3.10.0-693.5.2.el7.x86_64  nfs-utils-1.3.0-0.48.el7_4.x86_64

I can't figure this out but the client reports bad-sequence-id in
duplicate in the logs:
Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff881c52286220!
Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff881c52286220!
Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff8810889cb020!
Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff8810889cb020!
Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
sequence-id error on an unconfirmed sequence ffff881b414b2620!

wireshark capture shows only 1 BAD_SEQID reply from the server:
$ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
141         93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
0x80589398 Offset: 0 Length: <End of File>  nfs.seqid == 0x0000004e
nfs.seqid == 0x00000002  rpc.xid == 0x9990c61d
142         93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
In 141) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x9990c61d

system call I have identified as triggering it is:
fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
len=1}) = -1 EIO (Input/output error)
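
For reference, the equivalent request can be issued with Python's fcntl
module (a minimal local sketch; fcntl.lockf() uses F_SETLK underneath and
LOCK_SH maps to F_RDLCK — against a local file this succeeds, it is only
on the affected mount that the call fails with EIO):

```python
import fcntl
import tempfile

# Take the same 1-byte read lock at a 1 GiB offset that the failing
# fcntl(F_SETLK, {F_RDLCK, SEEK_SET, start=1073741824, len=1}) requests.
with tempfile.TemporaryFile() as f:
    fcntl.lockf(f, fcntl.LOCK_SH | fcntl.LOCK_NB, 1, 1073741824)
    # On the affected NFS mount this call raises OSError with errno EIO.
    fcntl.lockf(f, fcntl.LOCK_UN, 1, 1073741824)
    print("lock/unlock ok")
```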

The server filesystem is ZFS though NFS sharing is turned off via ZFS
options and it's exported using /etc/exports / nfsd...

The BAD_SEQID error seems to be fairly random, we have over 2000
machines connected to the share and it's experienced frequently but
randomly across our clients.

It's worth mentioning that the majority of the clients are mounting
4.0; we did try 4.1 everywhere but hit this:
https://access.redhat.com/solutions/3146191

mount options are:
server:/data on /d/data type nfs4
(rw,noatime,nodiratime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.27.10.45,local_lock=none,addr=172.27.255.28,_netdev)
or:
server:/data on /d/data type nfs4
(rw,noatime,nodiratime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.27.30.129,local_lock=none,addr=172.27.255.28,_netdev)

I'm at a bit of a loss as to where to look next; I've tried to
reproduce it with threaded locking / unlocking but cannot seem to
create a test case that triggers it.

Thanks

Leon


* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-12  8:27 troubleshooting LOCK FH and NFS4ERR_BAD_SEQID Leon Kyneur
@ 2019-09-17 11:28 ` Benjamin Coddington
  2019-09-18  2:20   ` Leon Kyneur
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Coddington @ 2019-09-17 11:28 UTC (permalink / raw)
  To: Leon Kyneur; +Cc: linux-nfs

On 12 Sep 2019, at 4:27, Leon Kyneur wrote:

> Hi
>
> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot take fcntl
> locks on any file on the share. The problem goes away after a full
> umount && mount (mount -o remount does not resolve the issue).
>
> Client:
> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
>
> Server:
> EL 7.4 3.10.0-693.5.2.el7.x86_64  nfs-utils-1.3.0-0.48.el7_4.x86_64
>
> I can't figure this out but the client reports bad-sequence-id in
> duplicate in the logs:
> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881c52286220!
> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881c52286220!
> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
> sequence-id error on an unconfirmed sequence ffff881b414b2620!
>
> wireshark capture shows only 1 BAD_SEQID reply from the server:
> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
> proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> 141         93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File>  nfs.seqid == 0x0000004e
> nfs.seqid == 0x00000002  rpc.xid == 0x9990c61d
> 142         93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> In 141) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x9990c61d
>
> system call I have identified as triggering it is:
> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
> len=1}) = -1 EIO (Input/output error)

Can you simplify the trigger into something repeatable?  Can you determine
if the client or the server has lost track of the sequence?

> The server filesystem is ZFS though NFS sharing is turned off via ZFS
> options and it's exported using /etc/exports / nfsd...
>
> The BAD_SEQID error seems to be fairly random, we have over 2000
> machines connected to the share and it's experienced frequently but
> randomly accross our clients.
>
> It's worth mentioning that the majority of the clients are mounting
> 4.0; we did try 4.1 everywhere but hit this:
> https://access.redhat.com/solutions/3146191

This was fixed in kernel-3.10.0-735.el7, FWIW..

Ben


* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-17 11:28 ` Benjamin Coddington
@ 2019-09-18  2:20   ` Leon Kyneur
  2019-09-18 11:04     ` Benjamin Coddington
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Kyneur @ 2019-09-18  2:20 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: linux-nfs

On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington <bcodding@redhat.com> wrote:
>
> On 12 Sep 2019, at 4:27, Leon Kyneur wrote:
>
> > Hi
> >
> > I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot take fcntl
> > locks on any file on the share. The problem goes away after a full
> > umount && mount (mount -o remount does not resolve the issue).
> >
> > Client:
> > EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >
> > Server:
> > EL 7.4 3.10.0-693.5.2.el7.x86_64  nfs-utils-1.3.0-0.48.el7_4.x86_64
> >
> > I can't figure this out but the client reports bad-sequence-id in
> > duplicate in the logs:
> > Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881c52286220!
> > Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881c52286220!
> > Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff8810889cb020!
> > Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff8810889cb020!
> > Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
> > sequence-id error on an unconfirmed sequence ffff881b414b2620!
> >
> > wireshark capture shows only 1 BAD_SEQID reply from the server:
> > $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
> > proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> > tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> > 141         93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> > 0x80589398 Offset: 0 Length: <End of File>  nfs.seqid == 0x0000004e
> > nfs.seqid == 0x00000002  rpc.xid == 0x9990c61d
> > 142         93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> > In 141) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x9990c61d
> >
> > system call I have identified as triggering it is:
> > fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
> > len=1}) = -1 EIO (Input/output error)
>
> Can you simplify the trigger into something repeatable?  Can you determine
> if the client or the server has lost track of the sequence?
>

I have tried: I wrote some code to perform the fcntl F_RDLCK the same
way and ran it across thousands of machines without any success. I am
quite sure this is a symptom of something else, not the cause.
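
A sketch of that kind of test (hypothetical, not the exact code used —
several threads hammering the same read lock the failing syscall takes):

```python
import fcntl
import tempfile
import threading

def hammer(path, iterations=500):
    # Each thread opens its own descriptor and repeatedly takes and
    # drops the same 1-byte read lock at a 1 GiB offset, mirroring the
    # failing fcntl(F_SETLK, F_RDLCK) call. Read locks are shared, so
    # the threads do not block one another.
    with open(path, "rb") as f:
        for _ in range(iterations):
            fcntl.lockf(f, fcntl.LOCK_SH, 1, 1073741824)
            fcntl.lockf(f, fcntl.LOCK_UN, 1, 1073741824)

with tempfile.NamedTemporaryFile() as tmp:
    threads = [threading.Thread(target=hammer, args=(tmp.name,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
print("no EIO triggered")
```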

Is there a better way of tracking sequences other than monitoring the
network traffic?

> > The server filesystem is ZFS though NFS sharing is turned off via ZFS
> > options and it's exported using /etc/exports / nfsd...
> >
> > The BAD_SEQID error seems to be fairly random, we have over 2000
> > machines connected to the share and it's experienced frequently but
> > randomly across our clients.
> >
> > It's worth mentioning that the majority of the clients are mounting
> > 4.0; we did try 4.1 everywhere but hit this:
> > https://access.redhat.com/solutions/3146191
>
> This was fixed in kernel-3.10.0-735.el7, FWIW..
>
> Ben

Thanks good to know, am planning an update soon but have been stuck on
3.10.0-693 for other reasons.

Leon


* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-18  2:20   ` Leon Kyneur
@ 2019-09-18 11:04     ` Benjamin Coddington
  2019-09-18 14:32       ` Olga Kornievskaia
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin Coddington @ 2019-09-18 11:04 UTC (permalink / raw)
  To: Leon Kyneur; +Cc: linux-nfs

On 17 Sep 2019, at 22:20, Leon Kyneur wrote:

> On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington 
> <bcodding@redhat.com> wrote:
>>
>> On 12 Sep 2019, at 4:27, Leon Kyneur wrote:
>>
>>> Hi
>>>
>>> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot take fcntl
>>> locks on any file on the share. The problem goes away after a full
>>> umount && mount (mount -o remount does not resolve the issue).
>>>
>>> Client:
>>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
>>>
>>> Server:
>>> EL 7.4 3.10.0-693.5.2.el7.x86_64  nfs-utils-1.3.0-0.48.el7_4.x86_64
>>>
>>> I can't figure this out but the client reports bad-sequence-id in
>>> duplicate in the logs:
>>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff881c52286220!
>>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff881c52286220!
>>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff8810889cb020!
>>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff8810889cb020!
>>> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
>>> sequence-id error on an unconfirmed sequence ffff881b414b2620!
>>>
>>> wireshark capture shows only 1 BAD_SEQID reply from the server:
>>> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
>>> proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
>>> tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
>>> 141         93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
>>> 0x80589398 Offset: 0 Length: <End of File>  nfs.seqid == 0x0000004e
>>> nfs.seqid == 0x00000002  rpc.xid == 0x9990c61d
>>> 142         93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
>>> In 141) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x9990c61d
>>>
>>> system call I have identified as triggering it is:
>>> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
>>> len=1}) = -1 EIO (Input/output error)
>>
>> Can you simplify the trigger into something repeatable?  Can you
>> determine if the client or the server has lost track of the sequence?
>>
>
> I have tried: I wrote some code to perform the fcntl F_RDLCK the same
> way and ran it across thousands of machines without any success. I am
> quite sure this is a symptom of something else, not the cause.
>
> Is there a better way of tracking sequences other than monitoring the
> network traffic?

I think that's the best way, right now.  We do have tracepoints for
nfs4 open and close that show the sequence numbers on the client, but
I'm not sure about how to get that from the server side.  I don't think
we have seqid for locks in tracepoints... I could be missing something.
Not only that, but you might not get tracepoint output showing the
sequence numbers if you're in an error-handling path.

If you have a wire capture of the event, you should be able to go
backwards from the error and figure out what the sequence number on the
state should be for the operation that received BAD_SEQID by finding
the last sequence-mutating (OPEN, CLOSE, LOCK) operation for that
stateid that did not return an error.
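
That walk-back can be sketched roughly as below. The record layout and
names are illustrative, not tshark's actual output format (in practice you
would extract the fields with something like tshark -T fields), and
NFSv4.0 seqids are really per open-owner/lock-owner, so "state" here is a
stand-in for whichever owner the stateid belongs to:

```python
# Ops that advance an owner's seqid when the server accepts them.
SEQ_MUTATING = {"OPEN", "CLOSE", "LOCK", "LOCKU", "OPEN_DOWNGRADE"}

def expected_seqid(records, bad_frame):
    """Walk backwards from the frame that drew BAD_SEQID and return the
    seqid the client should have used: one past the seqid of the last
    sequence-mutating op on the same state that did not error."""
    bad = next(r for r in records if r[0] == bad_frame)
    for frame, op, state, seqid, status in reversed(records):
        if (frame < bad_frame and state == bad[2]
                and op in SEQ_MUTATING and status == "NFS4_OK"):
            return seqid + 1
    return None  # no prior successful sequence-mutating op in capture

# Hypothetical records shaped like the frames in this thread:
records = [
    (68, "OPEN", "owner-A", 0x3b, "NFS4_OK"),
    (70, "LOCK", "owner-A", 0x3c, "NFS4ERR_BAD_SEQID"),
]
print(hex(expected_seqid(records, 70)))  # prints 0x3c
```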

Ben


* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-18 11:04     ` Benjamin Coddington
@ 2019-09-18 14:32       ` Olga Kornievskaia
  2019-09-19  4:22         ` Leon Kyneur
  0 siblings, 1 reply; 9+ messages in thread
From: Olga Kornievskaia @ 2019-09-18 14:32 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Leon Kyneur, linux-nfs

Hi folks,

The bad_seqid error could have been the bug in 7.4:
https://access.redhat.com/solutions/3354251. It's been fixed in
kernel-3.10.0-693.23.1.el7. Can you try to update and see if that
helps? The bug was that the client was sending a double CLOSE, throwing
off the seqid use.

On Wed, Sep 18, 2019 at 9:07 AM Benjamin Coddington <bcodding@redhat.com> wrote:
>
> On 17 Sep 2019, at 22:20, Leon Kyneur wrote:
>
> > On Tue, Sep 17, 2019 at 7:28 PM Benjamin Coddington
> > <bcodding@redhat.com> wrote:
> >>
> >> On 12 Sep 2019, at 4:27, Leon Kyneur wrote:
> >>
> >>> Hi
> >>>
> >>> I'm experiencing an issue on NFS 4.0 + 4.1 where we cannot take fcntl
> >>> locks on any file on the share. The problem goes away after a full
> >>> umount && mount (mount -o remount does not resolve the issue).
> >>>
> >>> Client:
> >>> EL 7.4 3.10.0-693.5.2.el7.x86_64 nfs-utils-1.3.0-0.48.el7_4.x86_64
> >>>
> >>> Server:
> >>> EL 7.4 3.10.0-693.5.2.el7.x86_64  nfs-utils-1.3.0-0.48.el7_4.x86_64
> >>>
> >>> I can't figure this out but the client reports bad-sequence-id in
> >>> duplicate in the logs:
> >>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> >>> sequence-id error on an unconfirmed sequence ffff881c52286220!
> >>> Sep 12 02:16:59 client kernel: NFS: v4 server returned a bad
> >>> sequence-id error on an unconfirmed sequence ffff881c52286220!
> >>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> >>> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> >>> Sep 12 02:17:39 client kernel: NFS: v4 server returned a bad
> >>> sequence-id error on an unconfirmed sequence ffff8810889cb020!
> >>> Sep 12 02:17:44 client kernel: NFS: v4 server returned a bad
> >>> sequence-id error on an unconfirmed sequence ffff881b414b2620!
> >>>
> >>> wireshark capture shows only 1 BAD_SEQID reply from the server:
> >>> $ tshark -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid -z
> >>> proto,colinfo,nfs.seqid,nfs.seqid -R 'rpc.xid == 0x9990c61d'
> >>> tshark: -R without -2 is deprecated. For single-pass filtering use -Y.
> >>> 141         93 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> >>> 0x80589398 Offset: 0 Length: <End of File>  nfs.seqid == 0x0000004e
> >>> nfs.seqid == 0x00000002  rpc.xid == 0x9990c61d
> >>> 142         93 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> >>> In 141) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x9990c61d
> >>>
> >>> system call I have identified as triggering it is:
> >>> fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824,
> >>> len=1}) = -1 EIO (Input/output error)
> >>
> >> Can you simplify the trigger into something repeatable?  Can you
> >> determine if the client or the server has lost track of the sequence?
> >>
> >
> > I have tried: I wrote some code to perform the fcntl F_RDLCK the same
> > way and ran it across thousands of machines without any success. I am
> > quite sure this is a symptom of something else, not the cause.
> >
> > Is there a better way of tracking sequences other than monitoring the
> > network traffic?
>
> I think that's the best way, right now.  We do have tracepoints for
> nfs4 open and close that show the sequence numbers on the client, but
> I'm not sure about how to get that from the server side.  I don't think
> we have seqid for locks in tracepoints... I could be missing something.
> Not only that, but you might not get tracepoint output showing the
> sequence numbers if you're in an error-handling path.
>
> If you have a wire capture of the event, you should be able to go
> backwards from the error and figure out what the sequence number on the
> state should be for the operation that received BAD_SEQID by finding
> the last sequence-mutating (OPEN, CLOSE, LOCK) operation for that
> stateid that did not return an error.
>
> Ben


* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-18 14:32       ` Olga Kornievskaia
@ 2019-09-19  4:22         ` Leon Kyneur
  2019-09-19 13:36           ` Olga Kornievskaia
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Kyneur @ 2019-09-19  4:22 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Benjamin Coddington, linux-nfs

On Wed, Sep 18, 2019 at 10:32 PM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> Hi folks,
>
> The bad_seqid error could have been the bug in 7.4:
> https://access.redhat.com/solutions/3354251. It's been fixed in
> kernel-3.10.0-693.23.1.el7. Can you try to update and see if that
> helps? The bug was that the client was sending a double CLOSE, throwing
> off the seqid use.
>

Hi Olga

I did see that but discounted it, as I am not seeing a duplicate CLOSE
for the same StateID when the client is affected. Could this have
happened at some point earlier in time to trigger the condition? I
don't understand how I can reproduce it.

When it's affected I don't see the duplicate close:

$ tshark -n -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
|grep -E "CLOSE|BAD|LOCK|OPEN|ACCESS"
 14         10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
FH: 0xfd9d2fb5, [Check: RD LU MD XT DL]  rpc.xid == 0x6590c61d
 15         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 14) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6590c61d
 19         10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
FH: 0xfd9d2fb5, [Check: RD LU MD XT DL]  rpc.xid == 0x6790c61d
 20         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 19) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU]  rpc.xid ==
0x6790c61d
 21         10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0x272b8d23, [Check: RD LU MD XT DL]  rpc.xid == 0x6890c61d
 22         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 21) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6890c61d
 24         10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0x54d7a1df, [Check: RD LU MD XT DL]  rpc.xid == 0x6990c61d
 25         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 24) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6990c61d
 33         11 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
0x272b8d23/.bash_history  rpc.xid == 0x6c90c61d
 34         11 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
In 33) OPEN StateID: 0x0f49  rpc.xid == 0x6c90c61d
 36         11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call
OPEN_CONFIRM  rpc.xid == 0x6d90c61d
 37         11 172.27.255.28 -> 172.27.30.129 NFS 140 V4 Reply (Call
In 36) OPEN_CONFIRM  rpc.xid == 0x6d90c61d
 38         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0x01be  rpc.xid == 0x6e90c61d
 39         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 38) CLOSE  rpc.xid == 0x6e90c61d
 44         11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
0x272b8d23/.bashrc  rpc.xid == 0x7090c61d
 45         11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 44) OPEN StateID: 0x13f2  rpc.xid == 0x7090c61d
 46         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0x13f2  rpc.xid == 0x7190c61d
 47         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 46) CLOSE  rpc.xid == 0x7190c61d
 48         11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0xfb7c5154, [Check: RD MD XT XE]  rpc.xid == 0x7290c61d
 49         11 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 48) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]  rpc.xid ==
0x7290c61d
 50         11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
0x272b8d23/.bashenv  rpc.xid == 0x7390c61d
 51         11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 50) OPEN StateID: 0xca9b  rpc.xid == 0x7390c61d
 52         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0xca9b  rpc.xid == 0x7490c61d
 53         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 52) CLOSE  rpc.xid == 0x7490c61d
 55         16 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
0x272b8d23/.bash_history  rpc.xid == 0x7590c61d
 56         16 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 55) OPEN StateID: 0xf3ed  rpc.xid == 0x7590c61d
 58         16 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0xf3ed  rpc.xid == 0x7690c61d
 59         16 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 58) CLOSE  rpc.xid == 0x7690c61d
 61         21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0xa77c94de, [Check: RD LU MD XT DL]  rpc.xid == 0x7790c61d
 62         21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 61) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x7790c61d
 64         21 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
0xa77c94de/a.out  rpc.xid == 0x7890c61d
 65         21 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
In 64) OPEN StateID: 0xb877  rpc.xid == 0x7890c61d
 66         21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
FH: 0xbfe01adc, [Check: RD LU MD XT DL]  rpc.xid == 0x7990c61d
 67         21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
In 66) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x7990c61d
 68         21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
0xbfe01adc/m.db  rpc.xid == 0x7a90c61d
 69         21 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
In 68) OPEN StateID: 0x8101  rpc.xid == 0x7a90c61d
 70         21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
0x80589398 Offset: 0 Length: <End of File>  rpc.xid == 0x7b90c61d
 71         21 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
In 70) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x7b90c61d
 72         21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0xb877  rpc.xid == 0x7c90c61d
 73         21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 72) CLOSE  rpc.xid == 0x7c90c61d
 74         21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
StateID: 0x8101  rpc.xid == 0x7d90c61d
 75         21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
In 74) CLOSE  rpc.xid == 0x7d90c61d



>
> On Wed, Sep 18, 2019 at 9:07 AM Benjamin Coddington <bcodding@redhat.com> wrote:
> >


> > > Is there a better way of tracking sequences other than monitoring the
> > > network traffic?
> >
> > I think that's the best way, right now.  We do have tracepoints for
> > nfs4 open and close that show the sequence numbers on the client, but
> > I'm not sure about how to get that from the server side.  I don't think
> > we have seqid for locks in tracepoints... I could be missing something.
> > Not only that, but you might not get tracepoint output showing the
> > sequence numbers if you're in an error-handling path.
> >
> > If you have a wire capture of the event, you should be able to go
> > backwards from the error and figure out what the sequence number on the
> > state should be for the operation that received BAD_SEQID by finding
> > the last sequence-mutating (OPEN, CLOSE, LOCK) operation for that
> > stateid that did not return an error.
> >
> > Ben


Thanks Ben, I'll persist with the network monitoring for now. The
sequence ids do appear to be sequential when it's in the error state. I
don't see it skipping any, if that's what the error would indicate.


Client
~~~~~

Here is the sequence-id that gets returned as bad:

$ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003c && frame.number <= 70'
 70         21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
0x80589398 Offset: 0 Length: <End of File>  rpc.xid == 0x7b90c61d

prior sequence is the OPEN as expected:

$ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003b && frame.number <= 70'
 68         21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
0xbfe01adc/m.db  rpc.xid == 0x7a90c61d

Server
~~~~~
bad-sequence

$ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003c && frame.number <= 70'
  1         65 172.27.30.129 -> 172.27.255.28 NFS 350 V4 Call (Reply
In 2) LOCK FH: 0x80589398 Offset: 0 Length: <End of File>  rpc.xid ==
0x7b90c61d
  2         65 172.27.30.129 -> 172.27.255.28 NFS 302 V4 Call (Reply
In 3) CLOSE StateID: 0xb877  rpc.xid == 0x7c90c61d

prior sequence

$ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
-R 'nfs.seqid == 0x0000003b && frame.number <= 70'
  1         65 172.27.30.129 -> 172.27.255.28 NFS 366 V4 Call (Reply
In 2) OPEN DH: 0xbfe01adc/m.db  rpc.xid == 0x7a90c61d

If it is indeed Red Hat KB 3354251 then there is greater motivation to
try a kernel upgrade again, though I'm just trying to confirm that I
have not hit something new here.

Leon


* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-19  4:22         ` Leon Kyneur
@ 2019-09-19 13:36           ` Olga Kornievskaia
  2019-09-24  4:17             ` Leon Kyneur
  0 siblings, 1 reply; 9+ messages in thread
From: Olga Kornievskaia @ 2019-09-19 13:36 UTC (permalink / raw)
  To: Leon Kyneur; +Cc: Benjamin Coddington, linux-nfs

Hi Leon,

Given that you have a network trace with the BAD_SEQID error, can you
filter on the file handle and trace the operations? It should be
possible to tell at that point whether it's the client or the server
that's sending an incorrect seqid. I'm assuming we can trace the
locking stateid either to a previous (successful) use or to an open.
The provided tshark output isn't enough to see what was inside the
packets.

If the LOCK that's failing used a seqid that wasn't bootstrapped from
the open seqid, then it can't be caused by the double-CLOSE problem.
But if this LOCK was bootstrapped from the open stateid, we need to
trace open-owner sequencing, and a double CLOSE could have been the
cause.
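
As a toy model (not the actual client code) of how a double CLOSE throws
off NFSv4.0 open-owner sequencing: both sides advance the owner's seqid
on each accepted sequence-mutating op, so an op that the buggy client
counts but the server rejects leaves the two counters permanently out of
step, and every later op from that owner draws BAD_SEQID:

```python
class Server:
    """Tracks an open-owner's seqid; accepts an op only if its seqid is
    exactly one past the last accepted one (NFSv4.0-style)."""
    def __init__(self):
        self.seqid = 0

    def handle(self, seqid):
        if seqid != self.seqid + 1:
            return "NFS4ERR_BAD_SEQID"
        self.seqid = seqid
        return "NFS4_OK"

server = Server()
client_seqid = 0

def send_op():
    global client_seqid
    client_seqid += 1
    return server.handle(client_seqid)

assert send_op() == "NFS4_OK"   # OPEN
assert send_op() == "NFS4_OK"   # CLOSE
# Buggy duplicate CLOSE: the server rejects it without advancing its
# counter, but the broken client advances its own seqid anyway.
client_seqid += 1
print(send_op())                # next LOCK -> NFS4ERR_BAD_SEQID
```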

On Thu, Sep 19, 2019 at 12:23 AM Leon Kyneur <leonk@dug.com> wrote:
>
> On Wed, Sep 18, 2019 at 10:32 PM Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > Hi folks,
> >
> > The bad_seqid error could have been the bug in 7.4:
> > https://access.redhat.com/solutions/3354251. It's been fixed in
> > kernel-3.10.0-693.23.1.el7. Can you try to update and see if that
> > helps? The bug was that the client was sending a double CLOSE, throwing
> > off the seqid use.
> >
>
> Hi Olga
>
> I did see that but discounted it, as I am not seeing a duplicate CLOSE
> for the same StateID when the client is affected. Could this have
> happened at some point earlier in time to trigger the condition? I
> don't understand how I can reproduce it.
>
> When it's affected I don't see the duplicate close:
>
> $ tshark -n -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> |grep -E "CLOSE|BAD|LOCK|OPEN|ACCESS"
>  14         10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
> FH: 0xfd9d2fb5, [Check: RD LU MD XT DL]  rpc.xid == 0x6590c61d
>  15         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 14) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6590c61d
>  19         10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
> FH: 0xfd9d2fb5, [Check: RD LU MD XT DL]  rpc.xid == 0x6790c61d
>  20         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 19) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU]  rpc.xid ==
> 0x6790c61d
>  21         10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0x272b8d23, [Check: RD LU MD XT DL]  rpc.xid == 0x6890c61d
>  22         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 21) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6890c61d
>  24         10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0x54d7a1df, [Check: RD LU MD XT DL]  rpc.xid == 0x6990c61d
>  25         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 24) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6990c61d
>  33         11 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
> 0x272b8d23/.bash_history  rpc.xid == 0x6c90c61d
>  34         11 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
> In 33) OPEN StateID: 0x0f49  rpc.xid == 0x6c90c61d
>  36         11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call
> OPEN_CONFIRM  rpc.xid == 0x6d90c61d
>  37         11 172.27.255.28 -> 172.27.30.129 NFS 140 V4 Reply (Call
> In 36) OPEN_CONFIRM  rpc.xid == 0x6d90c61d
>  38         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x01be  rpc.xid == 0x6e90c61d
>  39         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 38) CLOSE  rpc.xid == 0x6e90c61d
>  44         11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0x272b8d23/.bashrc  rpc.xid == 0x7090c61d
>  45         11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 44) OPEN StateID: 0x13f2  rpc.xid == 0x7090c61d
>  46         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x13f2  rpc.xid == 0x7190c61d
>  47         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 46) CLOSE  rpc.xid == 0x7190c61d
>  48         11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xfb7c5154, [Check: RD MD XT XE]  rpc.xid == 0x7290c61d
>  49         11 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 48) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]  rpc.xid ==
> 0x7290c61d
>  50         11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0x272b8d23/.bashenv  rpc.xid == 0x7390c61d
>  51         11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 50) OPEN StateID: 0xca9b  rpc.xid == 0x7390c61d
>  52         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xca9b  rpc.xid == 0x7490c61d
>  53         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 52) CLOSE  rpc.xid == 0x7490c61d
>  55         16 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
> 0x272b8d23/.bash_history  rpc.xid == 0x7590c61d
>  56         16 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 55) OPEN StateID: 0xf3ed  rpc.xid == 0x7590c61d
>  58         16 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xf3ed  rpc.xid == 0x7690c61d
>  59         16 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 58) CLOSE  rpc.xid == 0x7690c61d
>  61         21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xa77c94de, [Check: RD LU MD XT DL]  rpc.xid == 0x7790c61d
>  62         21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 61) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x7790c61d
>  64         21 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0xa77c94de/a.out  rpc.xid == 0x7890c61d
>  65         21 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 64) OPEN StateID: 0xb877  rpc.xid == 0x7890c61d
>  66         21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xbfe01adc, [Check: RD LU MD XT DL]  rpc.xid == 0x7990c61d
>  67         21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 66) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x7990c61d
>  68         21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
> 0xbfe01adc/m.db  rpc.xid == 0x7a90c61d
>  69         21 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
> In 68) OPEN StateID: 0x8101  rpc.xid == 0x7a90c61d
>  70         21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File>  rpc.xid == 0x7b90c61d
>  71         21 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> In 70) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x7b90c61d
>  72         21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xb877  rpc.xid == 0x7c90c61d
>  73         21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 72) CLOSE  rpc.xid == 0x7c90c61d
>  74         21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x8101  rpc.xid == 0x7d90c61d
>  75         21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 74) CLOSE  rpc.xid == 0x7d90c61d
>
>
>
> >
> > On Wed, Sep 18, 2019 at 9:07 AM Benjamin Coddington <bcodding@redhat.com> wrote:
> > >
>
>
> > > > Is there a better way of tracking sequences other than monitoring the
> > > > network traffic?
> > >
> > > I think that's the best way, right now.  We do have tracepoints for
> > > nfs4 open and close that show the sequence numbers on the client, but
> > > I'm
> > > not sure about how to get that from the server side.  I don't think we
> > > have
> > > seqid for locks in tracepoints.. I could be missing something.  Not only
> > > that, but you might not get tracepoint output showing the sequence
> > > numbers
> > > if you're in an error-handling path.
> > >
> > > If you have a wire capture of the event, you should be able to go
> > > backwards
> > > from the error and figure out what the sequence number on the state
> > > should
> > > be for the operation that received BAD_SEQID by finding the last
> > > sequence-mutating (OPEN,CLOSE,LOCK) operation for that stateid that did
> > > not
> > > return an error.
> > >
> > > Ben
>
>
> Thanks Ben, I'll persist with the network monitoring for now. The
> sequence ids do appear to be sequential when the client is in the
> error state; I don't see it skipping any, if that's what the error
> would indicate.
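
The walk-back Ben describes can be scripted against tshark field output. A minimal sketch (the row layout and sample data here are illustrative, not real capture output):

```python
# Sketch of the walk-back: collect the sequence-mutating calls
# (OPEN/CLOSE/LOCK and friends) for one owner, then walk backwards from the
# frame that drew NFS4ERR_BAD_SEQID to the last call that succeeded; the
# expected seqid is that call's seqid + 1.

SEQ_MUTATING = {"OPEN", "OPEN_CONFIRM", "OPEN_DOWNGRADE", "CLOSE", "LOCK"}

def expected_seqid(ops, bad_frame):
    """Return the seqid the server should have accepted at bad_frame."""
    prior = [o for o in ops if o["frame"] < bad_frame]
    for o in reversed(prior):
        if o["op"] in SEQ_MUTATING and o["status"] == "NFS4_OK":
            return o["seqid"] + 1
    return None  # no prior mutating op in the capture window

# Mirrors frames 68/70 of the broken capture: the OPEN with seqid 0x3b
# succeeded, so 0x3c on the LOCK was exactly what the client should send.
ops = [
    {"frame": 68, "op": "OPEN", "seqid": 0x3b, "status": "NFS4_OK"},
    {"frame": 70, "op": "LOCK", "seqid": 0x3c, "status": "NFS4ERR_BAD_SEQID"},
]
print(hex(expected_seqid(ops, bad_frame=70)))  # prints 0x3c
```

Applied to the traces above, the walk-back agrees with the client: 0x3b (OPEN) followed by 0x3c (LOCK) is the correct progression.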
>
>
> Client
> ~~~~~
>
> Here is the sequence-id that gets returned as bad:
>
> $ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003c && frame.number <= 70'
>  70         21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File>  rpc.xid == 0x7b90c61d
>
> prior sequence is the OPEN as expected:
>
> $ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003b && frame.number <= 70'
>  68         21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
> 0xbfe01adc/m.db  rpc.xid == 0x7a90c61d
>
> Server
> ~~~~~
> bad-sequence
>
> $ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003c && frame.number <= 70'
>   1         65 172.27.30.129 -> 172.27.255.28 NFS 350 V4 Call (Reply
> In 2) LOCK FH: 0x80589398 Offset: 0 Length: <End of File>  rpc.xid ==
> 0x7b90c61d
>   2         65 172.27.30.129 -> 172.27.255.28 NFS 302 V4 Call (Reply
> In 3) CLOSE StateID: 0xb877  rpc.xid == 0x7c90c61d
>
> prior sequence
>
> $ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003b && frame.number <= 70'
>   1         65 172.27.30.129 -> 172.27.255.28 NFS 366 V4 Call (Reply
> In 2) OPEN DH: 0xbfe01adc/m.db  rpc.xid == 0x7a90c61d
>
> If it is indeed Red Hat KB 3354251, then there is greater motivation to
> try a kernel upgrade again; I'm just trying to confirm I have not hit
> something new here.
>
> Leon

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-19 13:36           ` Olga Kornievskaia
@ 2019-09-24  4:17             ` Leon Kyneur
  2019-09-24 14:45               ` Olga Kornievskaia
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Kyneur @ 2019-09-24  4:17 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Benjamin Coddington, linux-nfs

On Thu, Sep 19, 2019 at 9:36 PM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> Hi Leon,
>
> Given that you have a network trace with the BAD_SEQID error, can you
> filter on the file handle and trace the operations? It should be
> possible to tell at that point whether it's the client or the server
> that's sending an incorrect seqid. I'm assuming we can trace the
> locking stateid either to a previous (successful) use or to an open.
> The provided tshark output isn't enough to see what was inside the
> packets.
>
> If the LOCK that's failing used a seqid that wasn't bootstrapped from
> the open seqid, then it can't be caused by a double-CLOSE problem. But
> if this LOCK was bootstrapped from the open stateid, we need to trace
> the open-owner sequencing, and a double close could have been the
> cause.


Hi Olga

I don't ever see a double close in my packet captures. Here's a more
verbose dump. I'm not quite sure what you mean re LOCK "wasn't
bootstrapped from the open seqid", but that does seem to explain it, as
I don't see the 0x00000001 seqid from the OPEN call reused later in the
LOCK FH call. The working example (also below) does show this.

So I'm still wondering if this is the double-close issue or something else?
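
For context, the application-side operation behind the failing LOCK calls (offset 1073741824, length 1, visible in the dumps below) is an ordinary fcntl byte-range lock. A minimal Python equivalent, using a local temp file rather than the real m.db on the mount:

```python
import fcntl
import os
import tempfile

# Exclusive byte-range lock of length 1 at offset 1 GiB, matching the
# "LOCK FH: ... Offset: 1073741824 Length: 1" calls in the traces.
# A local temp file stands in for m.db; on an NFSv4 mount the same
# lockf() call is what turns into the on-the-wire LOCK.
with tempfile.TemporaryFile() as f:
    fcntl.lockf(f, fcntl.LOCK_EX, 1, 1073741824, os.SEEK_SET)  # acquire
    fcntl.lockf(f, fcntl.LOCK_UN, 1, 1073741824, os.SEEK_SET)  # release
print("lock/unlock ok")
```

Locking a range beyond EOF is valid under POSIX, which is why the 1 GiB offset works on a small file.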

# tshark -2 -n -r ../nfs.pcap   -z
proto,colinfo,nfs.stateid4.other,nfs.stateid4.other
...
 22          3 172.27.50.222 -> 172.27.255.28 NFS 370 V4 Call (Reply
In 23) OPEN DH: 0xbfe01adc/m.db  nfs.seqid == 0x00000022
 23          3 172.27.255.28 -> 172.27.50.222 NFS 394 V4 Reply (Call
In 22) OPEN StateID: 0x7999  nfs.seqid == 0x00000001
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
 24          3 172.27.50.222 -> 172.27.255.28 NFS 290 V4 Call (Reply
In 25) OPEN_CONFIRM  nfs.seqid == 0x00000001  nfs.seqid == 0x00000023
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
 25          3 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 24) OPEN_CONFIRM  nfs.seqid == 0x00000002  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
 26          3 172.27.50.222 -> 172.27.255.28 NFS 370 V4 Call (Reply
In 27) OPEN DH: 0x272b8d23/.sqliterc  nfs.seqid == 0x00000024
 27          3 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 26) OPEN Status: NFS4ERR_NOENT
 28          3 172.27.50.222 -> 172.27.255.28 NFS 374 V4 Call (Reply
In 29) OPEN DH: 0x272b8d23/.sqlite_history  nfs.seqid == 0x00000025
 29          3 172.27.255.28 -> 172.27.50.222 NFS 430 V4 Reply (Call
In 28) OPEN StateID: 0x3203  nfs.seqid == 0x00000001  nfs.seqid ==
0x00000001  nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e5:02:00:00
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e6:02:00:00
 30          3 172.27.50.222 -> 172.27.255.28 NFS 306 V4 Call (Reply
In 31) CLOSE StateID: 0x3203  nfs.seqid == 0x00000026  nfs.seqid ==
0x00000001  nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:e5:02:00:00
 31          3 172.27.255.28 -> 172.27.50.222 NFS 202 V4 Reply (Call
In 30) CLOSE  nfs.seqid == 0x00000002  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:e5:02:00:00
 32          3 172.27.50.222 -> 172.27.255.28 NFS 306 V4 Call (Reply
In 33) LOOKUP DH: 0x272b8d23/.terminfo
 33          3 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 32) LOOKUP Status: NFS4ERR_NOENT
 34          3 172.27.50.222 -> 172.27.255.28 NFS 290 V4 Call (Reply
In 35) ACCESS FH: 0x272b8d23, [Check: RD LU MD XT DL]
 35          3 172.27.255.28 -> 172.27.50.222 NFS 194 V4 Reply (Call
In 34) ACCESS, [Allowed: RD LU MD XT DL]
 36          3 172.27.50.222 -> 172.27.255.28 NFS 282 V4 Call (Reply
In 37) GETATTR FH: 0x272b8d23
 37          3 172.27.255.28 -> 172.27.50.222 NFS 266 V4 Reply (Call
In 36) GETATTR
 38          3 172.27.50.222 -> 172.27.255.28 NFS 302 V4 Call (Reply
In 39) LOOKUP DH: 0x272b8d23/.inputrc
 39          3 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 38) LOOKUP Status: NFS4ERR_NOENT
 40          3 172.27.50.222 -> 172.27.255.28 TCP 66 822 > 2049 [ACK]
Seq=3950 Ack=2617 Win=3077 Len=0 TSval=177154789 TSecr=3474577216
 41          5 172.27.50.222 -> 172.27.255.28 NFS 354 V4 Call (Reply
In 42) LOCK FH: 0x80589398 Offset: 1073741824 Length: 1   nfs.seqid ==
0x00000027  nfs.seqid == 0x00000002  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:e3:02:00:00
 42          5 172.27.255.28 -> 172.27.50.222 NFS 122 V4 Reply (Call
In 41) LOCK Status: NFS4ERR_BAD_SEQID
 43          5 172.27.50.222 -> 172.27.255.28 TCP 66 822 > 2049 [ACK]
Seq=4238 Ack=2673 Win=3077 Len=0 TSval=177156701 TSecr=3474579167


working:

  7          2 172.27.50.222 -> 172.27.255.28 NFS 370 V4 Call (Reply
In 8) OPEN DH: 0xbfe01adc/m.db  nfs.seqid == 0x00000017
  8          2 172.27.255.28 -> 172.27.50.222 NFS 394 V4 Reply (Call
In 7) OPEN StateID: 0x08c1  nfs.seqid == 0x00000001
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fc:02:00:00
  9          2 172.27.50.222 -> 172.27.255.28 NFS 374 V4 Call (Reply
In 10) OPEN DH: 0x272b8d23/.sqlite_history  nfs.seqid == 0x00000018
 10          2 172.27.255.28 -> 172.27.50.222 NFS 430 V4 Reply (Call
In 9) OPEN StateID: 0x147a  nfs.seqid == 0x00000001  nfs.seqid ==
0x00000001  nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fd:02:00:00
nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fe:02:00:00
 11          2 172.27.50.222 -> 172.27.255.28 NFS 306 V4 Call (Reply
In 12) CLOSE StateID: 0x147a  nfs.seqid == 0x00000019  nfs.seqid ==
0x00000001  nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:fd:02:00:00
 12          2 172.27.255.28 -> 172.27.50.222 NFS 202 V4 Reply (Call
In 11) CLOSE  nfs.seqid == 0x00000002  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:fd:02:00:00
 13          2 172.27.50.222 -> 172.27.255.28 TCP 66 919 > 2049 [ACK]
Seq=1293 Ack=1157 Win=40 Len=0 TSval=177379723 TSecr=3474802150
 14          4 172.27.50.222 -> 172.27.255.28 NFS 354 V4 Call (Reply
In 15) LOCK FH: 0x80589398 Offset: 1073741824 Length: 1   nfs.seqid ==
0x0000001a  nfs.seqid == 0x00000001  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:fc:02:00:00
 15          4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 14) LOCK  nfs.seqid == 0x00000001  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 16          4 172.27.50.222 -> 172.27.255.28 TCP 66 919 > 2049 [ACK]
Seq=1581 Ack=1229 Win=40 Len=0 TSval=177381305 TSecr=3474803772
 17          4 172.27.50.222 -> 172.27.255.28 NFS 318 V4 Call (Reply
In 18) LOCK FH: 0x80589398 Offset: 1073741826 Length: 510   nfs.seqid
== 0x00000001  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 18          4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 17) LOCK  nfs.seqid == 0x00000002  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 19          4 172.27.50.222 -> 172.27.255.28 NFS 310 V4 Call (Reply
In 20) LOCKU FH: 0x80589398 Offset: 1073741824 Length: 1   nfs.seqid
== 0x00000002  nfs.seqid == 0x00000002  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 20          4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 19) LOCKU  nfs.seqid == 0x00000003  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 21          4 172.27.50.222 -> 172.27.255.28 NFS 298 V4 Call (Reply
In 22) READ StateID: 0x2656 Offset: 0 Len: 6144  nfs.seqid ==
0x00000003  nfs.stateid4.other == 8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 22          4 172.27.255.28 -> 172.27.50.222 NFS 6274 V4 Reply (Call
In 21) READ
 23          4 172.27.50.222 -> 172.27.255.28 TCP 66 919 > 2049 [ACK]
Seq=2309 Ack=7581 Win=40 Len=0 TSval=177381309 TSecr=3474803775
 24          4 172.27.50.222 -> 172.27.255.28 NFS 310 V4 Call (Reply
In 25) LOCKU FH: 0x80589398 Offset: 0 Length: <End of File>  nfs.seqid
== 0x00000003  nfs.seqid == 0x00000003  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 25          4 172.27.255.28 -> 172.27.50.222 NFS 138 V4 Reply (Call
In 24) LOCKU  nfs.seqid == 0x00000004  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:ff:02:00:00
 26          4 172.27.50.222 -> 172.27.255.28 NFS 190 V4 Call (Reply
In 28) RELEASE_LOCKOWNER
 27          4 172.27.50.222 -> 172.27.255.28 NFS 354 V4 Call (Reply
In 29) LOCK FH: 0x80589398 Offset: 1073741824 Length: 1   nfs.seqid ==
0x0000001b  nfs.seqid == 0x00000001  nfs.stateid4.other ==
8f:73:7b:5d:30:ab:b5:d0:fc:02:00:00
 28          4 172.27.255.28 -> 172.27.50.222 NFS 114 V4 Reply (Call
In 26) RELEASE_LOCKOWNER
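
A toy model of the per-open-owner seqid check may help explain the symptom (my sketch, not the full RFC 7530 state machine): the server expects exactly last_seqid + 1 on every sequence-mutating call and answers anything else with NFS4ERR_BAD_SEQID, so once the two sides disagree by even one increment, every later attempt by that owner fails the same way:

```python
# Toy server-side view of one open-owner's seqid (sketch only).
class OwnerState:
    def __init__(self, confirmed_seqid):
        self.seqid = confirmed_seqid

    def check(self, call_seqid):
        # NFSv4.0 requires exactly last + 1 on seqid-mutating ops;
        # the seqid is not advanced when an error is returned.
        if call_seqid != self.seqid + 1:
            return "NFS4ERR_BAD_SEQID"
        self.seqid = call_seqid
        return "NFS4_OK"

server = OwnerState(confirmed_seqid=0x25)
assert server.check(0x26) == "NFS4_OK"            # CLOSE (cf. frame 30)
server.seqid -= 1  # simulate the server losing track of one increment
assert server.check(0x27) == "NFS4ERR_BAD_SEQID"  # LOCK (cf. frame 41)
assert server.check(0x28) == "NFS4ERR_BAD_SEQID"  # and every retry after
print("model behaves as expected")
```

This matches the observed behaviour: the disagreement persists until state is re-established from scratch, which is what a fresh umount/mount does and a `-o remount` does not.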

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
  2019-09-24  4:17             ` Leon Kyneur
@ 2019-09-24 14:45               ` Olga Kornievskaia
  0 siblings, 0 replies; 9+ messages in thread
From: Olga Kornievskaia @ 2019-09-24 14:45 UTC (permalink / raw)
  To: Leon Kyneur; +Cc: Benjamin Coddington, linux-nfs

Hi Leon,

Given this trace, I don't see anything that the client is doing wrong.
I'd say this points to the server losing track of the seqid.

On Tue, Sep 24, 2019 at 12:18 AM Leon Kyneur <leonk@dug.com> wrote:
>
> On Thu, Sep 19, 2019 at 9:36 PM Olga Kornievskaia <aglo@umich.edu> wrote:
> > [...]
>
>
> Hi Olga
>
> I don't ever see a double close in my packet captures. Here's a more
> verbose dump. I'm not quite sure what you mean re LOCK "wasn't
> bootstrapped from the open seqid", but that does seem to explain it, as
> I don't see the 0x00000001 seqid from the OPEN call reused later in the
> LOCK FH call. The working example (also below) does show this.
>
> So I'm still wondering if this is the double-close issue or something else?
>
> [... verbose tshark dumps snipped; quoted in full in the parent message ...]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-12  8:27 troubleshooting LOCK FH and NFS4ERR_BAD_SEQID Leon Kyneur
2019-09-17 11:28 ` Benjamin Coddington
2019-09-18  2:20   ` Leon Kyneur
2019-09-18 11:04     ` Benjamin Coddington
2019-09-18 14:32       ` Olga Kornievskaia
2019-09-19  4:22         ` Leon Kyneur
2019-09-19 13:36           ` Olga Kornievskaia
2019-09-24  4:17             ` Leon Kyneur
2019-09-24 14:45               ` Olga Kornievskaia

Linux-NFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-nfs/0 linux-nfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-nfs linux-nfs/ https://lore.kernel.org/linux-nfs \
		linux-nfs@vger.kernel.org linux-nfs@archiver.kernel.org
	public-inbox-index linux-nfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-nfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox