Linux-NFS Archive on lore.kernel.org
From: Olga Kornievskaia <aglo@umich.edu>
To: Leon Kyneur <leonk@dug.com>
Cc: Benjamin Coddington <bcodding@redhat.com>,
	linux-nfs <linux-nfs@vger.kernel.org>
Subject: Re: troubleshooting LOCK FH and NFS4ERR_BAD_SEQID
Date: Thu, 19 Sep 2019 09:36:37 -0400
Message-ID: <CAN-5tyHqtQMJdrm1WUSjmiT7mrRd4HDNNmvZ=fdWjOoOu54ApA@mail.gmail.com>
In-Reply-To: <CAACwWuPxPmbZFTpLf0_Lsh+yJqz_JQrSGUY5_621P4MGd1H_wA@mail.gmail.com>

Hi Leon,

Given that you have a network trace with the BAD_SEQID error, can you
filter on the file handle and trace the operations on it? At that
point it should be possible to tell whether it's the client or the
server that is using an incorrect seqid. I'm assuming we can trace the
lock stateid either to a previous (successful) use or to an open. The
tshark summary output alone isn't enough to see what was inside the
packets.

If the LOCK that's failing used a seqid that wasn't bootstrapped from
the open seqid, then it can't be caused by a double-CLOSE problem. But
if this LOCK was bootstrapped from the open stateid, we need to trace
the open-owner sequencing, and a double CLOSE could have been the
cause.
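
To make that concrete, here is a toy sketch of the open-owner seqid
bookkeeping (cf. RFC 7530, section 9.1.7). All names here are
illustrative only, not the Linux client or server code:

```python
# Toy model of NFSv4.0 open-owner seqid bookkeeping (cf. RFC 7530,
# section 9.1.7). Names and structure are illustrative, not kernel code.

class ServerOpenOwner:
    def __init__(self):
        self.seqid = 0              # last seqid the server accepted
        self.open_stateids = set()  # currently-open stateids for this owner

    def op(self, seqid, kind, stateid):
        if seqid != self.seqid + 1:
            return "NFS4ERR_BAD_SEQID"
        if kind == "CLOSE" and stateid not in self.open_stateids:
            # BAD_STATEID is one of the errors for which the server
            # does NOT advance the open-owner seqid.
            return "NFS4ERR_BAD_STATEID"
        self.seqid = seqid
        if kind == "OPEN":
            self.open_stateids.add(stateid)
        elif kind == "CLOSE":
            self.open_stateids.discard(stateid)
        return "NFS4_OK"

server = ServerOpenOwner()
client_seqid = 0

def send(kind, stateid):
    global client_seqid
    client_seqid += 1   # buggy client: bumps its seqid unconditionally
    return server.op(client_seqid, kind, stateid)

assert send("OPEN", "A") == "NFS4_OK"
assert send("CLOSE", "A") == "NFS4_OK"
# Duplicate CLOSE: the server rejects it without advancing its seqid,
# but the client already bumped its own copy -- the two are now out
# of sync.
assert send("CLOSE", "A") == "NFS4ERR_BAD_STATEID"
# Every later seqid-mutating op from this open-owner arrives one ahead:
assert send("LOCK", "A") == "NFS4ERR_BAD_SEQID"
```

The real rules list more non-advancing errors, but the shape of the
failure is the same: once the client's copy of the seqid runs ahead,
every later seqid-mutating op from that owner gets BAD_SEQID.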

On Thu, Sep 19, 2019 at 12:23 AM Leon Kyneur <leonk@dug.com> wrote:
>
> On Wed, Sep 18, 2019 at 10:32 PM Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > Hi folks,
> >
> > The bad_seqid error could have been the bug in 7.4
> > https://access.redhat.com/solutions/3354251. It's been fixed in
> > kernel-3.10.0-693.23.1.el7. Can you try to update and see if that
> > helps? The bug was that the client was sending a double CLOSE,
> > throwing off the seqid use.
> >
>
> Hi Olga
>
> I did see that, but discounted it as I am not seeing a duplicate
> CLOSE for the same stateid when the client is affected. Could this
> have happened at some earlier point in time to trigger the
> condition? I don't understand how I could reproduce it.
>
> When it's affected I don't see the duplicate close:
>
> $ tshark -n -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> |grep -E "CLOSE|BAD|LOCK|OPEN|ACCESS"
>  14         10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
> FH: 0xfd9d2fb5, [Check: RD LU MD XT DL]  rpc.xid == 0x6590c61d
>  15         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 14) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6590c61d
>  19         10 172.27.30.129 -> 172.27.255.28 NFS 276 V4 Call ACCESS
> FH: 0xfd9d2fb5, [Check: RD LU MD XT DL]  rpc.xid == 0x6790c61d
>  20         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 19) ACCESS, [Access Denied: MD XT DL], [Allowed: RD LU]  rpc.xid ==
> 0x6790c61d
>  21         10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0x272b8d23, [Check: RD LU MD XT DL]  rpc.xid == 0x6890c61d
>  22         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 21) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6890c61d
>  24         10 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0x54d7a1df, [Check: RD LU MD XT DL]  rpc.xid == 0x6990c61d
>  25         10 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 24) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x6990c61d
>  33         11 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
> 0x272b8d23/.bash_history  rpc.xid == 0x6c90c61d
>  34         11 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
> In 33) OPEN StateID: 0x0f49  rpc.xid == 0x6c90c61d
>  36         11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call
> OPEN_CONFIRM  rpc.xid == 0x6d90c61d
>  37         11 172.27.255.28 -> 172.27.30.129 NFS 140 V4 Reply (Call
> In 36) OPEN_CONFIRM  rpc.xid == 0x6d90c61d
>  38         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x01be  rpc.xid == 0x6e90c61d
>  39         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 38) CLOSE  rpc.xid == 0x6e90c61d
>  44         11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0x272b8d23/.bashrc  rpc.xid == 0x7090c61d
>  45         11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 44) OPEN StateID: 0x13f2  rpc.xid == 0x7090c61d
>  46         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x13f2  rpc.xid == 0x7190c61d
>  47         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 46) CLOSE  rpc.xid == 0x7190c61d
>  48         11 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xfb7c5154, [Check: RD MD XT XE]  rpc.xid == 0x7290c61d
>  49         11 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 48) ACCESS, [Access Denied: XE], [Allowed: RD MD XT]  rpc.xid ==
> 0x7290c61d
>  50         11 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0x272b8d23/.bashenv  rpc.xid == 0x7390c61d
>  51         11 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 50) OPEN StateID: 0xca9b  rpc.xid == 0x7390c61d
>  52         11 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xca9b  rpc.xid == 0x7490c61d
>  53         11 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 52) CLOSE  rpc.xid == 0x7490c61d
>  55         16 172.27.30.129 -> 172.27.255.28 NFS 372 V4 Call OPEN DH:
> 0x272b8d23/.bash_history  rpc.xid == 0x7590c61d
>  56         16 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 55) OPEN StateID: 0xf3ed  rpc.xid == 0x7590c61d
>  58         16 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xf3ed  rpc.xid == 0x7690c61d
>  59         16 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 58) CLOSE  rpc.xid == 0x7690c61d
>  61         21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xa77c94de, [Check: RD LU MD XT DL]  rpc.xid == 0x7790c61d
>  62         21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 61) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x7790c61d
>  64         21 172.27.30.129 -> 172.27.255.28 NFS 364 V4 Call OPEN DH:
> 0xa77c94de/a.out  rpc.xid == 0x7890c61d
>  65         21 172.27.255.28 -> 172.27.30.129 NFS 432 V4 Reply (Call
> In 64) OPEN StateID: 0xb877  rpc.xid == 0x7890c61d
>  66         21 172.27.30.129 -> 172.27.255.28 NFS 288 V4 Call ACCESS
> FH: 0xbfe01adc, [Check: RD LU MD XT DL]  rpc.xid == 0x7990c61d
>  67         21 172.27.255.28 -> 172.27.30.129 NFS 196 V4 Reply (Call
> In 66) ACCESS, [Allowed: RD LU MD XT DL]  rpc.xid == 0x7990c61d
>  68         21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
> 0xbfe01adc/m.db  rpc.xid == 0x7a90c61d
>  69         21 172.27.255.28 -> 172.27.30.129 NFS 396 V4 Reply (Call
> In 68) OPEN StateID: 0x8101  rpc.xid == 0x7a90c61d
>  70         21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File>  rpc.xid == 0x7b90c61d
>  71         21 172.27.255.28 -> 172.27.30.129 NFS 124 V4 Reply (Call
> In 70) LOCK Status: NFS4ERR_BAD_SEQID  rpc.xid == 0x7b90c61d
>  72         21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0xb877  rpc.xid == 0x7c90c61d
>  73         21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 72) CLOSE  rpc.xid == 0x7c90c61d
>  74         21 172.27.30.129 -> 172.27.255.28 NFS 304 V4 Call CLOSE
> StateID: 0x8101  rpc.xid == 0x7d90c61d
>  75         21 172.27.255.28 -> 172.27.30.129 NFS 204 V4 Reply (Call
> In 74) CLOSE  rpc.xid == 0x7d90c61d
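
A quick way to confirm that absence programmatically (the parsing is
hypothetical; rows are scraped by hand from tshark summaries like the
capture above):

```python
# Detect a double CLOSE -- the same stateid CLOSEd twice, which is the
# signature of the RHEL 7.4 bug. Rows are (frame, op, stateid) tuples
# scraped from tshark summary output like the capture above.
from collections import Counter

def double_closes(rows):
    closes = Counter(st for _, op, st in rows if op == "CLOSE")
    return sorted(st for st, n in closes.items() if n > 1)

rows = [
    (33, "OPEN",  "0x0f49"),
    (38, "CLOSE", "0x01be"),
    (44, "OPEN",  "0x13f2"),
    (46, "CLOSE", "0x13f2"),
    (72, "CLOSE", "0xb877"),
    (74, "CLOSE", "0x8101"),
]
print(double_closes(rows))   # [] -- no stateid is closed twice here
```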
>
>
>
> >
> > On Wed, Sep 18, 2019 at 9:07 AM Benjamin Coddington <bcodding@redhat.com> wrote:
> > >
>
>
> > > > Is there a better way of tracking sequences other than monitoring the
> > > > network traffic?
> > >
> > > I think that's the best way, right now.  We do have tracepoints for
> > > nfs4 open and close that show the sequence numbers on the client, but
> > > I'm
> > > not sure about how to get that from the server side.  I don't think we
> > > have
> > > seqid for locks in tracepoints.. I could be missing something.  Not only
> > > that, but you might not get tracepoint output showing the sequence
> > > numbers
> > > if you're in an error-handling path.
> > >
> > > If you have a wire capture of the event, you should be able to go
> > > backwards
> > > from the error and figure out what the sequence number on the state
> > > should
> > > be for the operation that received BAD_SEQID by finding the last
> > > sequence-mutating (OPEN,CLOSE,LOCK) operation for that stateid that did
> > > not
> > > return an error.
> > >
> > > Ben
>
>
> Thanks Ben, I'll persist with the network monitoring for now. The
> sequence ids do appear to be sequential when the client is in the
> error state; I don't see it skipping any, if that's what the error
> implies.
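
The backwards check Ben describes can also be scripted once the tshark
columns are scraped into rows. A rough sketch (the parsing is
hypothetical, and real seqids are per-owner, which this flattens):

```python
# Rough sketch of the walk-backwards check: given (frame, op, seqid,
# status) rows in capture order, find the last sequence-mutating op
# that completed without error before the failing frame, and derive
# the seqid the failing op should have carried.
MUTATING = {"OPEN", "CLOSE", "LOCK"}

def expected_seqid(rows, error_frame):
    """rows: list of (frame, op, seqid, status) tuples, in capture order."""
    last_good = None
    for frame, op, seqid, status in rows:
        if frame >= error_frame:
            break
        if op in MUTATING and status == "OK":
            last_good = seqid
    return None if last_good is None else last_good + 1

# Using the seqids from the captures below: the OPEN in frame 68
# succeeded with 0x3b, so the LOCK in frame 70 should carry 0x3c --
# which it did, yet the server still returned BAD_SEQID.
rows = [
    (68, "OPEN", 0x3b, "OK"),
    (70, "LOCK", 0x3c, "NFS4ERR_BAD_SEQID"),
]
print(hex(expected_seqid(rows, error_frame=70)))  # 0x3c
```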
>
>
> Client
> ~~~~~
>
> Here is the sequence-id that gets returned as bad:
>
> $ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003c && frame.number <= 70'
>  70         21 172.27.30.129 -> 172.27.255.28 NFS 352 V4 Call LOCK FH:
> 0x80589398 Offset: 0 Length: <End of File>  rpc.xid == 0x7b90c61d
>
> prior sequence is the OPEN as expected:
>
> $ tshark -n -2 -r client_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003b && frame.number <= 70'
>  68         21 172.27.30.129 -> 172.27.255.28 NFS 368 V4 Call OPEN DH:
> 0xbfe01adc/m.db  rpc.xid == 0x7a90c61d
>
> Server
> ~~~~~
> bad-sequence
>
> $ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003c && frame.number <= 70'
>   1         65 172.27.30.129 -> 172.27.255.28 NFS 350 V4 Call (Reply
> In 2) LOCK FH: 0x80589398 Offset: 0 Length: <End of File>  rpc.xid ==
> 0x7b90c61d
>   2         65 172.27.30.129 -> 172.27.255.28 NFS 302 V4 Call (Reply
> In 3) CLOSE StateID: 0xb877  rpc.xid == 0x7c90c61d
>
> prior sequence
>
> $ tshark -n -2 -r server_broken.pcap -z proto,colinfo,rpc.xid,rpc.xid
> -R 'nfs.seqid == 0x0000003b && frame.number <= 70'
>   1         65 172.27.30.129 -> 172.27.255.28 NFS 366 V4 Call (Reply
> In 2) OPEN DH: 0xbfe01adc/m.db  rpc.xid == 0x7a90c61d
>
> If it is indeed the bug from Red Hat KB 3354251, then there is
> greater motivation to try a kernel upgrade again; I'm just trying to
> confirm I have not hit something new here.
>
> Leon

Thread overview: 9+ messages
2019-09-12  8:27 Leon Kyneur
2019-09-17 11:28 ` Benjamin Coddington
2019-09-18  2:20   ` Leon Kyneur
2019-09-18 11:04     ` Benjamin Coddington
2019-09-18 14:32       ` Olga Kornievskaia
2019-09-19  4:22         ` Leon Kyneur
2019-09-19 13:36           ` Olga Kornievskaia [this message]
2019-09-24  4:17             ` Leon Kyneur
2019-09-24 14:45               ` Olga Kornievskaia
