linux-nfs.vger.kernel.org archive mirror
* NFS hangs on one interface
@ 2019-10-23  0:34 Chandler
  2019-10-23 17:15 ` J. Bruce Fields
  2019-11-20 18:43 ` Chandler
  0 siblings, 2 replies; 10+ messages in thread
From: Chandler @ 2019-10-23  0:34 UTC (permalink / raw)
  To: linux-nfs

Hi all, I'm sure you get this a lot, but I couldn't figure out a solution.  We have a client/server pair with both 1Gb and 10Gb network interfaces.  I can mount the share on the client over the 1Gb interface just fine and interact with it normally.  If I unmount and try to mount the share over the 10Gb interface, it mounts, but everything after that (like ls or df) hangs.  The exports entries are the same on the server, i.e.:

#1Gb interface
/data   10.10.10.0/24(rw,no_root_squash,async)
#10Gb interface
/data   128.196.X.X/28(rw,no_root_squash,async)

I turned off iptables for troubleshooting and checked with the NOC here.  We are using NFSv4 by default on CentOS 6.10 with the 2.6.32 kernel.  I got some strange results when I tried vers=3 or vers=2: I could "ls /data", but "ls /data/subdir" hung again.  Now it doesn't even mount with vers=3 or vers=2.




* Re: NFS hangs on one interface
  2019-10-23  0:34 NFS hangs on one interface Chandler
@ 2019-10-23 17:15 ` J. Bruce Fields
  2019-10-24 23:40   ` Chandler
  2019-11-20 18:43 ` Chandler
  1 sibling, 1 reply; 10+ messages in thread
From: J. Bruce Fields @ 2019-10-23 17:15 UTC (permalink / raw)
  To: Chandler; +Cc: linux-nfs

Beats me.  My first guess would be some kind of networking problem.
Maybe try running wireshark and watching to see if certain calls aren't
getting responses.

--b.


* Re: NFS hangs on one interface
  2019-10-23 17:15 ` J. Bruce Fields
@ 2019-10-24 23:40   ` Chandler
  2019-10-25  0:37     ` Rick Macklem
  0 siblings, 1 reply; 10+ messages in thread
From: Chandler @ 2019-10-24 23:40 UTC (permalink / raw)
  Cc: linux-nfs

Thanks Bruce.
Does anyone know how to use Wireshark's command-line "tshark" to capture this data?  I tried running it, but it captures far too much traffic.  Is there a particular port or ports I could tell it to monitor, such as 2049?  Thanks



* Re: NFS hangs on one interface
  2019-10-24 23:40   ` Chandler
@ 2019-10-25  0:37     ` Rick Macklem
  2019-10-30  6:30       ` Chandler
  0 siblings, 1 reply; 10+ messages in thread
From: Rick Macklem @ 2019-10-25  0:37 UTC (permalink / raw)
  To: Chandler; +Cc: linux-nfs

I usually use tcpdump to do a raw packet capture. Something like:
# tcpdump -s 0 -w out.pcap host <nfs-server>
(<nfs-server> is the hostname of the other machine, client or server)
<ctrl>C  <-- when you think you have enough

Then you can read out.pcap into wireshark.
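To keep the capture small, the same tcpdump invocation can be narrowed with a capture filter. This is only a minimal sketch, assuming NFSv4 over TCP on the default port 2049. The helper name `nfs_filter`, the `RUN_CAPTURE` guard, and the `PEER_A`, `PEER_B` and `IFACE` variables are hypothetical and not from this thread.

```shell
#!/bin/sh
# Build a pcap filter matching traffic between two hosts on one port.
# The port defaults to 2049, the standard NFS port.
nfs_filter() {
    printf 'host %s and host %s and port %s' "$1" "$2" "${3:-2049}"
}

# Capture only when explicitly asked to; tcpdump needs root.
# PEER_A, PEER_B and IFACE are placeholders to fill in.
if [ "${RUN_CAPTURE:-0}" = "1" ]; then
    tcpdump -i "${IFACE:-eth2}" -s 0 -w out.pcap \
        "$(nfs_filter "${PEER_A:?set PEER_A}" "${PEER_B:?set PEER_B}")"
fi
```

For NFSv3 the filter would also need port 111 (rpcbind) and whatever port the server's mountd listens on, so the port-only filter is most useful for v4.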

rick


* Re: NFS hangs on one interface
  2019-10-25  0:37     ` Rick Macklem
@ 2019-10-30  6:30       ` Chandler
  2019-11-05  0:28         ` Chandler
  0 siblings, 1 reply; 10+ messages in thread
From: Chandler @ 2019-10-30  6:30 UTC (permalink / raw)
  To: linux-nfs

Does this tcpdump help at all?  I ran:
tcpdump -i eth2 -s 0 -w out.pcap host x.4 and x.2

Then I tried "mount -v x.2:/data /data" in another terminal, waited until after the timeout, pressed ^C, and then killed tcpdump.


   1   0.000000 x.4 -> x.2 TCP 66 739 > nfs [ACK] Seq=1 Ack=1 Win=140 Len=0 TSval=2837501537 TSecr=1577364421
   2   0.000034 x.4 -> x.2 NFS 206 [TCP Previous segment not captured] V3 READ Call, FH:0x48bf584a Offset:0 Len:131072
   3   0.000127 x.2 -> x.4 TCP 60 nfs > 739 [RST] Seq=1 Win=0 Len=0
   4   0.000148 x.2 -> x.4 TCP 60 nfs > 739 [RST] Seq=1 Win=0 Len=0
   5   3.000003 x.4 -> x.2 TCP 74 [TCP Port numbers reused] 739 > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=2837504537 TSecr=0 WS=128
   6   3.000182 x.2 -> x.4 TCP 74 nfs > 739 [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=1578327421 TSecr=2837504537 WS=128
   7   3.000205 x.4 -> x.2 TCP 66 739 > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=2837504537 TSecr=1578327421
   8   3.000228 x.4 -> x.2 NFS 206 V3 READ Call, FH:0x48bf584a Offset:0 Len:131072
   9   3.000261 x.2 -> x.4 TCP 66 nfs > 739 [ACK] Seq=1 Ack=141 Win=19072 Len=0 TSval=1578327421 TSecr=2837504537
  10   4.100351 x.4 -> x.2 NFS 194 V4 Call PUTROOTFH | GETATTR
  11   4.139630 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=4113 Ack=129 Win=157 Len=0 TSval=1578328561 TSecr=2837505637
  12  44.294611 x.2 -> x.4 TCP 66 netconfsoaphttp > nfs [ACK] Seq=1 Ack=1 Win=140 Len=0 TSval=1578368716 TSecr=2837365831
  13  44.294628 x.4 -> x.2 TCP 66 [TCP ACKed unseen segment] nfs > netconfsoaphttp [ACK] Seq=3969 Ack=2 Win=149 Len=0 TSval=2837545831 TSecr=1578188716
  14  44.294634 x.2 -> x.4 TCP 66 [TCP Previous segment not captured] netconfsoaphttp > nfs [FIN, ACK] Seq=2 Ack=1 Win=140 Len=0 TSval=1578368716 TSecr=2837365831
  15  44.294688 x.4 -> x.2 TCP 66 [TCP ACKed unseen segment] nfs > netconfsoaphttp [FIN, ACK] Seq=3969 Ack=3 Win=149 Len=0 TSval=2837545831 TSecr=1578368716
  16  44.294734 x.2 -> x.4 TCP 60 netconfsoaphttp > nfs [RST] Seq=3 Win=0 Len=0
  17  47.294699 x.2 -> x.4 TCP 74 [TCP Port numbers reused] netconfsoaphttp > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=1578371716 TSecr=0 WS=128
  18  47.294723 x.4 -> x.2 TCP 74 nfs > netconfsoaphttp [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=2837548831 TSecr=1578371716 WS=128
  19  47.294770 x.2 -> x.4 TCP 66 netconfsoaphttp > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=1578371716 TSecr=2837548831
  20  47.294794 x.2 -> x.4 NFS 270 V4 Call READDIR FH:0xcb5e6e28
  21  47.294802 x.4 -> x.2 TCP 66 nfs > netconfsoaphttp [ACK] Seq=1 Ack=205 Win=19072 Len=0 TSval=2837548831 TSecr=1578371716
  22  47.294975 x.4 -> x.2 NFS 4034 V4 Reply (Call In 20) READDIR
  23  47.495918 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  24  47.897918 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  25  48.701955 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  26  50.309927 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  27  53.525953 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  28  59.957895 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  29  62.999962 x.4 -> x.2 TCP 66 [TCP Keep-Alive] 739 > nfs [ACK] Seq=140 Ack=1 Win=17920 Len=0 TSval=2837564537 TSecr=1578327421
  30  63.000082 x.2 -> x.4 TCP 66 [TCP Previous segment not captured] nfs > 739 [ACK] Seq=17897 Ack=141 Win=19072 Len=0 TSval=1578387421 TSecr=2837504537
  31  64.099972 x.4 -> x.2 TCP 66 798 > nfs [FIN, ACK] Seq=129 Ack=1 Win=140 Len=0 TSval=2837565637 TSecr=1578304804
  32  64.100147 x.2 -> x.4 NFS 326 V4 Reply (Call In 10) PUTROOTFH | GETATTR
  33  64.100174 x.4 -> x.2 TCP 54 798 > nfs [RST] Seq=130 Win=0 Len=0
  34  67.099990 x.4 -> x.2 TCP 74 [TCP Port numbers reused] 798 > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=2837568637 TSecr=0 WS=128
  35  67.100135 x.2 -> x.4 TCP 74 nfs > 798 [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=1578391521 TSecr=2837568637 WS=128
  36  67.100158 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=2837568637 TSecr=1578391521
  37  67.100181 x.4 -> x.2 NFS 238 V4 Call READDIR FH:0x0366982c
  38  67.100222 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=173 Win=19072 Len=0 TSval=1578391521 TSecr=2837568637
  39  67.100241 x.4 -> x.2 NFS 194 V4 Call PUTROOTFH | GETATTR
  40  67.100285 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=301 Win=20096 Len=0 TSval=1578391521 TSecr=2837568637
  41  67.100332 x.2 -> x.4 NFS 326 V4 Reply (Call In 39) PUTROOTFH | GETATTR
  42  67.100339 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=301 Ack=261 Win=19072 Len=0 TSval=2837568637 TSecr=1578391521
  43  67.111864 x.4 -> x.2 NFS 198 V4 Call GETATTR FH:0x62d40c52
  44  67.111991 x.2 -> x.4 NFS 158 [TCP Previous segment not captured] V4 Reply (Call In 43) GETATTR
  45  67.112010 x.4 -> x.2 TCP 78 [TCP Dup ACK 43#1] 798 > nfs [ACK] Seq=433 Ack=261 Win=19072 Len=0 TSval=2837568649 TSecr=1578391521 SLE=4373 SRE=4465
  46  72.821950 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  47  98.549965 x.4 -> x.2 NFS 4034 [RPC duplicate of #22][TCP Retransmission] V4 Reply (Call In 20) READDIR
  48 107.294623 x.2 -> x.4 TCP 66 [TCP Keep-Alive] netconfsoaphttp > nfs [ACK] Seq=204 Ack=1 Win=17920 Len=0 TSval=1578431716 TSecr=2837548831
  49 107.294642 x.4 -> x.2 TCP 66 [TCP Keep-Alive ACK] nfs > netconfsoaphttp [ACK] Seq=3969 Ack=205 Win=19072 Len=0 TSval=2837608831 TSecr=1578371716
  50 122.999956 x.4 -> x.2 TCP 66 [TCP Keep-Alive] 739 > nfs [ACK] Seq=140 Ack=1 Win=17920 Len=0 TSval=2837624537 TSecr=1578327421
  51 123.000088 x.2 -> x.4 TCP 66 [TCP Keep-Alive ACK] nfs > 739 [ACK] Seq=17897 Ack=141 Win=19072 Len=0 TSval=1578447421 TSecr=2837504537
  52 123.354599 Solarfla_y -> Solarfla_z ARP 60 Who has x.4?  Tell x.2
  53 123.354613 Solarfla_z -> Solarfla_y ARP 42 x.4 is at 00:0f:53:z
  54 127.110922 x.4 -> x.2 TCP 78 798 > nfs [FIN, ACK] Seq=433 Ack=261 Win=19072 Len=0 TSval=2837628648 TSecr=1578391521 SLE=4373 SRE=4465
  55 127.111041 x.2 -> x.4 TCP 66 nfs > 798 [FIN, ACK] Seq=4465 Ack=434 Win=21120 Len=0 TSval=1578451532 TSecr=2837628648
  56 127.111068 x.4 -> x.2 TCP 54 798 > nfs [RST] Seq=434 Win=0 Len=0
  57 130.110999 x.4 -> x.2 TCP 74 [TCP Port numbers reused] 798 > nfs [SYN] Seq=0 Win=17920 Len=0 MSS=8960 SACK_PERM=1 TSval=2837631648 TSecr=0 WS=128
  58 130.111106 x.2 -> x.4 TCP 74 nfs > 798 [SYN, ACK] Seq=0 Ack=1 Win=17896 Len=0 MSS=8960 SACK_PERM=1 TSval=1578454532 TSecr=2837631648 WS=128
  59 130.111133 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=1 Ack=1 Win=17920 Len=0 TSval=2837631648 TSecr=1578454532
  60 130.111146 x.4 -> x.2 NFS 238 V4 Call READDIR FH:0x0366982c
  61 130.111162 x.4 -> x.2 NFS 198 V4 Call GETATTR FH:0x62d40c52
  62 130.111198 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=173 Win=19072 Len=0 TSval=1578454532 TSecr=2837631648
  63 130.111208 x.2 -> x.4 TCP 66 nfs > 798 [ACK] Seq=1 Ack=305 Win=20096 Len=0 TSval=1578454532 TSecr=2837631648
  64 130.111290 x.2 -> x.4 NFS 158 V4 Reply (Call In 61) GETATTR
  65 130.111296 x.4 -> x.2 TCP 66 798 > nfs [ACK] Seq=305 Ack=93 Win=17920 Len=0 TSval=2837631648 TSecr=1578454532
  66 130.111367 x.4 -> x.2 NFS 202 V4 Call GETATTR FH:0x62d40c52
  67 130.111430 x.2 -> x.4 NFS 178 V4 Reply (Call In 66) GETATTR
  68 130.111463 x.4 -> x.2 NFS 198 V4 Call GETATTR FH:0x62d40c52
  69 130.111513 x.2 -> x.4 NFS 158 [TCP Previous segment not captured] V4 Reply (Call In 68) GETATTR
  70 130.111521 x.4 -> x.2 TCP 78 [TCP Dup ACK 68#1] 798 > nfs [ACK] Seq=573 Ack=205 Win=17920 Len=0 TSval=2837631648 TSecr=1578454532 SLE=4317 SRE=4409


* Re: NFS hangs on one interface
  2019-10-30  6:30       ` Chandler
@ 2019-11-05  0:28         ` Chandler
  2019-11-05 16:24           ` Olga Kornievskaia
  0 siblings, 1 reply; 10+ messages in thread
From: Chandler @ 2019-11-05  0:28 UTC (permalink / raw)
  To: linux-nfs

Any ideas what's going on here?
Thanks


* Re: NFS hangs on one interface
  2019-11-05  0:28         ` Chandler
@ 2019-11-05 16:24           ` Olga Kornievskaia
       [not found]             ` <49472814-8dcd-4594-d48e-be4c1d9a8d8f@genome.arizona.edu>
  0 siblings, 1 reply; 10+ messages in thread
From: Olga Kornievskaia @ 2019-11-05 16:24 UTC (permalink / raw)
  To: Chandler; +Cc: linux-nfs

It's too hard to read this tcpdump-style network trace with multiple
NFS streams. The internals of the packets are hidden, so a full .cap
file would be much better.

Some things stick out. A v4.0 mount typically starts with a
SETCLIENTID. Yours starts with a PUTROOTFH, which means you already
have a 4.0 mount going to this server. "cat /proc/fs/nfsfs/servers"
would show you mounts to that server. If you are not expecting an
existing 4.0 mount (i.e., your "mount" command doesn't show that
server mounted), then things have already gone wrong and you have a
stuck mount which might be interfering with further mounts.

Are you experiencing issues with a fresh boot? Do you have the
ability/luxury to reboot the client machine?

Your problem description is confusing. Your last network trace is
about a failing v4.0 mount. Your initial description talks about
mounting with "vers=3" or "vers=2". So is the problem with a specific
NFS version, or is the problem with mounting over the 10Gb interface
with any NFS version?

You can also turn on rpcdebug messages (if your client machine isn't
getting a lot of NFS traffic), but your trace shows multiple streams,
so you'll have to dig through lots of output to follow your own NFS
operations.
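The two checks suggested above could look roughly like the following on the client. This is a sketch under assumptions: the procfs file is taken to be `/proc/fs/nfsfs/servers` (plural, as on mainline kernels), and `rpcdebug` is used with its `-m` (module), `-s` (set) and `-c` (clear) options. The helper `show_nfs_servers` and the `RUN_RPCDEBUG` guard are hypothetical names.

```shell
#!/bin/sh
# Print the NFS client's list of known servers, if the file is readable.
# The default path is an assumption; pass another path as $1 to override.
show_nfs_servers() {
    f="${1:-/proc/fs/nfsfs/servers}"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "no readable NFS client state at $f"
    fi
}

# Toggle client-side debugging only on request; it needs root and is noisy.
if [ "${RUN_RPCDEBUG:-0}" = "1" ]; then
    show_nfs_servers
    rpcdebug -m nfs -s all    # set all NFS client debug flags
    sleep 30                  # reproduce the hang in another terminal now
    rpcdebug -m nfs -c all    # clear the flags again
    dmesg | tail -n 50        # debug output goes to the kernel log
fi
```

Clearing the flags right after reproducing the problem keeps the kernel log short enough to follow a single operation.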


* Re: NFS hangs on one interface
       [not found]             ` <49472814-8dcd-4594-d48e-be4c1d9a8d8f@genome.arizona.edu>
@ 2019-11-11 15:46               ` Olga Kornievskaia
  2019-11-12 22:14                 ` Chandler
  0 siblings, 1 reply; 10+ messages in thread
From: Olga Kornievskaia @ 2019-11-11 15:46 UTC (permalink / raw)
  To: Chandler; +Cc: linux-nfs

Hi Chandler,

Given what you say, it sounds to me more like a generic networking
issue between this particular client and the server.

The debug messages show that the client can't reach the server:
Nov  8 17:58:21 NFSclient kernel: nfs: server x.2 not responding, still trying

I'd recommend making sure that your network works alright between
those interfaces. Perhaps running an iperf for a few minutes to make
sure you are seeing expected, consistent performance between those two
nodes. Another thing to check is whether you for some reason have
duplicate IPs in the system; that can show up as weird hangs.
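As a rough sketch of both checks, assuming iperf is installed on both machines and the iputils `arping` is available: the helper `count_reply_macs`, the `RUN_CHECKS` guard, and the `SERVER_IP` and `IFACE` variables are hypothetical, and the parsing assumes arping prints the replying MAC address in square brackets.

```shell
#!/bin/sh
# Count distinct MAC addresses in arping-style output read from stdin.
# More than one MAC answering for the same IP suggests a duplicate address.
count_reply_macs() {
    grep -o '\[[0-9A-Fa-f:]*]' | sort -u | wc -l | tr -d ' '
}

# Run the live checks only on request. SERVER_IP and IFACE are placeholders.
if [ "${RUN_CHECKS:-0}" = "1" ]; then
    # Two-minute throughput test; expects "iperf -s" already running on the server.
    iperf -c "${SERVER_IP:?set SERVER_IP}" -t 120 -i 10
    # ARP for the server address and count how many MACs answer.
    arping -I "${IFACE:-eth2}" -c 3 "$SERVER_IP" | count_reply_macs
fi
```

A count of 1 is the expected result; anything higher is worth chasing before looking further at NFS.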

On Fri, Nov 8, 2019 at 8:22 PM Chandler <admin@genome.arizona.edu> wrote:
>
> Hi Olga, thanks so much for your help.
>
> I tried rebooting and am still having weird issues.  If I mount over the local network (10.x address) then there are no issues.  As soon as I try to mount over the 10G network, weird things happen.  For example, I can perform the mount just fine and do "ll /mount", but as soon as I try another directory like "ll /mount/users" it hangs.  Also, this only happens between these two machines with 10G interfaces.  The server with the 10G interface has several other 1G clients that are outside the local 10.x network and connect to it on the 10G interface, and those clients all work fine, so this seems to be an issue specific to this client on the 10G interface.
>
> In my earlier post, I did try troubleshooting with vers=3 and vers=2 just to see if that was the issue, but since then have been using the defaults (so vers=4).
>
> I turned on rpcdebug on both client and server with "rpcdebug -m nfs(d) all", but it seemed to lock up the server and I had to reboot it, so I will keep that off for now.  I attached a log of the debug messages from the client showing the commands I executed (snoopy) and the resulting kernel debug entries; I hope this helps.  The hang happens at the end, when I run ls -l on the users directory.  Let me know if there's anything else I can provide.
>


* Re: NFS hangs on one interface
  2019-11-11 15:46               ` Olga Kornievskaia
@ 2019-11-12 22:14                 ` Chandler
  0 siblings, 0 replies; 10+ messages in thread
From: Chandler @ 2019-11-12 22:14 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-nfs

Yes, I don't understand what's going on with the network.  I can ssh to the server from the client over the 10G interfaces, log in, and get to a prompt.  I can even run some commands, but as soon as I try "top" the session freezes.  top works just fine if I ssh from my workstation to the server over the 10G interface, and it works fine if I ssh from the client to the server over the 1G interface.  Maybe I should post on LinuxQuestions or something?



* Re: NFS hangs on one interface
  2019-10-23  0:34 NFS hangs on one interface Chandler
  2019-10-23 17:15 ` J. Bruce Fields
@ 2019-11-20 18:43 ` Chandler
  1 sibling, 0 replies; 10+ messages in thread
From: Chandler @ 2019-11-20 18:43 UTC (permalink / raw)
  To: linux-nfs

It seems the problem was a mismatched MTU setting on the switch.  Now that the port on the switch is set to 9000, everything is working.  Thanks for all your help.
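A mismatch like this can usually be confirmed from the hosts themselves. The SYN packets in the capture earlier in the thread advertise MSS=8960, which corresponds to a 9000-byte interface MTU on both ends, so a switch port with a smaller MTU would typically pass small packets and drop full-sized ones. That would be consistent with the 4034-byte READDIR reply being retransmitted over and over while small GETATTR replies got through. A minimal sketch, assuming IPv4, the iputils `ping`, and the `ip` tool; the helper `icmp_payload_for_mtu`, the `RUN_PROBE` guard, and the `SERVER_IP` and `IFACE` variables are hypothetical names:

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Probe the path only on request. SERVER_IP and IFACE are placeholders.
if [ "${RUN_PROBE:-0}" = "1" ]; then
    ip link show "${IFACE:-eth2}"    # shows the locally configured MTU
    # -M do prohibits fragmentation, so a hop with a smaller MTU
    # cannot forward the large probe.
    ping -M do -c 3 -s "$(icmp_payload_for_mtu 1500)" "${SERVER_IP:?set SERVER_IP}"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" "$SERVER_IP"
fi
```

If the 1472-byte probe is answered and the 8972-byte probe is not, something on the path has an MTU below 9000.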

