* Re: RDMA connection closed and not re-opened
       [not found] <f1e596cf-0e70-39af-99e9-a0a7e912bad3@genome.arizona.edu>
@ 2018-06-29 15:04 ` Chuck Lever
  2018-07-02 23:22   ` admin
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2018-06-29 15:04 UTC (permalink / raw)
  To: admin; +Cc: Linux NFS Mailing List

Hi Chandler-


> On Jun 28, 2018, at 8:23 PM, admin@genome.arizona.edu wrote:
> 
> Dear Chuck et al.,
> 
> Sorry for my late reply.  I have since lost the previous messages in
> my news client and gmane isn't very reliable anymore.  I am replying
> to the message-id A9E63254-22F5-48A7-85C2-8016D85CD192 [1] which was
> in reference to my original posts [2][3] (links in footer).
> 
> We keep having this problem and having to reset servers and losing
> work.  The latest incident involved 7 out of 9 of our NFS clients.
> I've attached the latest messages from these clients (n001.txt
> through n007.txt) as well as the messages from the server.
> 
> Here is a short summary in chronological order: I first notice a
> message on our server at Jun 27 19:09:03 in reference to Ganglia not
> being able to reach one of the data sources.  Not sure if it is
> related, but the message seems to only appear when there are these
> problems with the NFS... the next message doesn't happen until
> Jun 27 20:01:55.
> 
> On the clients, the first errors happen on n005,
> Jun 27 20:04:07 n005 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr ffff88204ea3b840 (stale): WR flushed
> 
> there are similar messages on n007 and n003 which happen at 20:04:09
> and 20:04:17.  However, I don't see these "WR flushed" messages on
> the other nodes.  These are accompanied by the INFO messages that our
> application (daligner) is being blocked, as well as the "rpcrdma:
> connection to 10.10.11.10:20049 closed (-103)" error.  After that the
> nodes become unresponsive to SSH, although Ganglia seems to still be
> able to collect some information from them, as I can see the load
> graphs continually increasing.

These are informational messages that are typical of network
problems or maybe the server has failed or is overloaded. I'm
especially inclined to think this is not a client issue because it
happens on multiple clients at around the same time.

These appear to be typical of all the clients:

Jun 27 20:07:07 n005 kernel: nfs: server 10.10.11.10 not responding, still trying
Jun 27 20:08:34 n005 kernel: rpcrdma: connection to 10.10.11.10:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jun 27 20:08:35 n005 kernel: nfs: server 10.10.11.10 OK
Jun 27 20:08:35 n005 kernel: nfs: server 10.10.11.10 not responding, still trying
Jun 27 20:08:35 n005 kernel: nfs: server 10.10.11.10 OK
Jun 27 20:13:59 n005 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr ffff88204f86b380 (stale): WR flushed
Jun 27 20:13:59 n005 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr ffff88204eea9180 (stale): WR flushed
Jun 27 20:13:59 n005 kernel: RPC:       rpcrdma_sendcq_process_wc: frmr ffff88204e743f80 (stale): WR flushed
Jun 27 20:15:43 n005 kernel: rpcrdma: connection to 10.10.11.10:20049 on mlx4_0, memreg 5 slots 32 ird 16
Jun 27 20:32:08 n005 kernel: rpcrdma: connection to 10.10.11.10:20049 closed (-103)

The "closed" message appears only in some client logs.

On the server:

Jun 27 20:08:34 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jun 27 20:08:34 pac kernel: nfsd: peername failed (err 107)!
Jun 27 20:08:34 pac kernel: nfsd: peername failed (err 107)!
Jun 27 20:08:35 pac kernel: svcrdma: failed to send reply chunks, rc=-5

This is suspicious. I don't have access to the CentOS 6.9 source
code, but it could mean that the server logic that transmits reply
chunks is broken, and the client is requesting an operation that
has to use reply chunks. That would cause a deadlock on that
connection because the client's recourse is to send that operation
again and again, but the server would repeatedly fail to reply.


> We haven't had this problem until recently.  I upgraded our cluster
> to add two additional nodes (n008 and n009, which have problems too
> and have to be rebooted) and we also added more storage to the
> server.  The jobs are submitted to the cluster via Sun Grid Engine,
> and in total there are about 61 jobs (daligner) that may start at
> once and open connections to the NFS server... is it too much work
> for NFS to handle?
> 
> Yes, both clients and servers have CentOS 6.9.  Is there a way to
> report this to Red Hat?  Otherwise I'm not sure of a way to report
> this to the "Linux distributor".

I don't know how to contact CentOS support, but that would be the
first step here: do the basic troubleshooting with people who are
familiar with that code base and with the tools that are available
in that distribution.

Perhaps a RH staffer on this list could provide some guidance?


> The machines are not completely updated and there appears to be a
> new kernel (2.6.32.696.30.1.el6) available as well as new nfs-utils
> (1:1.2.3-75.el6_9).  So not sure if updating those may help...

If there are no other constraints on your NFS server's kernel /
distribution, I recommend upgrading it to a recent update of CentOS
7 (not simply a newer CentOS 6 release).

IMO nfs-utils is not involved in these issues.


> If you do not see any solution to this old implementation then would
> you perhaps suggest I manually install the latest stable version of
> NFS on the clients and server?  In that case please let me know of
> any relevant configure flags I might need to use if you can think of
> any off the top of your head.

The NFS implementation is integrated into the Linux kernel, so it's
not a simple matter of "installing the latest stable version of NFS".


> Many Thanks,
> Chandler / Systems Administrator
> Arizona Genomics Institute
> www.genome.arizona.edu
> 
> --
> 1. https://marc.info/?l=linux-nfs&m=152545311928035&w=2
> 2. https://marc.info/?l=linux-nfs&m=152538002122612&w=2
> 3. https://marc.info/?l=linux-nfs&m=152538859227047&w=2
> 
> 
> <n001.txt><n002.txt><n003.txt><n004.txt><n005.txt><n006.txt><n007.txt><server.txt>

--
Chuck Lever
chucklever@gmail.com





* Re: RDMA connection closed and not re-opened
  2018-06-29 15:04 ` RDMA connection closed and not re-opened Chuck Lever
@ 2018-07-02 23:22   ` admin
  2018-07-03  2:44     ` Chuck Lever
  0 siblings, 1 reply; 14+ messages in thread
From: admin @ 2018-07-02 23:22 UTC (permalink / raw)
  To: Linux NFS Mailing List; +Cc: Chuck Lever

Thanks Chuck for your input, let me address it below like normal for 
mailing lists.  Although I'm confused as to why my message hasn't shown 
up on the mailing list, even though I'm subscribed with this address... 
I've written to owner-linux-nfs@vger.kernel.org regarding this 
discrepancy and it was rejected as spam, so now I'm waiting to hear from 
postmaster@vger.kernel.org.  I guess I'll need to continue to CC you as 
well in the meantime, since your responses show up on the mailing list 
at least...


Chuck Lever wrote on 06/29/2018 08:04 AM:
 > These are informational messages that are typical of network
 > problems or maybe the server has failed or is overloaded. I'm
 > especially inclined to think this is not a client issue because it
 > happens on multiple clients at around the same time.

Yes, it makes sense that this would be a server problem; however, I 
would think our server is more than capable of handling this.  Although 
it is an older server, it still has 2x 6-core Intel Xeon E5-2620 v2 @ 
2.10GHz with 128GB of RAM and maybe 10% utilization normally.  I have 
not watched the server when we start these daligner jobs, so that could 
be something I look for to see if I notice any bottlenecks... what is a 
typical bottleneck for NFS/RDMA?
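
(I'm thinking of simply watching some standard counters on the server
while the jobs ramp up; these are just the usual tools, nothing
specific to our setup:

  nfsstat -s                # per-operation NFS server counters
  cat /proc/net/rpc/nfsd    # raw nfsd stats, including the thread ("th") line
  iostat -x 5               # per-disk utilization and service times
  vmstat 5                  # memory pressure and CPU iowait

and then comparing a quiet period against the first minutes of a
daligner batch.)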


 > If there are no other constraints on your NFS server's kernel /
 > distribution, I recommend upgrading it to a recent update of CentOS
 > 7 (not simply a newer CentOS 6 release).

Unfortunately CentOS doesn't support upgrading from 6 to 7 and this 
machine is too critical to take down for a fresh 
installation/reconfiguration, so I have a feeling we'll need to figure 
out how to get the 6.9 kernel working.  I will try updating to the 
latest kernel on all of the nodes to see if it helps.




* Re: RDMA connection closed and not re-opened
  2018-07-02 23:22   ` admin
@ 2018-07-03  2:44     ` Chuck Lever
  2018-07-03 23:41       ` admin
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2018-07-03  2:44 UTC (permalink / raw)
  To: admin; +Cc: Linux NFS Mailing List, Chuck Lever


> On Jul 2, 2018, at 7:22 PM, admin@genome.arizona.edu wrote:
> 
> Thanks Chuck for your input, let me address it below like normal for
> mailing lists.  Although I'm confused as to why my message hasn't
> shown up on the mailing list, even though I'm subscribed with this
> address... I've written to owner-linux-nfs@vger.kernel.org regarding
> this discrepancy and it was rejected as spam, so now I'm waiting to
> hear from postmaster@vger.kernel.org.  I guess I'll need to continue
> to CC you as well in the meantime, since your responses show up on
> the mailing list at least...
> 
> 
> Chuck Lever wrote on 06/29/2018 08:04 AM:
> > These are informational messages that are typical of network
> > problems or maybe the server has failed or is overloaded. I'm
> > especially inclined to think this is not a client issue because it
> > happens on multiple clients at around the same time.
> 
> Yes, it makes sense that this would be a server problem; however, I
> would think our server is more than capable of handling this.
> Although it is an older server, it still has 2x 6-core Intel Xeon
> E5-2620 v2 @ 2.10GHz with 128GB of RAM and maybe 10% utilization
> normally.  I have not watched the server when we start these
> daligner jobs, so that could be something I look for to see if I
> notice any bottlenecks... what is a typical bottleneck for NFS/RDMA?

Please review all of my last email. I concluded the likely culprit is
a software bug, not server overload.


> > If there are no other constraints on your NFS server's kernel /
> > distribution, I recommend upgrading it to a recent update of CentOS
> > 7 (not simply a newer CentOS 6 release).
> 
> Unfortunately CentOS doesn't support upgrading from 6 to 7 and this
> machine is too critical to take down for a fresh
> installation/reconfiguration, so I have a feeling we'll need to
> figure out how to get the 6.9 kernel working.  I will try updating
> to the latest kernel on all of the nodes to see if it helps.

If CentOS 6 is required, CentOS / Red Hat really does need to be
involved as you troubleshoot. Any code changes will necessitate a new
kernel build that only they can provide.



* Re: RDMA connection closed and not re-opened
  2018-07-03  2:44     ` Chuck Lever
@ 2018-07-03 23:41       ` admin
  2018-07-12 22:44         ` admin
  0 siblings, 1 reply; 14+ messages in thread
From: admin @ 2018-07-03 23:41 UTC (permalink / raw)
  To: Linux NFS Mailing List

Chuck Lever wrote on 07/02/2018 07:44 PM:
> Please review all of my last email. I concluded the likely culprit is a software bug, not server overload.
> If CentOS 6 is required, CentOS / Red Hat really does need to be involved as you troubleshoot. Any code changes will necessitate a new kernel build that only they can provide.

Thanks, we will see how it goes with the latest kernel, and if there are 
still problems I'll look into filing a bug report with CentOS or something.


* Re: RDMA connection closed and not re-opened
  2018-07-03 23:41       ` admin
@ 2018-07-12 22:44         ` admin
  2018-07-13 14:36           ` Chuck Lever
  0 siblings, 1 reply; 14+ messages in thread
From: admin @ 2018-07-12 22:44 UTC (permalink / raw)
  To: Linux NFS Mailing List

> Thanks, we will see how it goes with the latest kernel, and if there are 
> still problems I'll look into filing a bug report with CentOS or something.

So, the latest CentOS kernel, 2.6.32-696.30.1, has not helped yet.  In 
the meantime we have reverted to using NFS/TCP over the gigabit 
Ethernet link, which creates a bottleneck for the full processing of our 
cluster, but at least hasn't crashed yet.
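
For reference, the only client-side difference is the mount options;
roughly something like the following (the export path and the server's
1GbE address here are illustrative):

  # what we had been running: NFS/RDMA over the IB fabric
  mount -t nfs -o vers=3,proto=rdma,port=20049 10.10.11.10:/working /working
  # what we are running now: NFS/TCP over the 1GbE link
  mount -t nfs -o vers=3,proto=tcp 10.10.10.10:/working /working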

I did notice that the hangups have all been after 8 PM in each 
occurrence.  Each night at 8 PM, the NFS server acts as an NFS client 
and runs a couple of rsnapshot jobs which back up to a different NFS 
server.  Even with NFS/TCP, the NFS server became unresponsive after 
8 PM when the rsnapshot jobs were running.  I can see in the system 
messages the same sort of errors with Ganglia we were seeing, as well 
as rsyslog dropping messages related to the ganglia process, as well 
as nfsd peername failed (err 107).  For example,

Jul 11 20:07:31 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
<repeated 13 times>
Jul 11 20:21:31 pac /usr/sbin/gmetad[3582]: RRD_update (/var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd): /var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd: illegal attempt to update using time 1531365691 when last update time is 1531365691 (minimum one second step)
<many messages like this from all the nodes n001-n009>
Jul 11 20:21:31 pac rsyslogd-2177: imuxsock begins to drop messages from pid 3582 due to rate-limiting
Jul 11 20:22:25 pac rsyslogd-2177: imuxsock lost 116 messages from pid 3582 due to rate-limiting
Jul 11 20:22:25 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
<bunch more of these and RRD_update errors>
Jul 11 20:41:54 pac rsyslogd-2177: imuxsock begins to drop messages from pid 3582 due to rate-limiting
Jul 11 20:42:34 pac rsyslogd-2177: imuxsock lost 116 messages from pid 3582 due to rate-limiting
Jul 11 21:09:56 pac kernel: nfsd: peername failed (err 107)!
<repeated 9 more times>
Jul 11 21:09:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
<repeated ~50 more times>
Jul 11 21:48:30 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
Jul 11 21:48:43 pac kernel: nfsd: peername failed (err 107)!
<repeated 3 more times>
Jul 11 21:53:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
Jul 11 22:39:05 pac rsnapshot[24727]: /usr/bin/rsnapshot -V -c /etc/rsnapshotData.conf daily: completed successfully
Jul 11 23:16:24 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
<EOF>


The difference is that it was able to recover once the rsnapshot jobs 
had completed; our other cluster jobs (daligner) are still running and 
the servers are responsive.

We are going to let this large job finish with the NFS/TCP before I file 
a bug report with CentOS... but I thought this extra info might be 
helpful in troubleshooting.  I found the CentOS bug report page and 
there are several options for the "Category", including "rdma" or 
"kernel"... which do you think I should file it under?

Thanks,

-- 
Chandler
Arizona Genomics Institute


* Re: RDMA connection closed and not re-opened
  2018-07-12 22:44         ` admin
@ 2018-07-13 14:36           ` Chuck Lever
  2018-07-13 22:32             ` admin
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2018-07-13 14:36 UTC (permalink / raw)
  To: admin; +Cc: Linux NFS Mailing List



> On Jul 12, 2018, at 6:44 PM, admin@genome.arizona.edu wrote:
> 
>> Thanks, we will see how it goes with the latest kernel, and if
>> there are still problems I'll look into filing a bug report with
>> CentOS or something.
> 
> So, the latest CentOS kernel, 2.6.32-696.30.1, has not helped yet.
> In the meantime we have reverted to using NFS/TCP over the gigabit
> Ethernet link, which creates a bottleneck for the full processing of
> our cluster, but at least hasn't crashed yet.

You should be able to mount using "proto=tcp" with your mlx4 cards.
That avoids the use of NFS/RDMA but would enable the use of the
higher bandwidth network fabric.
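
For example, something along these lines on a client, using the
server's IPoIB address from your logs (the export path is just
illustrative), and then confirming what was negotiated:

  mount -t nfs -o vers=3,proto=tcp 10.10.11.10:/working /mnt/working
  grep 10.10.11.10 /proc/mounts    # should show proto=tcp, not proto=rdma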


> I did notice that the hangups have all been after 8 PM in each
> occurrence.  Each night at 8 PM, the NFS server acts as an NFS
> client and runs a couple of rsnapshot jobs which back up to a
> different NFS server.

Can you diagram your full configuration during the backup? Does the
NFS client mount the NFS server on this same host? Does it use
NFS/RDMA or can it use ssh instead of NFS?


> Even with NFS/TCP, the NFS server became unresponsive after 8 PM
> when the rsnapshot jobs were running.  I can see in the system
> messages the same sort of errors with Ganglia we were seeing, as
> well as rsyslog dropping messages related to the ganglia process, as
> well as nfsd peername failed (err 107).  For example,
> 
> Jul 11 20:07:31 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <repeated 13 times>
> Jul 11 20:21:31 pac /usr/sbin/gmetad[3582]: RRD_update (/var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd): /var/lib/ganglia/rrds/Pac/n003.genome.arizona.edu/load_one.rrd: illegal attempt to update using time 1531365691 when last update time is 1531365691 (minimum one second step)
> <many messages like this from all the nodes n001-n009>
> Jul 11 20:21:31 pac rsyslogd-2177: imuxsock begins to drop messages from pid 3582 due to rate-limiting
> Jul 11 20:22:25 pac rsyslogd-2177: imuxsock lost 116 messages from pid 3582 due to rate-limiting
> Jul 11 20:22:25 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <bunch more of these and RRD_update errors>
> Jul 11 20:41:54 pac rsyslogd-2177: imuxsock begins to drop messages from pid 3582 due to rate-limiting
> Jul 11 20:42:34 pac rsyslogd-2177: imuxsock lost 116 messages from pid 3582 due to rate-limiting
> Jul 11 21:09:56 pac kernel: nfsd: peername failed (err 107)!
> <repeated 9 more times>
> Jul 11 21:09:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <repeated ~50 more times>
> Jul 11 21:48:30 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> Jul 11 21:48:43 pac kernel: nfsd: peername failed (err 107)!
> <repeated 3 more times>
> Jul 11 21:53:59 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> Jul 11 22:39:05 pac rsnapshot[24727]: /usr/bin/rsnapshot -V -c /etc/rsnapshotData.conf daily: completed successfully
> Jul 11 23:16:24 pac /usr/sbin/gmetad[3582]: poll() timeout from source 0 for [Pac] data source after 0 bytes read
> <EOF>
> 
> 
> The difference is that it was able to recover once the rsnapshot
> jobs had completed; our other cluster jobs (daligner) are still
> running and the servers are responsive.

That does describe a possible server overload. Using only GbE could
slow things down enough to avoid catastrophic deadlock.


> We are going to let this large job finish with the NFS/TCP before I
> file a bug report with CentOS... but I thought this extra info might
> be helpful in troubleshooting.  I found the CentOS bug report page
> and there are several options for the "Category", including "rdma"
> or "kernel"... which do you think I should file it under?

I'm not familiar with the CentOS bug database. If there's an "NFS"
category, I would go with that.

Before filing, you should search that database to see if there are
similar bugs. Simply Googling "peername failed!" brings up several
NFSD related entries right at the top of the list that appear
similar to your circumstance (and there is no mention of NFS/RDMA).


--
Chuck Lever





* Re: RDMA connection closed and not re-opened
  2018-07-13 14:36           ` Chuck Lever
@ 2018-07-13 22:32             ` admin
  2018-07-14 14:37               ` Chuck Lever
  0 siblings, 1 reply; 14+ messages in thread
From: admin @ 2018-07-13 22:32 UTC (permalink / raw)
  To: Linux NFS Mailing List

Chuck Lever wrote on 07/13/2018 07:36 AM:
> You should be able to mount using "proto=tcp" with your mlx4 cards.
> That avoids the use of NFS/RDMA but would enable the use of the
> higher bandwidth network fabric.
Thanks I could definitely try that.  IPoIB has its own set of issues 
though, but I can cross that bridge when I get to it...

> Can you diagram your full configuration during the backup?
The main server in relation to this issue, which is named "pac" in the 
log files, has several local storage devices which are exported over the 
Ethernet and Infiniband interfaces.  In addition, it has several other 
mounts over Ethernet to some of our other NFS servers.  The 
rsnapshot/backup job uses rsync to read from the local storage and sends 
to the NFS mounts to another server using standard 1Gb ethernet and TCP 
protocol.  So the answer to your second question,
> Does the
> NFS client mount the NFS server on this same host?
I believe is "yes"

> Does it use
> NFS/RDMA or can it use ssh instead of NFS?
Currently it just uses NFS/TCP over the 1Gb Ethernet link.  rsnapshot 
does have the ability to use SSH.

> I'm not familiar with the CentOS bug database. If there's an "NFS"
> category, I would go with that.
There is no "NFS" category, only nfs-utils, nfs-utils-lib, and 
nfs4-acl-tools.  So I'm guessing if we want to report against NFS then 
"kernel" would be the category?

> Before filing, you should search that database to see if there are
> similar bugs. Simply Googling "peername failed!" brings up several
> NFSD related entries right at the top of the list that appear
> similar to your circumstance (and there is no mention of NFS/RDMA).
Thanks I will be checking that out


* Re: RDMA connection closed and not re-opened
  2018-07-13 22:32             ` admin
@ 2018-07-14 14:37               ` Chuck Lever
  2018-07-18  0:27                 ` admin
  2018-08-08 18:54                 ` admin
  0 siblings, 2 replies; 14+ messages in thread
From: Chuck Lever @ 2018-07-14 14:37 UTC (permalink / raw)
  To: admin; +Cc: Linux NFS Mailing List



> On Jul 13, 2018, at 6:32 PM, admin@genome.arizona.edu wrote:
> 
> Chuck Lever wrote on 07/13/2018 07:36 AM:
>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>> That avoids the use of NFS/RDMA but would enable the use of the
>> higher bandwidth network fabric.
> Thanks I could definitely try that.  IPoIB has its own set of issues
> though, but I can cross that bridge when I get to it...

Stick with connected mode and keep rsize and wsize smaller
than the IPoIB MTU, which can be set as high as 65KB.
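
A rough sketch of that combination on a client (the ifcfg path follows
the usual RHEL/CentOS convention, and the exact values are only a
starting point):

  # /etc/sysconfig/network-scripts/ifcfg-ib0 (illustrative)
  TYPE=InfiniBand
  CONNECTED_MODE=yes
  MTU=65520

  # keep rsize/wsize below the IPoIB MTU
  mount -t nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768 10.10.11.10:/working /mnt/working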


>> Can you diagram your full configuration during the backup?
> The main server in relation to this issue, which is named "pac" in
> the log files, has several local storage devices which are exported
> over the Ethernet and InfiniBand interfaces.  In addition, it has
> several other mounts over Ethernet to some of our other NFS servers.
> The rsnapshot/backup job uses rsync to read from the local storage
> and write to NFS mounts on another server using standard 1Gb
> Ethernet and the TCP protocol.  So the answer to your second
> question,
>> Does the
>> NFS client mount the NFS server on this same host?
> I believe is "yes"

I wasn't entirely clear: Does pac mount itself?

I don't know what the workload is like on this "self mount" but
we recommend not to use this kind of configuration, because it
is prone to deadlock with a significant workload.
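
For what it's worth, a quick way to double-check on pac itself (just a
sketch using standard tools):

  # NFS mounts and the server each one points at
  awk '$3 ~ /^nfs4?$/ {print $1, $2}' /proc/mounts
  # pac's own addresses; any overlap with the servers above would be a self mount
  ip -o -4 addr show | awk '{print $4}'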


>> Does it use
>> NFS/RDMA or can it use ssh instead of NFS?
> Currently it just uses NFS/TCP over the 1Gb Ethernet link.
> rsnapshot does have the ability to use SSH.

I was thinking that it might be better to use ssh and avoid NFS
for the backup workload, in order to avoid pac mounting itself.
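
One possible shape for that, purely as a sketch: run rsnapshot on the
backup host and have it pull from pac with rsync over ssh, so pac never
needs an NFS mount for the backup at all.  Hostnames and paths below
are illustrative, and rsnapshot.conf fields must be separated by tabs:

  # /etc/rsnapshot.conf on the backup host
  snapshot_root   /backups/snapshots/
  backup          root@pac:/data/        pac/data/
  backup          root@pac:/projects/    pac/projects/

This assumes key-based ssh from the backup host to pac.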


>> I'm not familiar with the CentOS bug database. If there's an "NFS"
>> category, I would go with that.
> There is no "NFS" category, only nfs-utils, nfs-utils-lib, and =
nfs4-acl-tools.  So I'm guessing if we want to report against NFS then =
"kernel" would be the category?

In the "kernel" category, there might be an "NFS or NFSD"
subcomponent.


>> Before filing, you should search that database to see if there are
>> similar bugs. Simply Googling "peername failed!" brings up several
>> NFSD related entries right at the top of the list that appear
>> similar to your circumstance (and there is no mention of NFS/RDMA).
> Thanks I will be checking that out

--
Chuck Lever





* Re: RDMA connection closed and not re-opened
  2018-07-14 14:37               ` Chuck Lever
@ 2018-07-18  0:27                 ` admin
  2018-08-08 18:54                 ` admin
  1 sibling, 0 replies; 14+ messages in thread
From: admin @ 2018-07-18  0:27 UTC (permalink / raw)
  To: Linux NFS Mailing List

Chuck Lever wrote on 07/14/2018 07:37 AM:
> I wasn't entirely clear: Does pac mount itself?
No, why would we do that?  Do people do that?  Here is a listing of 
relevant mounts on our server pac:

/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)
150.x.x.116:/wing on /wing type nfs (rw,addr=150.x.x.116)
150.x.x.116:/archive on /archive type nfs (rw,addr=150.x.x.116)
150.x.x.116:/backups on /backups type nfs (rw,addr=150.x.x.116)

The backup jobs read from the mounted local disks /data and /projects 
and write to the remote NFS server at /backups and /archive.  I have 
noticed, in the log files of our other servers which mount the pac 
exports, "nfs: server pac not responding, timed out" messages, which 
all show up after 8 PM when the backup jobs are running.

And here is listing of our pac server exports:

/data	10.10.10.0/24(rw,no_root_squash,async)
/data	10.10.11.0/24(rw,no_root_squash,async)
/data	150.x.x.192/27(rw,no_root_squash,async)
/data	150.x.x.64/26(rw,no_root_squash,async)
/home	10.10.10.0/24(rw,no_root_squash,async)
/home	10.10.11.0/24(rw,no_root_squash,async)
/opt	10.10.10.0/24(rw,no_root_squash,async)
/opt	10.10.11.0/24(rw,no_root_squash,async)
/projects	10.10.10.0/24(rw,no_root_squash,async)
/projects	10.10.11.0/24(rw,no_root_squash,async)
/projects	150.x.x.192/27(rw,no_root_squash,async)
/projects	150.x.x.64/26(rw,no_root_squash,async)
/tools	10.10.10.0/24(rw,no_root_squash,async)
/tools	10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine     10.10.11.10/24(rw,no_root_squash,async)
/usr/local	10.10.10.10/24(rw,no_root_squash,async)
/usr/local	10.10.11.10/24(rw,no_root_squash,async)
/working	10.10.10.0/24(rw,no_root_squash,async)
/working	10.10.11.0/24(rw,no_root_squash,async)
/working	150.x.x.192/27(rw,no_root_squash,async)
/working	150.x.x.64/26(rw,no_root_squash,async)
/newwing	10.10.10.0/24(rw,no_root_squash,async)
/newwing	10.10.11.0/24(rw,no_root_squash,async)
/newwing	150.x.x.192/27(rw,no_root_squash,async)
/newwing	150.x.x.64/26(rw,no_root_squash,async)

The 10.10.10.0/24 network is 1GbE and the 10.10.11.0/24 is the 
InfiniBand.  The other networks are also 1GbE.  Our cluster nodes will 
normally mount all of these using the InfiniBand with RDMA, and the 
computation jobs will normally be using /working, which will see the 
most reading/writing, but /newwing, /projects, and /data are also used.

It does continue to seem to be a bug in NFS.  It somehow seems to be 
triggered when the NFS server runs the backup job.  I just tried it now, 
and about 20 minutes into the backup job the server stopped responding 
to some things; for example, iotop froze.  top remained active and I 
could see the load on the server going up, but only to about 22/24 and 
still about 95% idle CPU time.  I also noticed the "nfs: server pac not 
responding, timed out" messages on our other servers.  After about 10 
minutes the server became responsive again and the load dropped down to 
3/24 while the backup job continued.

Perhaps it could be mitigated if I change the backup job to use SSH 
instead of NFS.  I'll try that and see if it helps; then once our job 
has completed I can try going back to RDMA to see if it still happens...




* Re: RDMA connection closed and not re-opened
  2018-07-14 14:37               ` Chuck Lever
  2018-07-18  0:27                 ` admin
@ 2018-08-08 18:54                 ` admin
  2018-08-08 19:01                   ` Chuck Lever
  1 sibling, 1 reply; 14+ messages in thread
From: admin @ 2018-08-08 18:54 UTC (permalink / raw)
  To: Linux NFS Mailing List

Chuck Lever wrote on 07/14/2018 07:37 AM:
>> On Jul 13, 2018, at 6:32 PM, admin@genome.arizona.edu wrote:
>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>>> That avoids the use of NFS/RDMA but would enable the use of the
>>> higher bandwidth network fabric.
>> Thanks I could definitely try that.  IPoIB has its own set of issues though, but I can cross that bridge when I get to it...
> Stick with connected mode and keep rsize and wsize smaller
> than the IPoIB MTU, which can be set as high as 65KB.
We are running in this setup, so far so good... however the rsize/wsize 
were much greater than the IPoIB MTU, and it is probably causing these 
"page allocation failures" which fortunately have not been fatal; our 
computation is still running.  In the ifcfg file for the IPoIB 
interface, the MTU is set to 65520, which was the recommended maximum 
from the Red Hat manual.  So should rsize/wsize be set to 65519, or is 
it better to pick another value that is a multiple of 1024 or something?
Thanks


* Re: RDMA connection closed and not re-opened
  2018-08-08 18:54                 ` admin
@ 2018-08-08 19:01                   ` Chuck Lever
  2018-08-08 19:11                     ` admin
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2018-08-08 19:01 UTC (permalink / raw)
  To: admin; +Cc: Linux NFS Mailing List



> On Aug 8, 2018, at 2:54 PM, admin@genome.arizona.edu wrote:
> 
> Chuck Lever wrote on 07/14/2018 07:37 AM:
>>> On Jul 13, 2018, at 6:32 PM, admin@genome.arizona.edu wrote:
>>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>>> You should be able to mount using "proto=tcp" with your mlx4
>>>> cards.
>>>> That avoids the use of NFS/RDMA but would enable the use of the
>>>> higher bandwidth network fabric.
>>> Thanks I could definitely try that.  IPoIB has its own set of
>>> issues though, but I can cross that bridge when I get to it...
>> Stick with connected mode and keep rsize and wsize smaller
>> than the IPoIB MTU, which can be set as high as 65KB.
> We are running in this setup, so far so good... however the
> rsize/wsize were much greater than the IPoIB MTU, and it is probably
> causing these "page allocation failures" which fortunately have not
> been fatal; our computation is still running.  In the ifcfg file for
> the IPoIB interface, the MTU is set to 65520, which was the
> recommended maximum from the Red Hat manual.  So should rsize/wsize
> be set to 65519, or is it better to pick another value that is a
> multiple of 1024 or something?

The r/wsize settings have to be a power of two. The next power of
two smaller than 65520 is 32768. Try "rsize=32768,wsize=32768".


--
Chuck Lever





* Re: RDMA connection closed and not re-opened
  2018-08-08 19:01                   ` Chuck Lever
@ 2018-08-08 19:11                     ` admin
  2018-08-08 19:18                       ` Chuck Lever
  0 siblings, 1 reply; 14+ messages in thread
From: admin @ 2018-08-08 19:11 UTC (permalink / raw)
  To: Linux NFS Mailing List

Chuck Lever wrote on 08/08/2018 12:01 PM:
>> On Aug 8, 2018, at 2:54 PM, admin@genome.arizona.edu wrote:
>> Chuck Lever wrote on 07/14/2018 07:37 AM:
>>>> On Jul 13, 2018, at 6:32 PM, admin@genome.arizona.edu wrote:
>>>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>>>> You should be able to mount using "proto=tcp" with your mlx4 cards.
>>>>> That avoids the use of NFS/RDMA but would enable the use of the
>>>>> higher bandwidth network fabric.
>>>> Thanks I could definitely try that.  IPoIB has its own set of issues though, but I can cross that bridge when I get to it...
>>> Stick with connected mode and keep rsize and wsize smaller
>>> than the IPoIB MTU, which can be set as high as 65KB.
>> We are running in this setup, so far so good... however the rsize/wsize were much greater than the IPoIB MTU, and it is probably causing these "page allocation failures" which fortunately have not been fatal; our computation is still running.  In the ifcfg file for the IPoIB interface, the MTU is set to 65520, which was the recommended maximum from the Red Hat manual.  So should rsize/wsize be set to 65519, or is it better to pick another value that is a multiple of 1024 or something?
> 
> The r/wsize settings have to be a power of two. The next power of
> two smaller than 65520 is 32768. Try "rsize=32768,wsize=32768".

Thanks but what is the reason for that?  After googling around a while 
for rsize/wsize settings, I finally found in the nfs manual page (of all 
places!!) that "If a specified value is within the supported range but 
not a multiple of 1024, it is rounded down to the nearest multiple of 
1024."  So it sounds like we could use 63KiB or 64512.


* Re: RDMA connection closed and not re-opened
  2018-08-08 19:11                     ` admin
@ 2018-08-08 19:18                       ` Chuck Lever
  2018-08-08 23:11                         ` Chandler
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2018-08-08 19:18 UTC (permalink / raw)
  To: admin; +Cc: Linux NFS Mailing List



> On Aug 8, 2018, at 3:11 PM, admin@genome.arizona.edu wrote:
> 
> Chuck Lever wrote on 08/08/2018 12:01 PM:
>>> On Aug 8, 2018, at 2:54 PM, admin@genome.arizona.edu wrote:
>>> Chuck Lever wrote on 07/14/2018 07:37 AM:
>>>>> On Jul 13, 2018, at 6:32 PM, admin@genome.arizona.edu wrote:
>>>>> Chuck Lever wrote on 07/13/2018 07:36 AM:
>>>>>> You should be able to mount using "proto=tcp" with your mlx4
>>>>>> cards.
>>>>>> That avoids the use of NFS/RDMA but would enable the use of the
>>>>>> higher bandwidth network fabric.
>>>>> Thanks I could definitely try that.  IPoIB has its own set of
>>>>> issues though, but I can cross that bridge when I get to it...
>>>> Stick with connected mode and keep rsize and wsize smaller
>>>> than the IPoIB MTU, which can be set as high as 65KB.
>>> We are running in this setup, so far so good... however the
>>> rsize/wsize were much greater than the IPoIB MTU, and it is
>>> probably causing these "page allocation failures" which fortunately
>>> have not been fatal; our computation is still running.  In the
>>> ifcfg file for the IPoIB interface, the MTU is set to 65520, which
>>> was the recommended maximum from the Red Hat manual.  So should
>>> rsize/wsize be set to 65519, or is it better to pick another value
>>> that is a multiple of 1024 or something?
>> The r/wsize settings have to be a power of two. The next power of
>> two smaller than 65520 is 32768. Try "rsize=32768,wsize=32768".
> 
> Thanks but what is the reason for that?  After googling around a
> while for rsize/wsize settings, I finally found in the nfs manual
> page (of all places!!) that "If a specified value is within the
> supported range but not a multiple of 1024, it is rounded down to
> the nearest multiple of 1024."  So it sounds like we could use 63KiB
> or 64512.

I just tried this:

[root@manet ~]# mount -o vers=3,rsize=65520,wsize=65520 klimt:/export/tmp/ /mnt
[root@manet ~]# grep klimt /proc/mounts
klimt:/export/tmp/ /mnt nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.55,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.1.55 0 0

Looks like the man page is wrong.

--
Chuck Lever





* RDMA connection closed and not re-opened
  2018-08-08 19:18                       ` Chuck Lever
@ 2018-08-08 23:11                         ` Chandler
  0 siblings, 0 replies; 14+ messages in thread
From: Chandler @ 2018-08-08 23:11 UTC (permalink / raw)
  To: Linux NFS Mailing List

Chuck Lever wrote on 08/08/2018 12:18 PM:
> Looks like the man page is wrong.

Right you are!

