* Performance/stability problems with nfs shares
@ 2013-08-02  6:04 Dawid Stawiarski
  2013-08-02 13:12 ` Jeff Layton
  0 siblings, 1 reply; 5+ messages in thread
From: Dawid Stawiarski @ 2013-08-02  6:04 UTC (permalink / raw)
  To: linux-nfs

Hi,

We observe performance issues on Blade Linux NFS clients (Ubuntu 12.04 with kernel 3.8.0-23-generic).
The blade nodes are used in a shared hosting environment, and NFS is used to access clients' data from Nexenta storage (mostly small PHP files and/or images). A single node runs about 300-400 Apache instances.
We use 10G on the whole path from nodes to storage, with jumbo frames enabled. We didn't see any drops on
network interfaces (neither on the nodes nor on the switches).
Once in a while, Apache processes accessing data on an NFS share get stuck on I/O (D state; stack trace below).
We've already tried different combinations of mount options and tuning of sysctls and the sunrpc module (we also tried NFSv4 and UDP transport; these only made things worse, and without local locks we also had lots of problems).
Hangs seem to happen under heavy concurrent operations (in the production env); unfortunately we didn't manage
to reproduce them with benchmark utilities. When the number of nodes is decreased, the problem happens more frequently (in that case we have about 600 Apache instances per node). We didn't see any problems on the storage itself while one of the shares hung (the CPU usage and load looked as usual).

1. client mount options we've tested:
noatime,nodiratime,noacl,nodev,nosuid,rsize=8192,wsize=8192,intr,bg,timeo=20,nfsvers=3,nolock
noatime,nodiratime,noacl,nodev,nosuid,rsize=8192,wsize=8192,intr,bg,acregmin=6,timeo=20,nfsvers=3,nolock

noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=6,timeo=20,nfsvers=3,nolock
noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=100,nfsvers=3,nolock
noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=600,nfsvers=3,nolock

noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=20,nfsvers=4,nolock

2. linux sysctl:
net.ipv4.tcp_timestamps = 0
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_slow_start_after_idle = 0

3. linux module option:
options sunrpc tcp_slot_table_entries=128


With nfs timeout=2s we observed a huge loadavg (1000 or more) and lots of processes in "D" state waiting in
the function "rpc_wait_bit_killable". Everything "worked", but insanely slowly. For example, `find` on the mountpoint printed ~1 line per second. The "avg RTT" and "avg exe" stats from nfsiostat increased to 500-800ms.

At first, we had 8 mounts from a single storage server (so basically only one TCP connection was used).
However, we've also tried adding 8 virtual IPs to the storage and using a separate IP to connect to
each share, to distribute traffic among more TCP connections. At the same time we
set the NFS client timeout to 60s (the default). In this case we observed a permanent hang
on a random (single) mountpoint, and a loadavg of about 150. Other mountpoints from the same storage worked correctly. There was no data traffic to the hung mountpoint's IP; only a couple of retransmissions (every 60 secs). After a TCP reset and reconnect (this happens after a couple of minutes), everything starts to work correctly again.

We have now decreased the timeout to 10s.

/proc/PID/stack of a hung process (we have hundreds of these):
[<ffffffffa00eb019>] rpc_wait_bit_killable+0x39/0x90 [sunrpc]
[<ffffffffa00ec0fb>] __rpc_execute+0x15b/0x1b0 [sunrpc]
[<ffffffffa00ec87f>] rpc_execute+0x4f/0xb0 [sunrpc]
[<ffffffffa00e45a5>] rpc_run_task+0x75/0x90 [sunrpc]
[<ffffffffa00e46c3>] rpc_call_sync+0x43/0xa0 [sunrpc]
[<ffffffffa02595eb>] nfs3_rpc_wrapper.constprop.10+0x6b/0xb0 [nfsv3]
[<ffffffffa025a4ae>] nfs3_proc_getattr+0x3e/0x50 [nfsv3]
[<ffffffffa01452fd>] __nfs_revalidate_inode+0x8d/0x120 [nfs]
[<ffffffffa0141313>] nfs_lookup_revalidate+0x353/0x3a0 [nfs]
[<ffffffff811a79b3>] lookup_fast+0x173/0x230
[<ffffffff811a7cc6>] do_last+0x106/0x820
[<ffffffff811aa333>] path_openat+0xb3/0x4d0
[<ffffffff811ab152>] do_filp_open+0x42/0xa0
[<ffffffff8119adaa>] do_sys_open+0xfa/0x250
[<ffffffff811ed8cb>] compat_sys_open+0x1b/0x20
[<ffffffff816fc62c>] sysenter_dispatch+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
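
A loop like the following is how such stacks can be collected for every D-state process (a sketch; it assumes /proc/PID/stack is available, which needs a kernel built with stack tracing and root privileges to read):

```shell
# Dump the kernel stack of every task in uninterruptible sleep (D state).
# Needs root to read /proc/<pid>/stack on most systems.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/' | while read -r pid stat comm; do
    echo "=== PID $pid ($comm, state $stat) ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```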

nfsiostat on a problematic "slow" share (other shares from the SAME storage, but on separate TCP connection work correctly):
10.254.38.115:/volumes/DATA1/10/5 mounted on /home/10/5:

   op/s         rpc bklog
 420.50            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  1.000          30.736          30.736        0 (0.0%)          13.500         867.700
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.600           0.522           0.870        0 (0.0%)           0.667         872.333

mount options used on node:
10.254.38.115:/volumes/DATA1/10/5 /home/10/5 nfs rw,nosuid,nodev,noatime,nodiratime,vers=3,rsize=131072,wsize=131072,namlen=255,acregmin=10,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.254.38.115,mountvers=3,mountport=63856,mountproto=udp,local_lock=all,addr=10.254.38.115 0 0


netstat:
- very slow access:
tcp        0      0 10.254.39.72:692        10.254.38.115:2049      ESTABLISHED -                off (0.00/0/0)

- completely not responding:
tcp        0 132902 10.254.39.74:719        10.254.38.115:2049      ESTABLISHED -                on (43.21/3/0)
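
The timer field can also be watched live with ss from iproute2 (a sketch; 2049 is the NFS port):

```shell
# Show established connections to port 2049 with TCP internals and timers.
# 'timer:(on,...)' together with a growing Send-Q means our segments are
# not being ACKed, i.e. the stall is at the TCP level, not just RPC.
ss -tino state established '( dport = :2049 )'
```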

client software:
- util-linux 2.20.1-1ubuntu3
- nfs-common 1.2.5-3ubuntu3.1
- libevent 2.0.16-stable-1

Can anyone help us investigate the problem, or does anyone have suggestions on what to try/check? Any help would be appreciated.

cheers,
Dawid





^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Performance/stability problems with nfs shares
  2013-08-02  6:04 Performance/stability problems with nfs shares Dawid Stawiarski
@ 2013-08-02 13:12 ` Jeff Layton
  2013-08-02 14:37   ` Dawid Stawiarski
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2013-08-02 13:12 UTC (permalink / raw)
  To: Dawid Stawiarski; +Cc: linux-nfs

On Fri, 02 Aug 2013 08:04:35 +0200
"Dawid Stawiarski" <neeo@xl.wp.pl> wrote:

> Hi,
> 
> we observe performance issues on Blade Linux NFS clients (Ubuntu 12.04 with kernel 3.8.0-23-generic).
> Blade nodes are used in a shared hosting environment, and NFS is used to access client's data from Nexenta Storage (mostly small php files and/or images). Single node is running about 300-400 apache instances.
> We use 10G on the whole path from nodes to storage with jumbo frames enabled. We didn't see any drops on
> network interfaces (on nodes nor switches).
> Once in a while, Apache processes accessing data on an NFS share get stuck on I/O (D state - stack trace below).
> We've already tried different combinations of mount options and tuning sysctls and sunrpc module (we also tried NFSv4 and UDP transport - these only made things worse; without the local locks we had also lots of problems).
> Hangs seem to happen under heavy concurrent operations (in production env); unfortunately we didn't manage
> to reproduce it with benchmark utilities. When the number of nodes is decreased the problem happens more frequently (in this case we have about 600 apache instances per node). We didn't see any problems on the storage itself when one of the shares hangs (the cpu usage and load look as usual).
> 
> 1. client mount options we've tested:
> noatime,nodiratime,noacl,nodev,nosuid,rsize=8192,wsize=8192,intr,bg,timeo=20,nfsvers=3,nolock
> noatime,nodiratime,noacl,nodev,nosuid,rsize=8192,wsize=8192,intr,bg,acregmin=6,timeo=20,nfsvers=3,nolock
> 
> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=6,timeo=20,nfsvers=3,nolock
> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=100,nfsvers=3,nolock
> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=600,nfsvers=3,nolock
> 
> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=20,nfsvers=4,nolock
> 
> 2. linux sysctl:
> net.ipv4.tcp_timestamps = 0
> net.core.netdev_max_backlog = 30000
> net.ipv4.tcp_mtu_probing = 1
> net.ipv4.tcp_slow_start_after_idle = 0
> net.ipv4.tcp_timestamps = 0
> 
> 3. linux module option:
> options sunrpc tcp_slot_table_entries=128
> 
> 
> With nfs timeout=2s we observed a huge loadavg (1000 or more) and lots of processes in "D" state waiting in
> the function "rpc_wait_bit_killable". Everything "worked", but insanely slowly. For example, `find` on the mountpoint printed ~1 line per second. "avg RTT" and "avg exe" stats from nfsiostat increased to 500-800ms.
> 

To be clear, you mean the "timeo=20" mounts? That's awfully low for a
TCP connection. With TCP, you typically don't want the client doing RPC
retransmits that frequently. You want to let the TCP layer handle it in
most cases.

> At first, we had 8 mounts from a single storage server (so basically only one TCP connection was used).
> However, we've also tried to add 8 virtual IPs to the storage, and use a separate IP to connect to
> every share to distribute traffic among more TCP connections. At the same time we've
> set nfs client timeout to 60s (the default). In this case we observed permanent hang
> on a random (single) mountpoint - and a loadavg of about 150. Other mountpoints from the same storage worked correctly. There was no data traffic to the hung mountpoint's IP; only a couple of retransmissions (every 60 secs). After a TCP reset and reconnect (this happens after a couple of minutes), everything starts to work correctly.
> 

Are those RPC or TCP retransmissions?
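
Both counters can be read straight out of /proc on the client (a sketch; the rpc file only exists once the NFS client module is loaded):

```shell
# TCP-layer retransmissions, host-wide (RetransSegs is the 12th counter
# after the "Tcp:" tag in /proc/net/snmp, i.e. field 13):
awk '$1 == "Tcp:" && $13 ~ /^[0-9]+$/ {print "TCP RetransSegs:", $13}' /proc/net/snmp

# RPC-layer retransmissions for the NFS client as a whole
# (format of the rpc line: calls, retransmits, authrefresh):
if [ -r /proc/net/rpc/nfs ]; then
    awk '$1 == "rpc" {print "RPC calls:", $2, "retrans:", $3}' /proc/net/rpc/nfs
fi
```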

> Now we decreased timeout to 10s.
> 
> /proc/PID/stack of a hung process (we have hundreds of these):
> [<ffffffffa00eb019>] rpc_wait_bit_killable+0x39/0x90 [sunrpc]
> [<ffffffffa00ec0fb>] __rpc_execute+0x15b/0x1b0 [sunrpc]
> [<ffffffffa00ec87f>] rpc_execute+0x4f/0xb0 [sunrpc]
> [<ffffffffa00e45a5>] rpc_run_task+0x75/0x90 [sunrpc]
> [<ffffffffa00e46c3>] rpc_call_sync+0x43/0xa0 [sunrpc]
> [<ffffffffa02595eb>] nfs3_rpc_wrapper.constprop.10+0x6b/0xb0 [nfsv3]
> [<ffffffffa025a4ae>] nfs3_proc_getattr+0x3e/0x50 [nfsv3]
> [<ffffffffa01452fd>] __nfs_revalidate_inode+0x8d/0x120 [nfs]
> [<ffffffffa0141313>] nfs_lookup_revalidate+0x353/0x3a0 [nfs]
> [<ffffffff811a79b3>] lookup_fast+0x173/0x230
> [<ffffffff811a7cc6>] do_last+0x106/0x820
> [<ffffffff811aa333>] path_openat+0xb3/0x4d0
> [<ffffffff811ab152>] do_filp_open+0x42/0xa0
> [<ffffffff8119adaa>] do_sys_open+0xfa/0x250
> [<ffffffff811ed8cb>] compat_sys_open+0x1b/0x20
> [<ffffffff816fc62c>] sysenter_dispatch+0x7/0x21
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> nfsiostat on a problematic "slow" share (other shares from the SAME storage, but on separate TCP connection work correctly):
> 10.254.38.115:/volumes/DATA1/10/5 mounted on /home/10/5:
> 
>    op/s         rpc bklog
>  420.50            0.00
> read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
>                   1.000          30.736          30.736        0 (0.0%)          13.500         867.700
> write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
>                   0.600           0.522           0.870        0 (0.0%)           0.667         872.333
> 
> mount options used on node:
> 10.254.38.115:/volumes/DATA1/10/5 /home/10/5 nfs rw,nosuid,nodev,noatime,nodiratime,vers=3,rsize=131072,wsize=131072,namlen=255,acregmin=10,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.254.38.115,mountvers=3,mountport=63856,mountproto=udp,local_lock=all,addr=10.254.38.115 0 0
> 
> 
> netstat:
> - very slow access:
> tcp        0      0 10.254.39.72:692        10.254.38.115:2049      ESTABLISHED -                off (0.00/0/0)
> 
> - completely not responding:
> tcp        0 132902 10.254.39.74:719        10.254.38.115:2049      ESTABLISHED -                on (43.21/3/0)
> 
> client software:
> - util-linux 2.20.1-1ubuntu3
> - nfs-common 1.2.5-3ubuntu3.1
> - libevent 2.0.16-stable-1
> 
> Can anyone help us investigate the problem, or does anyone have suggestions on what to try/check? Any help would be appreciated.
> 
> cheers,
> Dawid
> 
> 

Typically, a stack trace like that indicates that the process is
waiting for the server to respond. The first thing I would do would be
to ascertain whether the server is actually responding to these
requests. 

-- 
Jeff Layton <jlayton@redhat.com>


* Re: Performance/stability problems with nfs shares
  2013-08-02 13:12 ` Jeff Layton
@ 2013-08-02 14:37   ` Dawid Stawiarski
  2013-08-02 15:47     ` J. Bruce Fields
  0 siblings, 1 reply; 5+ messages in thread
From: Dawid Stawiarski @ 2013-08-02 14:37 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-nfs

[-- Attachment #1: Type: text/plain, Size: 7025 bytes --]

On 02.08.2013 15:12, Jeff Layton wrote:
> On Fri, 02 Aug 2013 08:04:35 +0200
> "Dawid Stawiarski" <neeo@xl.wp.pl> wrote:
>
>> Hi,
>>
>> we observe performance issues on Blade Linux NFS clients (Ubuntu 12.04 with kernel 3.8.0-23-generic).
>> Blade nodes are used in a shared hosting environment, and NFS is used to access client's data from Nexenta Storage (mostly small php files and/or images). Single node is running about 300-400 apache instances.
>> We use 10G on the whole path from nodes to storage with jumbo frames enabled. We didn't see any drops on
>> network interfaces (on nodes nor switches).
>> Once in a while, Apache processes accessing data on an NFS share get stuck on I/O (D state - stack trace below).
>> We've already tried different combinations of mount options and tuning sysctls and sunrpc module (we also tried NFSv4 and UDP transport - these only made things worse; without the local locks we had also lots of problems).
>> Hangs seem to happen under heavy concurrent operations (in production env); unfortunately we didn't manage
>> to reproduce it with benchmark utilities. When the number of nodes is decreased the problem happens more frequently (in this case we have about 600 apache instances per node). We didn't see any problems on the storage itself when one of the shares hangs (the cpu usage and load look as usual).
>>
>> 1. client mount options we've tested:
>> noatime,nodiratime,noacl,nodev,nosuid,rsize=8192,wsize=8192,intr,bg,timeo=20,nfsvers=3,nolock
>> noatime,nodiratime,noacl,nodev,nosuid,rsize=8192,wsize=8192,intr,bg,acregmin=6,timeo=20,nfsvers=3,nolock
>>
>> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=6,timeo=20,nfsvers=3,nolock
>> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=100,nfsvers=3,nolock
>> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=600,nfsvers=3,nolock
>>
>> noatime,nodiratime,noacl,nodev,nosuid,rsize=1048576,wsize=1048576,intr,bg,acregmin=10,timeo=20,nfsvers=4,nolock
>>
>> 2. linux sysctl:
>> net.ipv4.tcp_timestamps = 0
>> net.core.netdev_max_backlog = 30000
>> net.ipv4.tcp_mtu_probing = 1
>> net.ipv4.tcp_slow_start_after_idle = 0
>> net.ipv4.tcp_timestamps = 0
>>
>> 3. linux module option:
>> options sunrpc tcp_slot_table_entries=128
>>
>>
>> With nfs timeout=2s we observed a huge loadavg (1000 or more) and lots of processes in "D" state waiting in
>> the function "rpc_wait_bit_killable". Everything "worked", but insanely slowly. For example, `find` on the mountpoint printed ~1 line per second. "avg RTT" and "avg exe" stats from nfsiostat increased to 500-800ms.
>>
>
> To be clear, you mean the "timeo=20" mounts? That's awfully low for a
> TCP connection. With TCP, you typically don't want the client doing RPC
> retransmits that frequently. You want to let the TCP layer handle it in
> most cases.

Yes, we've tested timeo=20, 100 and 600 (the default). The symptoms 
change (the share either hangs completely or is terribly slow), but the 
problem exists with all the values.


>> At first, we had 8 mounts from a single storage server (so basically only one TCP connection was used).
>> However, we've also tried to add 8 virtual IPs to the storage, and use a separate IP to connect to
>> every share to distribute traffic among more TCP connections. At the same time we've
>> set nfs client timeout to 60s (the default). In this case we observed permanent hang
>> on a random (single) mountpoint - and a loadavg of about 150. Other mountpoints from the same storage worked correctly. There was no data traffic to the hung mountpoint's IP; only a couple of retransmissions (every 60 secs). After a TCP reset and reconnect (this happens after a couple of minutes), everything starts to work correctly.
>>
>
> Are those RPC or TCP retransmissions?

I believe these were TCP retransmits (but RPC ones also happen, and can 
be seen in mountstats).
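
For the record, the per-op RPC retransmit counts are derivable from /proc/self/mountstats: on each per-op line, the second field is calls and the third is total transmissions, so their difference is the retransmit count. A quick filter (a sketch):

```shell
# Print only the ops that have actually been retransmitted, per mount.
# Per-op lines look like "GETATTR: <ops> <trans> <timeouts> ...", so
# trans - ops is the number of RPC-level retransmits for that op.
awk '/^device / {dev=$2}
     $1 ~ /^[A-Z]+:$/ && $3 > $2 {print dev, $1, "retrans:", $3 - $2}' /proc/self/mountstats
```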


>> Now we decreased timeout to 10s.
>>
>> /proc/PID/stack of a hung process (we have hundreds of these):
>> [<ffffffffa00eb019>] rpc_wait_bit_killable+0x39/0x90 [sunrpc]
>> [<ffffffffa00ec0fb>] __rpc_execute+0x15b/0x1b0 [sunrpc]
>> [<ffffffffa00ec87f>] rpc_execute+0x4f/0xb0 [sunrpc]
>> [<ffffffffa00e45a5>] rpc_run_task+0x75/0x90 [sunrpc]
>> [<ffffffffa00e46c3>] rpc_call_sync+0x43/0xa0 [sunrpc]
>> [<ffffffffa02595eb>] nfs3_rpc_wrapper.constprop.10+0x6b/0xb0 [nfsv3]
>> [<ffffffffa025a4ae>] nfs3_proc_getattr+0x3e/0x50 [nfsv3]
>> [<ffffffffa01452fd>] __nfs_revalidate_inode+0x8d/0x120 [nfs]
>> [<ffffffffa0141313>] nfs_lookup_revalidate+0x353/0x3a0 [nfs]
>> [<ffffffff811a79b3>] lookup_fast+0x173/0x230
>> [<ffffffff811a7cc6>] do_last+0x106/0x820
>> [<ffffffff811aa333>] path_openat+0xb3/0x4d0
>> [<ffffffff811ab152>] do_filp_open+0x42/0xa0
>> [<ffffffff8119adaa>] do_sys_open+0xfa/0x250
>> [<ffffffff811ed8cb>] compat_sys_open+0x1b/0x20
>> [<ffffffff816fc62c>] sysenter_dispatch+0x7/0x21
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> nfsiostat on a problematic "slow" share (other shares from the SAME storage, but on separate TCP connection work correctly):
>> 10.254.38.115:/volumes/DATA1/10/5 mounted on /home/10/5:
>>
>>     op/s         rpc bklog
>>   420.50            0.00
>> read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
>>                    1.000          30.736          30.736        0 (0.0%)          13.500         867.700
>> write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
>>                    0.600           0.522           0.870        0 (0.0%)           0.667         872.333
>>
>> mount options used on node:
>> 10.254.38.115:/volumes/DATA1/10/5 /home/10/5 nfs rw,nosuid,nodev,noatime,nodiratime,vers=3,rsize=131072,wsize=131072,namlen=255,acregmin=10,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.254.38.115,mountvers=3,mountport=63856,mountproto=udp,local_lock=all,addr=10.254.38.115 0 0
>>
>>
>> netstat:
>> - very slow access:
>> tcp        0      0 10.254.39.72:692        10.254.38.115:2049      ESTABLISHED -                off (0.00/0/0)
>>
>> - completely not responding:
>> tcp        0 132902 10.254.39.74:719        10.254.38.115:2049      ESTABLISHED -                on (43.21/3/0)
>>
>> client software:
>> - util-linux 2.20.1-1ubuntu3
>> - nfs-common 1.2.5-3ubuntu3.1
>> - libevent 2.0.16-stable-1
>>
>> Can anyone help us investigate the problem, or does anyone have suggestions on what to try/check? Any help would be appreciated.
>>
>> cheers,
>> Dawid
>>
>>
>
> Typically, a stack trace like that indicates that the process is
> waiting for the server to respond. The first thing I would do would be
> to ascertain whether the server is actually responding to these
> requests.
>

The same share is accessible on the other nodes, so the problem involves 
only one of the nodes (completely random) at a time.


[-- Attachment #2: S/MIME cryptographic signature --]
[-- Type: application/pkcs7-signature, Size: 4231 bytes --]


* Re: Performance/stability problems with nfs shares
  2013-08-02 14:37   ` Dawid Stawiarski
@ 2013-08-02 15:47     ` J. Bruce Fields
  2013-08-04  7:29       ` Dawid Stawiarski
  0 siblings, 1 reply; 5+ messages in thread
From: J. Bruce Fields @ 2013-08-02 15:47 UTC (permalink / raw)
  To: Dawid Stawiarski; +Cc: Jeff Layton, linux-nfs

On Fri, Aug 02, 2013 at 04:37:57PM +0200, Dawid Stawiarski wrote:
> On 02.08.2013 15:12, Jeff Layton wrote:
> >Typically, a stack trace like that indicates that the process is
> >waiting for the server to respond. The first thing I would do would be
> >to ascertain whether the server is actually responding to these
> >requests.
> >
> 
> The same share is accessible on other nodes, so the problem involves
> only one of the nodes (completely random) at a time.

It's still conceivable that a server problem could cause it to stop
responding to calls only from a single client--it'd be useful if
possible to check a trace to see if that's what's happening.  If the
traffic is really huge then capturing and analyzing a good trace may be
difficult.
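
To keep such a capture manageable on a busy 10G link, it can be bounded both per-packet and in total. A sketch of the invocation (printed rather than executed here, since a real capture needs root; the interface name and server IP are taken from the report above):

```shell
# -s 256 keeps RPC/NFS headers but truncates payloads; -C/-W rotate
# through ten 100 MB files so the capture can't fill the disk.
SERVER=10.254.38.115
echo tcpdump -i eth0 -s 256 -C 100 -W 10 \
     -w /var/tmp/nfs-hang.pcap "host $SERVER and port 2049"
```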

--b.


* Re: Performance/stability problems with nfs shares
  2013-08-02 15:47     ` J. Bruce Fields
@ 2013-08-04  7:29       ` Dawid Stawiarski
  0 siblings, 0 replies; 5+ messages in thread
From: Dawid Stawiarski @ 2013-08-04  7:29 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Jeff Layton, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 1337 bytes --]

On 02.08.2013 17:47, J. Bruce Fields wrote:
> On Fri, Aug 02, 2013 at 04:37:57PM +0200, Dawid Stawiarski wrote:
>> On 02.08.2013 15:12, Jeff Layton wrote:
>>> Typically, a stack trace like that indicates that the process is
>>> waiting for the server to respond. The first thing I would do would be
>>> to ascertain whether the server is actually responding to these
>>> requests.
>>>
>>
>> The same share is accessible on other nodes, so the problem involves
>> only one of the nodes (completely random) at a time.
>
> It's still conceivable that a server problem could cause it to stop
> responding to calls only from a single client--it'd be useful if
> possible to check a trace to see if that's what's happening.  If the
> traffic is really huge then capturing and analyzing a good trace may be
> difficult.

We have almost one hundred NFS nodes, and they're all using the same 
shares with about the same volume of traffic - and only one of the nodes 
at a time has a problem, and only with one share (other shares from the 
SAME server work OK on the "failing" node) - so it's hard to believe 
it's the server that causes the problem. And yes - with that amount of 
traffic it's hard to make a meaningful trace on the server side (as 
noted before, the server is NexentaStor-based, not Linux).

Dawid


[-- Attachment #2: S/MIME cryptographic signature --]
[-- Type: application/pkcs7-signature, Size: 4231 bytes --]

