* NFS/UDP slow read, lost fragments
@ 2003-09-25 17:59 Robert L. Millner
  2003-09-25 20:22 ` Brian Mancuso
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Robert L. Millner @ 2003-09-25 17:59 UTC (permalink / raw)
  To: nfs

Hello,

The problem I am seeing is similar to what was posted by Larry Sendlosky
on Jun 27, "2.4.20-pre3 -> 2.4.21 : nfs client read performance broken"
though I have not done as thorough a drill-down into the nature of the
problem.

Somewhere between 2.4.19 and 2.4.20, NFS/UDP read performance began to
suck because of a large number of request retransmits.  From tcpdump, the
retransmits are for read transactions which return data in a reasonable
time frame but are missing one or more fragments of the return packet.

The client is a PIII 550, dual CPU, 1GB RAM, eepro100 (tested with both
the e100 and eepro drivers) running a variety of kernels from the stock
Red Hat 7.3 (2.4.20-20.7 which exhibits the problem), and from stock
2.4.18 through 2.4.22.  2.4.20 and above exhibit this problem.

The server is a Network Appliance F820 running ONTAP 6.2.2.  Tests are
conducted when the F820 is not under a noticeable load.

From the difference in behavior between kernel revisions and by tcpdump, 
it is believed that the fragments are transmitted by the server.

The timings for different IO sizes for NFSv3 reads of a 70MB file:

Configuration             Seconds Real Time
-------------------------------------------
UDP, 32k                       1080
UDP,  8k                        420
UDP,  4k                        210
UDP,  1k                         40

TCP, 32k                          6.4


The NFSv2/UDP timings for 8k, 4k and 1k are almost identical to the
NFSv3/UDP timings.

The same test with 2.4.18 yields read times of around 6.3 seconds for
32k and 8k, NFSv2 and NFSv3 over UDP.

Setting rmem_max, rmem_default, wmem_max and wmem_default to either 524284
or 262142 makes no difference.

Setting netdev_max_backlog to 1200 (from 300) makes no difference.

Setting ipfrag_high_thresh to up to 4194304 makes no difference.


We have a mix of clients and servers, not all of which support NFS/TCP yet
so we can't globally set tcp in the automounter maps for our netapp mounts
or as "localoptions" in the autofs init script.  The mount patch submitted
by Steve Dickson on Aug 6, "NFS Mount Patch:  Making NFS over TCP the
default" is probably the immediate workaround that will at least make the
mounts that really matter to us work well again.  I'll test that next.


Is this a known problem?  Is there a patch already out there or in the
works that fixes this?  What other data would help drill into this
problem?


	Rob



-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-09-25 17:59 NFS/UDP slow read, lost fragments Robert L. Millner
@ 2003-09-25 20:22 ` Brian Mancuso
  2003-09-25 20:33 ` brianm
  2003-09-25 21:44 ` Brian Mancuso
  2 siblings, 0 replies; 12+ messages in thread
From: Brian Mancuso @ 2003-09-25 20:22 UTC (permalink / raw)
  To: Robert L. Millner; +Cc: nfs

On Thu, Sep 25, 2003 at 10:59:43AM -0700, Robert L. Millner wrote:
: Hello,
: 
: The problem I am seeing is similar to what was posted by Larry Sendlosky
: on Jun 27, "2.4.20-pre3 -> 2.4.21 : nfs client read performance broken"
: though I have not done as thorough a drill-down into the nature of the
: problem.
: 
: Somewhere between 2.4.19 and 2.4.20, NFS/UDP read performance began to
: suck because of a large number of request retransmits.  From tcpdump, the
: retransmits are for read transactions which return data in a reasonable
: time frame but are missing one or more fragments of the return packet.

This is because code that exponentially backs off the RTO for UDP RPC
was backported from the 2.[56] series into 2.4.20, and that code is
completely broken. Trond has a patch in his 2.6 patchset that fixes
most of these problems; however, it still has one flaw that can result
in a large number of unnecessary retransmits for RPC sessions with low
variance in RTT. The RTO is calculated as the filtered round trip time
plus a small constant times the mean deviation of round trip times.
However, because the RTT calculation code implements Karn's algorithm
(from TCP: no RTT sample is taken from responses to RPC requests that
have been retransmitted), the RTT estimate is never allowed to
increase: were a response to take longer than the measured RTT plus
the (assumed small) deviation, the packet would be retransmitted, and
the calculation that would have increased the measured RTT is never
done. Thus if a server's real RTT were to increase over time, initial
RTO values would never grow (for the measured RTT would never rise
beyond the minimum ever measured), and RPC requests would frequently
be retransmitted at least once. This can easily be remedied, however,
by TCP's technique of inheriting the RTO backoff from previous
transactions: create a new variable somewhere in the clnt structure
called, say, cl_backoff; each time an RPC transaction completes, store
the number of retransmits for that transaction (req->rq_nresend) in
cl_backoff; then calculate the RTO as rpc_calc_rto() left-shifted by
the number of retransmits for the current transaction (initially 0)
plus clnt->cl_backoff (the number of retransmits for the last
completed transaction).

The backported code mentioned above will also result in significantly
more EIO events for users with soft UDP mounts. Users seeing lots of
EIOs should see them diminish after these problems are fixed.

: Is this a known problem?  Is there a patch already out there or in the
: works that fixes this?  What other data would help drill into this
: problem?

I will post a patch tomorrow for 2.4.20.

Brian Mancuso



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-09-25 17:59 NFS/UDP slow read, lost fragments Robert L. Millner
  2003-09-25 20:22 ` Brian Mancuso
  2003-09-25 20:33 ` brianm
@ 2003-09-25 21:44 ` Brian Mancuso
  2003-09-25 23:31   ` Trond Myklebust
  2 siblings, 1 reply; 12+ messages in thread
From: Brian Mancuso @ 2003-09-25 21:44 UTC (permalink / raw)
  To: nfs

On Thu, Sep 25, 2003 at 10:59:43AM -0700, Robert L. Millner wrote:
: Hello,
: 
: The problem I am seeing is similar to what was posted by Larry Sendlosky
: on Jun 27, "2.4.20-pre3 -> 2.4.21 : nfs client read performance broken"
: though I have not done as thorough a drill-down into the nature of the
: problem.
: 
: Somewhere between 2.4.19 and 2.4.20, NFS/UDP read performance began to
: suck because of a large number of request retransmits.  From tcpdump, the
: retransmits are for read transactions which return data in a reasonable
: time frame but are missing one or more fragments of the return packet.

This is because code that exponentially backs off the RTO for UDP RPC
was backported from the 2.[56] series into 2.4.20, and that code is
completely broken. Trond has a patch in his 2.6 patchset that fixes
most of these problems; however, it still has one flaw that can result
in a large number of unnecessary retransmits for RPC sessions with low
variance in RTT. The RTO is calculated as the filtered round trip time
plus a small constant times the mean deviation of round trip times.
However, because the RTT calculation code implements Karn's algorithm
(from TCP: no RTT sample is taken from responses to RPC requests that
have been retransmitted), the RTT estimate is never allowed to
increase: were a response to take longer than the measured RTT plus
the (assumed small) deviation, the packet would be retransmitted, and
the calculation that would have increased the measured RTT is never
done. Thus if a server's real RTT were to increase over time, initial
RTO values would never grow (for the measured RTT would never rise
beyond the minimum ever measured), and RPC requests would frequently
be retransmitted at least once. This can easily be remedied, however,
by TCP's technique of inheriting the RTO backoff from previous
transactions: create a new variable somewhere in the clnt structure
called, say, cl_backoff; each time an RPC transaction completes, store
the number of retransmits for that transaction (req->rq_nresend) in
cl_backoff; then calculate the RTO as rpc_calc_rto() left-shifted by
the number of retransmits for the current transaction (initially 0)
plus clnt->cl_backoff (the number of retransmits for the last
completed transaction).

The backported code mentioned above will also result in significantly
more EIO events for users with soft UDP mounts. Users seeing lots of
EIOs should see them diminish after these problems are fixed.

: Is this a known problem?  Is there a patch already out there or in the
: works that fixes this?  What other data would help drill into this
: problem?

I will post a patch tomorrow for 2.4.20.

Brian Mancuso



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-09-25 21:44 ` Brian Mancuso
@ 2003-09-25 23:31   ` Trond Myklebust
  2003-09-26 21:07     ` Brian Mancuso
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2003-09-25 23:31 UTC (permalink / raw)
  To: Brian Mancuso; +Cc: nfs

>>>>> " " == Brian Mancuso <bmancuso@akamai.com> writes:

     > This can be easily remedied however by TCP's technique of
     > inheriting backoff of RTO from previous transactions: create a
     > new variable somewhere in the clnt structure called, say,
     > cl_backoff; Each time an RPC transaction completes, store the
     > number of retransmits for that transaction (req->rq_nresend) in
     > cl_backoff; calculate RTO to be rpc_calc_rto() left shifted by
     > the number of retransmits for this transaction (initially 0)
     > plus clnt-> cl_backoff (the number of retransmits for the last
     > completed transaction).

Right... There is already a variable set aside for this task in the
RTT code in the form of "ntimeouts".

The following patch sets up a scheme of the form described by Brian,
in combination with another fix to lengthen the window of time during
which we accept updates to the RTO estimate (Karn's algorithm states
that the window closes once you retransmit, whereas our current
algorithm closes the window once a request times out).

Could people give it a try?

Cheers,
  Trond

diff -u --recursive --new-file linux-2.4.23-pre5/include/linux/sunrpc/timer.h linux-2.4.23-01-fix_retrans/include/linux/sunrpc/timer.h
--- linux-2.4.23-pre5/include/linux/sunrpc/timer.h	2003-09-19 13:15:31.000000000 -0700
+++ linux-2.4.23-01-fix_retrans/include/linux/sunrpc/timer.h	2003-09-25 16:25:02.000000000 -0700
@@ -23,14 +23,9 @@
 extern void rpc_update_rtt(struct rpc_rtt *rt, int timer, long m);
 extern long rpc_calc_rto(struct rpc_rtt *rt, int timer);
 
-static inline void rpc_inc_timeo(struct rpc_rtt *rt)
+static inline void rpc_set_timeo(struct rpc_rtt *rt, int ntimeo)
 {
-	atomic_inc(&rt->ntimeouts);
-}
-
-static inline void rpc_clear_timeo(struct rpc_rtt *rt)
-{
-	atomic_set(&rt->ntimeouts, 0);
+	atomic_set(&rt->ntimeouts, ntimeo);
 }
 
 static inline int rpc_ntimeo(struct rpc_rtt *rt)
diff -u --recursive --new-file linux-2.4.23-pre5/include/linux/sunrpc/xprt.h linux-2.4.23-01-fix_retrans/include/linux/sunrpc/xprt.h
--- linux-2.4.23-pre5/include/linux/sunrpc/xprt.h	2003-07-29 16:52:26.000000000 -0700
+++ linux-2.4.23-01-fix_retrans/include/linux/sunrpc/xprt.h	2003-09-19 13:15:31.000000000 -0700
@@ -115,7 +115,7 @@
 
 	long			rq_xtime;	/* when transmitted */
 	int			rq_ntimeo;
-	int			rq_nresend;
+	int			rq_ntrans;
 };
 #define rq_svec			rq_snd_buf.head
 #define rq_slen			rq_snd_buf.len
diff -u --recursive --new-file linux-2.4.23-pre5/net/sunrpc/xprt.c linux-2.4.23-01-fix_retrans/net/sunrpc/xprt.c
--- linux-2.4.23-pre5/net/sunrpc/xprt.c	2003-07-29 16:54:19.000000000 -0700
+++ linux-2.4.23-01-fix_retrans/net/sunrpc/xprt.c	2003-09-25 16:25:02.000000000 -0700
@@ -138,18 +138,21 @@
 static int
 __xprt_lock_write(struct rpc_xprt *xprt, struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	if (!xprt->snd_task) {
 		if (xprt->nocong || __xprt_get_cong(xprt, task)) {
 			xprt->snd_task = task;
-			if (task->tk_rqstp)
-				task->tk_rqstp->rq_bytes_sent = 0;
+			if (req) {
+				req->rq_bytes_sent = 0;
+				req->rq_ntrans++;
+			}
 		}
 	}
 	if (xprt->snd_task != task) {
 		dprintk("RPC: %4d TCP write queue full\n", task->tk_pid);
 		task->tk_timeout = 0;
 		task->tk_status = -EAGAIN;
-		if (task->tk_rqstp && task->tk_rqstp->rq_nresend)
+		if (req && req->rq_ntrans)
 			rpc_sleep_on(&xprt->resend, task, NULL, NULL);
 		else
 			rpc_sleep_on(&xprt->sending, task, NULL, NULL);
@@ -183,9 +186,12 @@
 			return;
 	}
 	if (xprt->nocong || __xprt_get_cong(xprt, task)) {
+		struct rpc_rqst *req = task->tk_rqstp;
 		xprt->snd_task = task;
-		if (task->tk_rqstp)
-			task->tk_rqstp->rq_bytes_sent = 0;
+		if (req) {
+			req->rq_bytes_sent = 0;
+			req->rq_ntrans++;
+		}
 	}
 }
 
@@ -592,12 +598,12 @@
 	if (!xprt->nocong) {
 		xprt_adjust_cwnd(xprt, copied);
 		__xprt_put_cong(xprt, req);
-	       	if (!req->rq_nresend) {
+	       	if (req->rq_ntrans == 1) {
 			int timer = rpcproc_timer(clnt, task->tk_msg.rpc_proc);
 			if (timer)
 				rpc_update_rtt(&clnt->cl_rtt, timer, (long)jiffies - req->rq_xtime);
 		}
-		rpc_clear_timeo(&clnt->cl_rtt);
+		rpc_set_timeo(&clnt->cl_rtt, req->rq_ntrans - 1);
 	}
 
 #ifdef RPC_PROFILE
@@ -1063,7 +1069,7 @@
 		goto out;
 
 	xprt_adjust_cwnd(req->rq_xprt, -ETIMEDOUT);
-	req->rq_nresend++;
+	__xprt_put_cong(xprt, req);
 
 	dprintk("RPC: %4d xprt_timer (%s request)\n",
 		task->tk_pid, req ? "pending" : "backlogged");
@@ -1219,6 +1225,7 @@
 	if (!xprt->nocong) {
 		task->tk_timeout = rpc_calc_rto(&clnt->cl_rtt,
 				rpcproc_timer(clnt, task->tk_msg.rpc_proc));
+		task->tk_timeout <<= rpc_ntimeo(&clnt->cl_rtt);
 		task->tk_timeout <<= clnt->cl_timeout.to_retries
 			- req->rq_timeout.to_retries;
 		if (task->tk_timeout > req->rq_timeout.to_maxval)




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-09-25 23:31   ` Trond Myklebust
@ 2003-09-26 21:07     ` Brian Mancuso
  2003-09-27  5:02       ` Robert L. Millner
  2003-10-14  1:20       ` Steve Dickson
  0 siblings, 2 replies; 12+ messages in thread
From: Brian Mancuso @ 2003-09-26 21:07 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: nfs

On Thu, Sep 25, 2003 at 04:31:05PM -0700, Trond Myklebust wrote:
: 
: Right... There is already a variable set aside for this task in the
: RTT code in the form of "ntimeouts".
: 
: The following patch sets up a scheme of the form described by Brian,
: in combination with another fix to lengthen the window of time during
: which we accept updates to the RTO estimate (Karn's algorithm states
: that the window closes once you retransmit, whereas our current
: algorithm closes the window once a request times out).
: 
: Could people give it a try?
: 
: Cheers,
:   Trond

Hi Trond,

This patch is great! One thing though: there is one case with this
patch in which a request terminates but the client's ntimeouts value
is not updated: when a request exhausts its retries. I think future
requests should inherit the ntimeouts factor provided by timed-out
requests for their RTO calculations. Here is a patch (against
linux-2.4.23-pre5 with your patch) that implements this (you might
know of a cleaner way of doing this..):

--- clnt.c.1	Fri Sep 26 19:58:30 2003
+++ clnt.c	Fri Sep 26 20:00:38 2003
@@ -699,6 +699,7 @@ call_status(struct rpc_task *task)
 static void
 call_timeout(struct rpc_task *task)
 {
+	struct rpc_rqst	*req = task->tk_rqstp;
 	struct rpc_clnt	*clnt = task->tk_client;
 	struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;
 
@@ -707,6 +708,7 @@ call_timeout(struct rpc_task *task)
 		goto retry;
 	}
 	to->to_retries = clnt->cl_timeout.to_retries;
+	rpc_set_timeo(&clnt->cl_rtt, req->rq_ntrans - 1);
 
 	dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
 	if (clnt->cl_softrtry) {

I tested your patch. Here is some documentation of my testing:

I added a printk to the code, mounted an NFS volume with a retrans=8
option, and did an ls on a directory with 100,000 entries in it. This
takes awhile, and leaves ample time for the system to develop a good
RTT estimate and for one to interfere with client/server connectivity
so as to induce retransmits with exponential backoff:

01  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 0, rto = 4
02  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 0, rto = 4
...
03  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 0, rto = 4
04  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 2, rto = 16
05  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 3, rto = 32
06  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 4, rto = 64
07  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 5, rto = 128
08  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 6, rto = 256
09  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 7, rto = 512
10  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 8, rto = 1024
11  nfs: server 172.18.192.11 not responding, timed out
12  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 8, ntrans = 0, rto = 1024
13  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 8, ntrans = 1, rto = 2048
14  DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 8, ntrans = 2, rto = 4096

The numbers on the left are line numbers I manually inserted for
illustration; they were not produced by the printk. The '...'
indicates lots of lines, identical to the preceding, were removed.
There is one line per RTO calculation (i.e. per transmit). Lines 1-2
show RTO calculations for RPC transactions that saw no loss. 'rtt' is
the value returned by rpc_calc_rto(), and represents the estimated
round trip time plus a constant factor times its mean deviation.
'ntimeo' is the ntimeouts value in the client structure, set to the
number of transmits minus one for the last terminated (successful or
otherwise) request. 'ntrans' is effectively the number of retransmits
of the request in question. 'rto' is the calculated RTO:
rtt << (ntimeo + ntrans).

The server was disconnected at (before) line 3. Lines 4-10 show the
client exponentially backing off RTO. Line 11 represents an exhaustion
of retransmit attempts and an EIO returned to the application, ls
(this is a soft mount). Lines 12-14 show RTO calculations for a new
request. Note that the ntimeo on line 12 inherited its value from the
ntrans value of the request that terminated on line 10 (this kernel
had my above modification).

Brian




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-09-26 21:07     ` Brian Mancuso
@ 2003-09-27  5:02       ` Robert L. Millner
  2003-10-14  1:20       ` Steve Dickson
  1 sibling, 0 replies; 12+ messages in thread
From: Robert L. Millner @ 2003-09-27  5:02 UTC (permalink / raw)
  To: nfs

> The following patch sets up a scheme of the form described by Brian,
> in combination with another fix to lengthen the window of time during
> which we accept updates to the RTO estimate (Karn's algorithm states
> that the window closes once you retransmit, whereas our current
> algorithm closes the window once a request times out).


Still seeing an inordinate number of retransmits (25% of the reads are
duplicates) with 2.4.23-pre5 and the patch on my test case (copy a 70MB
file from the server).

It's interesting to note that a dual P4 Xeon with an e1000 attached to the
same Foundry switch as the Netapp also has a similar retransmit rate
(20%).

If the same test is run against a different Netapp (running 6.3.1R1),
there's still a large number of retransmits but down to 17%.

Examining this problem further: if the server is Linux (stock Red Hat
7.3, 2.4.20-18.7smp, dual Xeon 2.4GHz), then there are no retransmits
and the 70MB copy takes its expected 6.4 sec with the client running
either the stock Red Hat kernel or a patched 2.4.23-pre5.

If the server is a Sun 880, running Solaris 8, then there are no
retransmits and the 70MB copy takes around 6.5 sec.

If the Netapp is placed under a heavy load (CPU at between 65% and 100% -
which traditionally kills Netapp performance), the 2.4.23-pre5+patch
client has roughly the same percentage of retransmits as it does against
an otherwise quiescent Netapp; however, the test takes 470 seconds instead
of 550 seconds.


As a sanity check against transient network problems, the problem still
consistently goes away if the client boots 2.2.19 and consistently repeats
if the client boots 2.4.23-pre5+patch.


I guess the next step is to go over packet dumps from both the client and
another host on a mirrored network port to see what's interesting about
the fragments being dropped.  That may point to why a heavily loaded
Netapp performs this test faster than a quiescent one.


	Rob





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-09-26 21:07     ` Brian Mancuso
  2003-09-27  5:02       ` Robert L. Millner
@ 2003-10-14  1:20       ` Steve Dickson
  2003-10-14 14:52         ` Trond Myklebust
  1 sibling, 1 reply; 12+ messages in thread
From: Steve Dickson @ 2003-10-14  1:20 UTC (permalink / raw)
  To: Brian Mancuso; +Cc: Trond Myklebust, nfs

Brian Mancuso wrote:

>On Thu, Sep 25, 2003 at 04:31:05PM -0700, Trond Myklebust wrote:
>: 
>:Could people give it a try?
>: 
> Here is a patch (against linux-2.4.23-pre5 with your patch) that implements this (you might
>know of a cleaner way of doing this..):
>  
>
I've done some testing with both of these patches and here is
what I've found.... The test consisted of 10 client threads reading 10
different 100MB files on the same filesystem simultaneously. The
machines were dual x86 boxes on a private 100BT network.

I used the linux-2.4.23-pre7 kernel with both Trond's and Brian's
patches. The test was done with hard and soft mounts using v3 over
UDP.  Here are the results:

linux-2.4.23-pre7
Soft mount:
EIO calls      retrans   rate
0   128042     2105      6.20Mbps to 6.08Mbps
0   128040     1916      6.28Mbps to 6.09Mbps
0   128044     1936      6.37Mbps to 6.12Mbps
Hard mount:
    128048     1995      6.29Mbp  to 6.09Mbps
    128033     2075      6.21Mbps to 6.07Mbps
    128037     2015      6.23Mbps to 6.10Mbps

with Trond's patch
Soft mount:
EIO calls      retrans   rate
0   128038     1954      6.25Mbps to 6.12Mbps
0   128039     1789      6.32Mbp  to 6.14Mbps
0   128042     1853      6.25Mbps to 6.13Mbps
Hard mount:
    128039     1880      6.28Mbps to 6.11Mbps
    128042     2019      6.29Mbps to 6.12Mbps
    128042     1883      6.32Mbps to 6.14Mbps

with Trond's and Brian's patch
Soft mount:
EIO calls      retrans   rate
0   128036     1943      6.31Mbps to 6.14Mbps
0   128042     1802      6.35Mbps to 6.19Mbps
0   128047     1782      6.35Mbps to 6.14Mbps
Hard mount:
    128042     1953      6.28Mbps to 6.10Mbps
    128038     1752      6.48Mbps to 6.17Mbps
    128034     1978      6.26Mbps to 6.11Mbps

As you can see, the patches do help a little bit, but
not as much as I would expect.... Maybe it was the
type of tests I ran, or maybe I didn't run the tests long enough
(each test ran for about ~2mins), or maybe my expectations
are too high... But it just doesn't seem that these patches really
help that much in bringing down the retransmissions...

SteveD.

-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
SourceForge.net hosts over 70,000 Open Source Projects.
See the people who have HELPED US provide better services:
Click here: http://sourceforge.net/supporters.php
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-10-14  1:20       ` Steve Dickson
@ 2003-10-14 14:52         ` Trond Myklebust
  2003-10-15  3:17           ` Steve Dickson
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2003-10-14 14:52 UTC (permalink / raw)
  To: Steve Dickson; +Cc: Brian Mancuso, nfs

>>>>> " " == Steve Dickson <SteveD@RedHat.com> writes:

     > As you can see, the patches do help a little bit, but not
     > as much as I would expect.... Maybe it was the type of tests
     > I ran, or maybe I didn't run the tests long enough (each test
     > ran for about ~2mins), or maybe my expectations are too
     > high... But it just doesn't seem that these patches really
     > help that much in bringing down the retransmissions...

Thanks for putting numbers to this Steve!

Could you try using the slight variation in

 http://www.fys.uio.no/~trondmy/src/2.4.23-pre7/linux-2.4.23-03-fix_retrans.dif

As you can see, that has a slight change to rpc_set_timeo(): if it
sees that several retransmissions have occurred for a given request,
then it keeps the value of rtt->ntimeouts high for a couple of extra
iterations in order to allow the new value of the timeout to
converge.
The reason for doing this is that if the round trip time changes
suddenly, then it will take ~ 8 measurements before you converge on
the new value (due to the weighting done in the algorithm).

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-10-14 14:52         ` Trond Myklebust
@ 2003-10-15  3:17           ` Steve Dickson
  2003-10-15  3:27             ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Steve Dickson @ 2003-10-15  3:17 UTC (permalink / raw)
  To: trond.myklebust; +Cc: nfs



Trond Myklebust wrote:

>Could you try using the slight variation in
>  
>
It didn't seem to help much... here are the numbers...

linux-2.4.23-pre7
Soft mount:
EIO calls      retrans   rate
0   128042     2105      6.20Mbps to 6.08Mbps
0   128040     1916      6.28Mbps to 6.09Mbps
0   128044     1936      6.37Mbps to 6.12Mbps
Hard mount:
    128048     1995      6.29Mbps to 6.09Mbps
    128033     2075      6.21Mbps to 6.07Mbps
    128037     2015      6.23Mbps to 6.10Mbps

with Trond's Original Patch
Soft mount:
EIO calls      retrans   rate
0   128043     1817      6.36Mbps to 6.17Mbps
0   128036     1630      6.38Mbps to 6.21Mbps
0   128038     1848      6.35Mbps to 6.18Mbps
Hard mount:
    128043     1706      6.30Mbps to 6.22Mbps
    128041     1778      6.33Mbps to 6.18Mbps
    128041     1891      6.31Mbps to 6.16Mbps

with Trond's Original Patch w/ Brian's patch
Soft mount:
EIO calls      retrans   rate
0   128039     1626      6.45Mbps to 6.22Mbps
0   128043     1896      6.31Mbps to 6.20Mbps
0   128036     1948      6.32Mbps to 6.15Mbps
Hard mount:
    128034     1792      6.41Mbps to 6.17Mbps
    128039     1823      6.29Mbps to 6.17Mbps
    128043     1903      6.42Mbps to 6.18Mbps

with Trond's New Patch
Soft mount:
EIO calls      retrans   rate
    128041     1954      6.33Mbps to 6.11Mbps
    128040     1897      6.31Mbps to 6.16Mbps
    128039     1758      6.26Mbps to 6.17Mbps
Hard mount:
    128043     1808      6.40Mbps to 6.19Mbps
    128033     1671      6.37Mbps to 6.20Mbps
    128042     1852      6.37Mbps to 6.17Mbps

Again, this is 10 threads reading 10 100MB files from the same filesystem.

SteveD.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: NFS/UDP slow read, lost fragments
  2003-10-15  3:17           ` Steve Dickson
@ 2003-10-15  3:27             ` Trond Myklebust
  0 siblings, 0 replies; 12+ messages in thread
From: Trond Myklebust @ 2003-10-15  3:27 UTC (permalink / raw)
  To: Steve Dickson; +Cc: NFS maillist

>>>>> " " == Steve Dickson <SteveD@RedHat.com> writes:

     > Trond Myklebust wrote:

    >> Could you try using the slight variation in
    >>
    >>
     > It didn't seem to help much... here the numbers...

Oh well, a 1.5% retransmission rate is in any case not a major
problem. It is more than we should see given that we are setting a
timeout of > 4 standard deviations, but it is consistent with what I
see in my own tests...

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: NFS/UDP slow read, lost fragments
@ 2003-09-25 20:11 Lever, Charles
  0 siblings, 0 replies; 12+ messages in thread
From: Lever, Charles @ 2003-09-25 20:11 UTC (permalink / raw)
  To: Robert L. Millner; +Cc: nfs

hi robert-

if packets are being lost in server replies, then i
don't think this is strictly a client problem.

the newer clients may be more likely to overload a server
or network by sending more requests than can be buffered.
look into network cleanliness, and contact your local
NetApp SE for help in getting diagnostic info from your
filer.

especially because this problem goes away with TCP, i
would suspect a networking problem first.

> -----Original Message-----
> From: Robert L. Millner [mailto:rmillner@transmeta.com]
> Sent: Thursday, September 25, 2003 2:00 PM
> To: nfs@lists.sourceforge.net
> Subject: [NFS] NFS/UDP slow read, lost fragments
>
> Hello,
>
> The problem I am seeing is similar to what was posted by
> Larry Sendlosky on Jun 27, "2.4.20-pre3 -> 2.4.21 : nfs client
> read performance broken", though I have not done as thorough a
> drill-down into the nature of the problem.
>
> Somewhere between 2.4.19 and 2.4.20, NFS/UDP read performance
> began to suck because of a large number of request retransmits.
> From tcpdump, the retransmits are for read transactions which
> return data in a reasonable time frame but are missing one or
> more fragments of the return packet.
>
> The client is a PIII 550, dual CPU, 1GB RAM, eepro100 (tested
> with both the e100 and eepro drivers) running a variety of
> kernels from the stock Red Hat 7.3 (2.4.20-20.7, which exhibits
> the problem), and from stock 2.4.18 through 2.4.22.  2.4.20 and
> above exhibit this problem.
>
> The server is a Network Appliance F820 running ONTAP 6.2.2.
> Tests are conducted when the F820 is not under a noticeable load.
>
> From the difference in behavior between kernel revisions and by
> tcpdump, it is believed that the fragments are transmitted by
> the server.
>
> The timings for different IO sizes for NFSv3 reads of a 70MB file:
>
> Configuration             Seconds Real Time
> -------------------------------------------
> UDP, 32k                       1080
> UDP,  8k                        420
> UDP,  4k                        210
> UDP,  1k                         40
>
> TCP, 32k                          6.4
>
> The NFSv2/UDP timings for 8k, 4k and 1k are almost identical to
> the NFSv3/UDP timings.
>
> The same test with 2.4.18 yields read times of around 6.3 seconds
> for 32k and 8k NFSv2 and NFSv3 over UDP.
>
> Setting rmem_max, rmem_default, wmem_max and wmem_default to
> either 524284 or 262142 makes no difference.
>
> Setting netdev_max_backlog to 1200 (from 300) makes no difference.
>
> Setting ipfrag_high_thresh to up to 4194304 makes no difference.
>
> We have a mix of clients and servers, not all of which support
> NFS/TCP yet, so we can't globally set tcp in the automounter maps
> for our netapp mounts or as "localoptions" in the autofs init
> script.  The mount patch submitted by Steve Dickson on Aug 6,
> "NFS Mount Patch:  Making NFS over TCP the default" is probably
> the immediate workaround that will at least make the mounts that
> really matter to us work well again.  I'll test that next.
>
> Is this a known problem?  Is there a patch already out there or
> in the works that fixes this?  What other data would help drill
> into this problem?
>
> 	Rob


-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2003-10-15  3:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-09-25 17:59 NFS/UDP slow read, lost fragments Robert L. Millner
2003-09-25 20:22 ` Brian Mancuso
2003-09-25 20:33 ` brianm
2003-09-25 21:44 ` Brian Mancuso
2003-09-25 23:31   ` Trond Myklebust
2003-09-26 21:07     ` Brian Mancuso
2003-09-27  5:02       ` Robert L. Millner
2003-10-14  1:20       ` Steve Dickson
2003-10-14 14:52         ` Trond Myklebust
2003-10-15  3:17           ` Steve Dickson
2003-10-15  3:27             ` Trond Myklebust
2003-09-25 20:11 Lever, Charles
