* NFS/UDP slow read, lost fragments
@ 2003-09-25 17:59 Robert L. Millner
2003-09-25 20:22 ` Brian Mancuso
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Robert L. Millner @ 2003-09-25 17:59 UTC (permalink / raw)
To: nfs
Hello,
The problem I am seeing is similar to what was posted by Larry Sendlosky
on Jun 27, "2.4.20-pre3 -> 2.4.21 : nfs client read performance broken"
though I have not done as thorough a drill-down into the nature of the
problem.
Somewhere between 2.4.19 and 2.4.20, NFS/UDP read performance began to
suck because of a large number of request retransmits. From tcpdump, the
retransmits are for read transactions which return data in a reasonable
time frame but are missing one or more fragments of the return packet.
The client is a PIII 550, dual CPU, 1GB RAM, eepro100 (tested with both
the e100 and eepro drivers) running a variety of kernels from the stock
Red Hat 7.3 (2.4.20-20.7 which exhibits the problem), and from stock
2.4.18 through 2.4.22. 2.4.20 and above exhibit this problem.
The server is a Network Appliance F820 running ONTAP 6.2.2. Tests are
conducted when the F820 is not under a noticeable load.
From the difference in behavior between kernel revisions and by tcpdump,
it is believed that the fragments are transmitted by the server.
The timings for different IO sizes for NFSv3 reads of a 70MB file:
Configuration Seconds Real Time
-------------------------------------------
UDP, 32k 1080
UDP, 8k 420
UDP, 4k 210
UDP, 1k 40
TCP, 32k 6.4
The NFSv2/UDP timings for 8k, 4k and 1k are almost identical to the
NFSv3/UDP timings.
The same test with 2.4.18 yields read times of around 6.3 seconds for 32k
and 8k, both NFSv2 and NFSv3 over UDP.
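A back-of-the-envelope sketch (my illustration, not from the thread) of why larger UDP read sizes are hit so much harder: each NFS/UDP read reply is a single IP datagram fragmented at the link MTU, and losing any one fragment costs the whole reply. The MTU value and the 1% per-fragment loss rate are assumptions for illustration only:

```python
MTU = 1500  # assumed Ethernet MTU; each fragment carries a 20-byte IP header

def fragments(read_size):
    # number of IP fragments needed for one read reply of read_size bytes
    return -(-read_size // (MTU - 20))   # ceiling division

def p_reply_lost(read_size, p_frag_loss):
    # probability that at least one fragment is lost, forcing a full
    # retransmit of the RPC request (IP cannot retransmit single fragments)
    return 1 - (1 - p_frag_loss) ** fragments(read_size)

for size in (1024, 4096, 8192, 32768):
    print(size, fragments(size), round(p_reply_lost(size, 0.01), 3))
```

With an assumed 1% fragment loss, a 32k reply (23 fragments) is lost over 20% of the time, while a 1k reply almost always survives, which matches the shape of the timing table above.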
Setting rmem_max, rmem_default, wmem_max and wmem_default to either 524284
or 262142 makes no difference.
Setting netdev_max_backlog to 1200 (from 300) makes no difference.
Setting ipfrag_high_thresh to up to 4194304 makes no difference.
We have a mix of clients and servers, not all of which support NFS/TCP yet,
so we can't globally set tcp in the automounter maps for our netapp mounts
or as "localoptions" in the autofs init script. The mount patch submitted
by Steve Dickson on Aug 6, "NFS Mount Patch: Making NFS over TCP the
default" is probably the immediate workaround that will at least make the
mounts that really matter to us work well again. I'll test that next.
Is this a known problem? Is there a patch already out there or in the
works that fixes this? What other data would help drill into this
problem?
Rob
-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: NFS/UDP slow read, lost fragments
  2003-09-25 17:59 NFS/UDP slow read, lost fragments Robert L. Millner
@ 2003-09-25 20:22 ` Brian Mancuso
  2003-09-25 20:33 ` brianm
  2003-09-25 21:44 ` Brian Mancuso
  2 siblings, 0 replies; 12+ messages in thread
From: Brian Mancuso @ 2003-09-25 20:22 UTC (permalink / raw)
To: Robert L. Millner; +Cc: nfs

On Thu, Sep 25, 2003 at 10:59:43AM -0700, Robert L. Millner wrote:
: Somewhere between 2.4.19 and 2.4.20, NFS/UDP read performance began to
: suck because of a large number of request retransmits. From tcpdump, the
: retransmits are for read transactions which return data in a reasonable
: time frame but are missing one or more fragments of the return packet.

This is because code that exponentially backs off RTO for UDP RPC was
backported from the 2.[56] series in 2.4.20, and this code is completely
broken. Trond has a patch in his patchset for 2.6 that significantly
fixes these problems; however, this patch still has one problem that can
result in a large number of unnecessary retransmits for RPC sessions
that have low variance in RTT.

RTO is calculated to be the filtered round trip time plus a small
constant times the mean deviation of round trip times. However, because
the RTT calculation code implements Karn's algorithm (from TCP: RTT
calculation isn't done for responses to RPC requests that have been
retransmitted), RTT is never allowed to increase: were a response to
take longer than measured RTT plus the (assumed small) deviation, the
packet would be retransmitted, and the calculation that would increase
measured RTT would never be done.

Thus if a server's real RTT were to increase over time, initial RTO
values would never grow (measured RTT would never grow beyond the
minimum ever measured), and RPC requests would frequently be
retransmitted at least once.

This can easily be remedied by TCP's technique of inheriting RTO backoff
from previous transactions: create a new variable somewhere in the clnt
structure called, say, cl_backoff; each time an RPC transaction
completes, store the number of retransmits for that transaction
(req->rq_nresend) in cl_backoff; calculate RTO to be rpc_calc_rto()
left-shifted by the number of retransmits for this transaction
(initially 0) plus clnt->cl_backoff (the number of retransmits for the
last completed transaction).

The backported code mentioned above will also result in significantly
more EIO events for users with soft UDP mounts. Users seeing lots of
EIOs should see them diminish after these problems are fixed.

: Is this a known problem? Is there a patch already out there or in the
: works that fixes this? What other data would help drill into this
: problem?

I will post a patch tomorrow for 2.4.20.

Brian Mancuso
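The cl_backoff inheritance scheme described above can be sketched in a few lines of userspace arithmetic. This is a hedged illustration, not kernel code: the RpcClient class and its methods are hypothetical; only the names cl_backoff, rq_nresend and rpc_calc_rto come from the mail, and base_rto stands in for whatever rpc_calc_rto() would return.

```python
class RpcClient:
    def __init__(self, base_rto):
        self.base_rto = base_rto  # stand-in for rpc_calc_rto(): srtt + k * mdev
        self.cl_backoff = 0       # retransmit count of the last completed request

    def rto_for(self, rq_nresend):
        # RTO = rpc_calc_rto() << (retransmits of this request + inherited backoff)
        return self.base_rto << (rq_nresend + self.cl_backoff)

    def complete(self, rq_nresend):
        # on completion, the next request inherits this request's backoff
        self.cl_backoff = rq_nresend

clnt = RpcClient(base_rto=4)
print(clnt.rto_for(0))       # fresh request, no inherited backoff: 4
clnt.complete(rq_nresend=3)  # last request needed 3 retransmits
print(clnt.rto_for(0))       # new request starts already backed off: 4 << 3 = 32
print(clnt.rto_for(1))       # its first retransmit: 4 << 4 = 64
```

The point of the design is that a request issued right after a lossy transaction starts with a long timeout instead of re-learning the backoff from scratch on every request.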
* Re: NFS/UDP slow read, lost fragments
  2003-09-25 21:44 ` Brian Mancuso
@ 2003-09-25 23:31 ` Trond Myklebust
  2003-09-26 21:07 ` Brian Mancuso
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2003-09-25 23:31 UTC (permalink / raw)
To: Brian Mancuso; +Cc: nfs

>>>>> " " == Brian Mancuso <bmancuso@akamai.com> writes:

> This can be easily remedied however by TCP's technique of
> inheriting backoff of RTO from previous transactions: create a
> new variable somewhere in the clnt structure called, say,
> cl_backoff; Each time an RPC transaction completes, store the
> number of retransmits for that transaction (req->rq_nresend) in
> cl_backoff; calculate RTO to be rpc_calc_rto() left shifted by
> the number of retransmits for this transaction (initially 0)
> plus clnt->cl_backoff (the number of retransmits for the last
> completed transaction).

Right... There is already a variable set aside for this task in the
RTT code in the form of "ntimeouts".

The following patch sets up a scheme of the form described by Brian,
in combination with another fix to lengthen the window of time during
which we accept updates to the RTO estimate (Karn's algorithm states
that the window closes once you retransmit, whereas our current
algorithm closes the window once a request times out).

Could people give it a try?

Cheers,
  Trond

diff -u --recursive --new-file linux-2.4.23-pre5/include/linux/sunrpc/timer.h linux-2.4.23-01-fix_retrans/include/linux/sunrpc/timer.h
--- linux-2.4.23-pre5/include/linux/sunrpc/timer.h	2003-09-19 13:15:31.000000000 -0700
+++ linux-2.4.23-01-fix_retrans/include/linux/sunrpc/timer.h	2003-09-25 16:25:02.000000000 -0700
@@ -23,14 +23,9 @@
 extern void rpc_update_rtt(struct rpc_rtt *rt, int timer, long m);
 extern long rpc_calc_rto(struct rpc_rtt *rt, int timer);
 
-static inline void rpc_inc_timeo(struct rpc_rtt *rt)
+static inline void rpc_set_timeo(struct rpc_rtt *rt, int ntimeo)
 {
-	atomic_inc(&rt->ntimeouts);
-}
-
-static inline void rpc_clear_timeo(struct rpc_rtt *rt)
-{
-	atomic_set(&rt->ntimeouts, 0);
+	atomic_set(&rt->ntimeouts, ntimeo);
 }
 
 static inline int rpc_ntimeo(struct rpc_rtt *rt)
diff -u --recursive --new-file linux-2.4.23-pre5/include/linux/sunrpc/xprt.h linux-2.4.23-01-fix_retrans/include/linux/sunrpc/xprt.h
--- linux-2.4.23-pre5/include/linux/sunrpc/xprt.h	2003-07-29 16:52:26.000000000 -0700
+++ linux-2.4.23-01-fix_retrans/include/linux/sunrpc/xprt.h	2003-09-19 13:15:31.000000000 -0700
@@ -115,7 +115,7 @@
 
 	long			rq_xtime;	/* when transmitted */
 	int			rq_ntimeo;
-	int			rq_nresend;
+	int			rq_ntrans;
 };
 #define rq_svec			rq_snd_buf.head
 #define rq_slen			rq_snd_buf.len
diff -u --recursive --new-file linux-2.4.23-pre5/net/sunrpc/xprt.c linux-2.4.23-01-fix_retrans/net/sunrpc/xprt.c
--- linux-2.4.23-pre5/net/sunrpc/xprt.c	2003-07-29 16:54:19.000000000 -0700
+++ linux-2.4.23-01-fix_retrans/net/sunrpc/xprt.c	2003-09-25 16:25:02.000000000 -0700
@@ -138,18 +138,21 @@
 static int
 __xprt_lock_write(struct rpc_xprt *xprt, struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	if (!xprt->snd_task) {
 		if (xprt->nocong || __xprt_get_cong(xprt, task)) {
 			xprt->snd_task = task;
-			if (task->tk_rqstp)
-				task->tk_rqstp->rq_bytes_sent = 0;
+			if (req) {
+				req->rq_bytes_sent = 0;
+				req->rq_ntrans++;
+			}
 		}
 	}
 	if (xprt->snd_task != task) {
 		dprintk("RPC: %4d TCP write queue full\n", task->tk_pid);
 		task->tk_timeout = 0;
 		task->tk_status = -EAGAIN;
-		if (task->tk_rqstp && task->tk_rqstp->rq_nresend)
+		if (req && req->rq_ntrans)
 			rpc_sleep_on(&xprt->resend, task, NULL, NULL);
 		else
 			rpc_sleep_on(&xprt->sending, task, NULL, NULL);
@@ -183,9 +186,12 @@
 		return;
 	}
 	if (xprt->nocong || __xprt_get_cong(xprt, task)) {
+		struct rpc_rqst *req = task->tk_rqstp;
 		xprt->snd_task = task;
-		if (task->tk_rqstp)
-			task->tk_rqstp->rq_bytes_sent = 0;
+		if (req) {
+			req->rq_bytes_sent = 0;
+			req->rq_ntrans++;
+		}
 	}
 }
 
@@ -592,12 +598,12 @@
 	if (!xprt->nocong) {
 		xprt_adjust_cwnd(xprt, copied);
 		__xprt_put_cong(xprt, req);
-		if (!req->rq_nresend) {
+		if (req->rq_ntrans == 1) {
 			int timer = rpcproc_timer(clnt, task->tk_msg.rpc_proc);
 			if (timer)
 				rpc_update_rtt(&clnt->cl_rtt, timer, (long)jiffies - req->rq_xtime);
 		}
-		rpc_clear_timeo(&clnt->cl_rtt);
+		rpc_set_timeo(&clnt->cl_rtt, req->rq_ntrans - 1);
 	}
 
 #ifdef RPC_PROFILE
@@ -1063,7 +1069,7 @@
 		goto out;
 
 	xprt_adjust_cwnd(req->rq_xprt, -ETIMEDOUT);
-	req->rq_nresend++;
+	__xprt_put_cong(xprt, req);
 
 	dprintk("RPC: %4d xprt_timer (%s request)\n",
 		task->tk_pid, req ? "pending" : "backlogged");
@@ -1219,6 +1225,7 @@
 	if (!xprt->nocong) {
 		task->tk_timeout = rpc_calc_rto(&clnt->cl_rtt,
 				rpcproc_timer(clnt, task->tk_msg.rpc_proc));
+		task->tk_timeout <<= rpc_ntimeo(&clnt->cl_rtt);
 		task->tk_timeout <<= clnt->cl_timeout.to_retries
 			- req->rq_timeout.to_retries;
 		if (task->tk_timeout > req->rq_timeout.to_maxval)
* Re: NFS/UDP slow read, lost fragments
  2003-09-25 23:31 ` Trond Myklebust
@ 2003-09-26 21:07 ` Brian Mancuso
  2003-09-27  5:02 ` Robert L. Millner
  2003-10-14  1:20 ` Steve Dickson
  0 siblings, 2 replies; 12+ messages in thread
From: Brian Mancuso @ 2003-09-26 21:07 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs

On Thu, Sep 25, 2003 at 04:31:05PM -0700, Trond Myklebust wrote:
: The following patch sets up a scheme of the form described by Brian,
: in combination with another fix to lengthen the window of time during
: which we accept updates to the RTO estimate (Karn's algorithm states
: that the window closes once you retransmit, whereas our current
: algorithm closes the window once a request times out).
:
: Could people give it a try?

Hi Trond,

This patch is great! One thing though: there is one case with this
patch in which a request terminates but the client's ntimeouts value is
not updated: when a request exhausts its retries. I think future
requests should inherit the ntimeouts factor provided by timed-out
requests for their RTO calculations. Here is a patch (against
linux-2.4.23-pre5 with your patch) that implements this (you might know
of a cleaner way of doing this..):

--- clnt.c.1	Fri Sep 26 19:58:30 2003
+++ clnt.c	Fri Sep 26 20:00:38 2003
@@ -699,6 +699,7 @@ call_status(struct rpc_task *task)
 static void
 call_timeout(struct rpc_task *task)
 {
+	struct rpc_rqst *req = task->tk_rqstp;
 	struct rpc_clnt *clnt = task->tk_client;
 	struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;
 
@@ -707,6 +708,7 @@ call_timeout(struct rpc_task *task)
 		goto retry;
 	}
 	to->to_retries = clnt->cl_timeout.to_retries;
+	rpc_set_timeo(&clnt->cl_rtt, req->rq_ntrans - 1);
 
 	dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
 	if (clnt->cl_softrtry) {

I tested your patch. Here is some documentation of my testing: I added
a printk to the code, mounted an NFS volume with a retrans=8 option,
and did an ls on a directory with 100,000 entries in it. This takes a
while, and leaves ample time for the system to develop a good RTT
estimate and for one to interfere with client/server connectivity so as
to induce retransmits with exponential backoff:

01 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 0, rto = 4
02 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 0, rto = 4
...
03 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 0, rto = 4
04 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 2, rto = 16
05 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 3, rto = 32
06 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 4, rto = 64
07 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 5, rto = 128
08 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 6, rto = 256
09 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 7, rto = 512
10 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 0, ntrans = 8, rto = 1024
11 nfs: server 172.18.192.11 not responding, timed out
12 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 8, ntrans = 0, rto = 1024
13 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 8, ntrans = 1, rto = 2048
14 DO_XPRT_TRANSMIT: rtt = 4, ntimeo = 8, ntrans = 2, rto = 4096

The numbers on the left are line numbers I manually inserted for
illustration; they were not produced by the printk. The '...' indicates
that lots of lines, identical to the preceding, were removed. There is
one line per RTO calculation (i.e. per transmit).

Lines 1-2 show RTO calculations for RPC transactions that saw no loss.
'rtt' is the value returned by rpc_calc_rto(), and represents the
estimated round trip time plus a constant factor times its mean
deviation. 'ntimeo' is the ntimeouts value in the client structure, set
to the number of transmits minus one for the last terminated
(successful or otherwise) request. 'ntrans' is effectively the number
of retransmits of the request in question. 'rto' is the calculated RTO:
rtt << (ntimeo + ntrans).

The server was disconnected at (before) line 3. Lines 4-10 show the
client exponentially backing off RTO. Line 11 represents an exhaustion
of retransmit attempts and an EIO returned to the application, ls (this
is a soft mount). Lines 12-14 show RTO calculations for a new request.
Note that the ntimeo on line 12 inherited its value from the ntrans
value of the request that terminated on line 10 (this kernel had my
above modification).

Brian
* Re: NFS/UDP slow read, lost fragments
  2003-09-26 21:07 ` Brian Mancuso
@ 2003-09-27  5:02 ` Robert L. Millner
  2003-10-14  1:20 ` Steve Dickson
  1 sibling, 0 replies; 12+ messages in thread
From: Robert L. Millner @ 2003-09-27 5:02 UTC (permalink / raw)
To: nfs

> The following patch sets up a scheme of the form described by Brian,
> in combination with another fix to lengthen the window of time during
> which we accept updates to the RTO estimate (Karn's algorithm states
> that the window closes once you retransmit, whereas our current
> algorithm closes the window once a request times out).

Still seeing an inordinate number of retransmits (25% of the reads are
duplicates) with 2.4.23-pre5 and the patch on my test case (copy a 70MB
file from the server). It's interesting to note that a dual P4 Xeon
with an e1000 attached to the same Foundry switch as the Netapp also
has a similar retransmit rate (20%). If the same test is run against a
different Netapp (running 6.3.1R1), there is still a large number of
retransmits, but down to 17%.

Examining this problem further: if the server is Linux (stock Red Hat
7.3, 2.4.20-18.7smp, dual Xeon 2.4GHz), then there are no retransmits
and the 70MB copy takes its appropriate 6.4 sec with the client running
either the stock Red Hat kernel or a patched 2.4.23-pre5. If the server
is a Sun 880 running Solaris 8, then there are no retransmits and the
70MB copy takes around 6.5 sec.

If the Netapp is placed under a heavy load (CPU at between 65% and
100%, which traditionally kills Netapp performance), the
2.4.23-pre5+patch client has roughly the same percentage of retransmits
as it does against an otherwise quiescent Netapp; however, the test
takes 470 seconds instead of 550 seconds.

As a sanity check against transient network problems, the problem still
consistently goes away if the client boots 2.2.19 and consistently
repeats if the client boots 2.4.23-pre5+patch.

I guess the next step is to go over packet dumps from both the client
and another host on a mirrored network port to see what's interesting
about the fragments being dropped. That may point to why a heavily
loaded Netapp performs this test faster than a quiescent one.

Rob
* Re: NFS/UDP slow read, lost fragments
  2003-09-26 21:07 ` Brian Mancuso
  2003-09-27  5:02 ` Robert L. Millner
@ 2003-10-14  1:20 ` Steve Dickson
  2003-10-14 14:52 ` Trond Myklebust
  1 sibling, 1 reply; 12+ messages in thread
From: Steve Dickson @ 2003-10-14 1:20 UTC (permalink / raw)
To: Brian Mancuso; +Cc: Trond Myklebust, nfs

Brian Mancuso wrote:
> Here is a patch (against linux-2.4.23-pre5 with your patch) that
> implements this (you might know of a cleaner way of doing this..):

I've done some testing with both of these patches and here is what I've
found. The test consisted of 10 client threads simultaneously reading
10 different 100MB files on the same filesystem. The machines were dual
x86 boxes on a private 100bt network. I used the linux-2.4.23-pre7
kernel with both Trond's and Brian's patches. The test was done with
hard and soft mounts using v3 over UDP. Here are the results:

linux-2.4.23-pre7
Soft mount:
  EIO  calls   retrans  rate
  0    128042  2105     6.20Mbps to 6.08Mbps
  0    128040  1916     6.28Mbps to 6.09Mbps
  0    128044  1936     6.37Mbps to 6.12Mbps
Hard mount:
       128048  1995     6.29Mbps to 6.09Mbps
       128033  2075     6.21Mbps to 6.07Mbps
       128037  2015     6.23Mbps to 6.10Mbps

with Trond's patch
Soft mount:
  EIO  calls   retrans  rate
  0    128038  1954     6.25Mbps to 6.12Mbps
  0    128039  1789     6.32Mbps to 6.14Mbps
  0    128042  1853     6.25Mbps to 6.13Mbps
Hard mount:
       128039  1880     6.28Mbps to 6.11Mbps
       128042  2019     6.29Mbps to 6.12Mbps
       128042  1883     6.32Mbps to 6.14Mbps

with Trond's and Brian's patches
Soft mount:
  EIO  calls   retrans  rate
  0    128036  1943     6.31Mbps to 6.14Mbps
  0    128042  1802     6.35Mbps to 6.19Mbps
  0    128047  1782     6.35Mbps to 6.14Mbps
Hard mount:
       128042  1953     6.28Mbps to 6.10Mbps
       128038  1752     6.48Mbps to 6.17Mbps
       128034  1978     6.26Mbps to 6.11Mbps

As you can see, the patches do help a little bit, but not as much as I
would expect. Maybe it was the type of tests run, or maybe I didn't run
the tests long enough (each test ran for about ~2 mins), or maybe my
expectations are too high... but it just doesn't seem that these
patches really help that much in bringing down the retransmissions.

SteveD.
* Re: NFS/UDP slow read, lost fragments
  2003-10-14  1:20 ` Steve Dickson
@ 2003-10-14 14:52 ` Trond Myklebust
  2003-10-15  3:17 ` Steve Dickson
  0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2003-10-14 14:52 UTC (permalink / raw)
To: Steve Dickson; +Cc: Brian Mancuso, nfs

>>>>> " " == Steve Dickson <SteveD@RedHat.com> writes:

> As you can see the patches does help a little bit but but not
> as much as I would expect.... Maybe it was the type of testes
> ran or maybe I didn't run the tests long enough (each test ran
> for about ~2mins) or maybe my expectations are too high... But
> it just doesn't seem that these patches really help that much
> in bringing down the retransmissions...

Thanks for putting numbers to this Steve! Could you try using the
slight variation in

http://www.fys.uio.no/~trondmy/src/2.4.23-pre7/linux-2.4.23-03-fix_retrans.dif

As you can see, that has a slight change to rpc_set_timeo(): if it sees
that several retransmissions have occurred for a given request, then it
keeps the value of rtt->ntimeouts high for a couple of extra iterations
in order to allow the new value of the timeout to converge.

The reason for doing this is that if the round trip time changes
suddenly, then it will take ~8 measurements before you converge on the
new value (due to the weighting done in the algorithm).

Cheers,
  Trond
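A rough illustration of the ~8-measurement convergence claim, assuming the estimator uses a Jacobson-style 1/8 gain (that gain is an assumption for illustration; the actual weighting in rpc_update_rtt() may differ):

```python
def update(srtt, measurement, gain=0.125):
    # exponentially weighted moving average: each new sample moves the
    # smoothed RTT only 1/8 of the way toward the measured value
    return srtt + gain * (measurement - srtt)

srtt = 10.0                     # old steady-state RTT estimate
for i in range(1, 9):
    srtt = update(srtt, 80.0)   # real RTT suddenly jumps to 80
    print(f"after sample {i}: srtt = {srtt:.1f}")
```

After 8 samples the estimate has covered only roughly two thirds of the jump, which is why the variant patch holds ntimeouts high for a few extra iterations rather than dropping it the moment a request completes.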
* Re: NFS/UDP slow read, lost fragments
  2003-10-14 14:52 ` Trond Myklebust
@ 2003-10-15  3:17   ` Steve Dickson
  2003-10-15  3:27     ` Trond Myklebust
  0 siblings, 1 reply; 12+ messages in thread
From: Steve Dickson @ 2003-10-15 3:17 UTC (permalink / raw)
To: trond.myklebust; +Cc: nfs

Trond Myklebust wrote:
> Could you try using the slight variation in

It didn't seem to help much... here are the numbers:

linux-2.4.23-pre7

  Soft mount:
    EIO  calls   retrans  rate
    0    128042  2105     6.20Mbps to 6.08Mbps
    0    128040  1916     6.28Mbps to 6.09Mbps
    0    128044  1936     6.37Mbps to 6.12Mbps
  Hard mount:
         128048  1995     6.29Mbps to 6.09Mbps
         128033  2075     6.21Mbps to 6.07Mbps
         128037  2015     6.23Mbps to 6.10Mbps

with Trond's original patch

  Soft mount:
    EIO  calls   retrans  rate
    0    128043  1817     6.36Mbps to 6.17Mbps
    0    128036  1630     6.38Mbps to 6.21Mbps
    0    128038  1848     6.35Mbps to 6.18Mbps
  Hard mount:
         128043  1706     6.30Mbps to 6.22Mbps
         128041  1778     6.33Mbps to 6.18Mbps
         128041  1891     6.31Mbps to 6.16Mbps

with Trond's original patch w/ Brian's patch

  Soft mount:
    EIO  calls   retrans  rate
    0    128039  1626     6.45Mbps to 6.22Mbps
    0    128043  1896     6.31Mbps to 6.20Mbps
    0    128036  1948     6.32Mbps to 6.15Mbps
  Hard mount:
         128034  1792     6.41Mbps to 6.17Mbps
         128039  1823     6.29Mbps to 6.17Mbps
         128043  1903     6.42Mbps to 6.18Mbps

with Trond's new patch

  Soft mount:
    EIO  calls   retrans  rate
         128041  1954     6.33Mbps to 6.11Mbps
         128040  1897     6.31Mbps to 6.16Mbps
         128039  1758     6.26Mbps to 6.17Mbps
  Hard mount:
         128043  1808     6.40Mbps to 6.19Mbps
         128033  1671     6.37Mbps to 6.20Mbps
         128042  1852     6.37Mbps to 6.17Mbps

Again, this is 10 threads reading ten 100MB files from the same
filesystem.

SteveD.
* Re: NFS/UDP slow read, lost fragments
  2003-10-15  3:17 ` Steve Dickson
@ 2003-10-15  3:27   ` Trond Myklebust
  0 siblings, 0 replies; 12+ messages in thread
From: Trond Myklebust @ 2003-10-15 3:27 UTC (permalink / raw)
To: Steve Dickson; +Cc: NFS maillist

>>>>> " " == Steve Dickson <SteveD@RedHat.com> writes:

    > Trond Myklebust wrote:
    >> Could you try using the slight variation in
    >
    > It didn't seem to help much... here are the numbers...

Oh well, a 1.5% retransmission rate is in any case not a major problem.
It is more than we should see given that we are setting a timeout of
> 4 standard deviations, but it is consistent with what I see in my own
tests...

Cheers,
  Trond
* RE: NFS/UDP slow read, lost fragments
@ 2003-09-25 20:11 Lever, Charles
0 siblings, 0 replies; 12+ messages in thread
From: Lever, Charles @ 2003-09-25 20:11 UTC (permalink / raw)
To: Robert L. Millner; +Cc: nfs
hi robert-
if packets are being lost in server replies, then i
don't think this is strictly a client problem.
the newer clients may be more likely to overload a server
or network by sending more requests than can be buffered.
look into network cleanliness, and contact your local
NetApp SE for help in getting diagnostic info from your
filer.
especially because this problem goes away with TCP, i
would suspect a networking problem first.
> -----Original Message-----
> From: Robert L. Millner [mailto:rmillner@transmeta.com]
> Sent: Thursday, September 25, 2003 2:00 PM
> To: nfs@lists.sourceforge.net
> Subject: [NFS] NFS/UDP slow read, lost fragments
>
> Hello,
>
> The problem I am seeing is similar to what was posted by Larry
> Sendlosky on Jun 27, "2.4.20-pre3 -> 2.4.21 : nfs client read
> performance broken", though I have not done as thorough a drill-down
> into the nature of the problem.
>
> Somewhere between 2.4.19 and 2.4.20, NFS/UDP read performance began
> to suck because of a large number of request retransmits. From
> tcpdump, the retransmits are for read transactions which return data
> in a reasonable time frame but are missing one or more fragments of
> the return packet.
>
> The client is a PIII 550, dual CPU, 1GB RAM, eepro100 (tested with
> both the e100 and eepro drivers) running a variety of kernels from
> the stock Red Hat 7.3 (2.4.20-20.7, which exhibits the problem), and
> from stock 2.4.18 through 2.4.22. 2.4.20 and above exhibit this
> problem.
>
> The server is a Network Appliance F820 running ONTAP 6.2.2. Tests
> are conducted when the F820 is not under a noticeable load.
>
> From the difference in behavior between kernel revisions and by
> tcpdump, it is believed that the fragments are transmitted by the
> server.
>
> The timings for different IO sizes for NFSv3 reads of a 70MB file:
>
>   Configuration    Seconds Real Time
>   -------------------------------------------
>   UDP, 32k         1080
>   UDP, 8k           420
>   UDP, 4k           210
>   UDP, 1k            40
>
>   TCP, 32k            6.4
>
> The NFSv2/UDP timings for 8k, 4k and 1k are almost identical to the
> NFSv3/UDP timings.
>
> The same test with 2.4.18 yields read times of around 6.3 seconds
> for 32k and 8k, NFSv2 and NFSv3 over UDP.
>
> Setting rmem_max, rmem_default, wmem_max and wmem_default to either
> 524284 or 262142 makes no difference.
>
> Setting netdev_max_backlog to 1200 (from 300) makes no difference.
>
> Setting ipfrag_high_thresh to up to 4194304 makes no difference.
>
> We have a mix of clients and servers, not all of which support
> NFS/TCP yet, so we can't globally set tcp in the automounter maps
> for our netapp mounts or as "localoptions" in the autofs init
> script. The mount patch submitted by Steve Dickson on Aug 6, "NFS
> Mount Patch: Making NFS over TCP the default" is probably the
> immediate workaround that will at least make the mounts that really
> matter to us work well again. I'll test that next.
>
> Is this a known problem? Is there a patch already out there or in
> the works that fixes this? What other data would help drill into
> this problem?
>
> Rob
-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
end of thread, other threads:[~2003-10-15  3:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-09-25 17:59 NFS/UDP slow read, lost fragments Robert L. Millner
2003-09-25 20:22 ` Brian Mancuso
2003-09-25 20:33   ` brianm
2003-09-25 21:44     ` Brian Mancuso
2003-09-25 23:31       ` Trond Myklebust
2003-09-26 21:07         ` Brian Mancuso
2003-09-27  5:02           ` Robert L. Millner
2003-10-14  1:20             ` Steve Dickson
2003-10-14 14:52               ` Trond Myklebust
2003-10-15  3:17                 ` Steve Dickson
2003-10-15  3:27                   ` Trond Myklebust
2003-09-25 20:11 Lever, Charles