* NFSD Flow Control Using the TCP Transport
@ 2003-03-19 15:05 Steve Dickson
  2003-03-20  4:24 ` MINOURA Makoto
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Dickson @ 2003-03-19 15:05 UTC (permalink / raw)
  To: nfs

Hello,

There seem to be some issues (probably known) with flow control
over TCP connections (on an SMP machine) to NFSD. Unfortunately,
the fstress benchmark brings these issues out fairly nicely :-(
This is occurring in a 2.4.20 kernel.

When fstress starts its stress tests, svc_tcp_sendto() immediately
starts failing with -EAGAINs. Initially, this caused an oops because
svc_delete_socket() was being called twice for the same socket [which
was easily fixed by checking for the SK_DEAD bit in svsk->sk_flags],
but now the tests just fail.

The problem seems to stem from the fact that the queued memory in
the TCP send buffer (i.e. sk->wmem_queued) is not being released
(i.e. tcp_wspace(sk) becomes negative and never recovers).

Here is what appears to be happening:
Fstress opens one TCP connection and then starts sending multiple
nfs ops with different fhandles. The problems start when an nfs op
with a large response (like a read) gets 'stuck' in the nfs code for
a few microseconds, and in the meantime other nfs ops with smaller
responses are being processed. With every smaller response, the
sk->wmem_queued value is incremented. Now when the 'stuck' nfs read
tries to send its response, the send buffer is full (i.e.
tcp_memory_free(sk) in tcp_sendmsg() fails), and after a 30 second
sleep (in tcp_sendmsg()) -EAGAIN is returned and the show is over.....
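
For reference, the two helpers involved compute roughly the following
(paraphrased from memory of the 2.4.x <net/tcp.h>, not verbatim; check
your tree for the exact definitions):

/* Paraphrase: write space is the send buffer size minus the bytes
 * already queued for transmit, and tcp_sendmsg() only makes progress
 * while some space remains free. */
#define tcp_wspace(sk)      ((sk)->sndbuf - (sk)->wmem_queued)
#define tcp_memory_free(sk) ((sk)->wmem_queued < (sk)->sndbuf)

So once sk->wmem_queued climbs past sk->sndbuf and is never released,
tcp_wspace(sk) stays negative and every later reply blocks in
tcp_sendmsg().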

I _guess_ what is supposed to happen is that the queued memory will be
freed (or reclaimed) when a socket buffer is freed (via kfree_skb()),
which in turn causes the threads waiting for memory (i.e. sleeping
in tcp_sendmsg()) to be woken up via a call to sk->write_space().
But this does not seem to be happening, even when the smaller
replies are processed....
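
If I read the 2.4 TCP code correctly, that wakeup only happens along a
path roughly like this (again paraphrased, not verbatim):

/* Paraphrase: when ACKs shrink the write queue, TCP only calls
 * sk->write_space() if a writer has announced it is waiting for space
 * by setting SOCK_NOSPACE on the socket. */
static __inline__ void tcp_check_space(struct sock *sk)
{
	if (sk->socket && test_bit(SOCK_NOSPACE, &sk->socket->flags))
		tcp_new_space(sk);	/* ends up calling sk->write_space(sk) */
}

so unless SOCK_NOSPACE is set somewhere, no wakeup ever arrives.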

Can anyone shed some light on what the heck is going on here, and
whether there are any patches, solutions, or ideas addressing this
problem?

TIA,

SteveD.





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-19 15:05 NFSD Flow Control Using the TCP Transport Steve Dickson
@ 2003-03-20  4:24 ` MINOURA Makoto
  2003-03-20  4:43   ` MINOURA Makoto
  2003-03-21 20:38   ` Steve Dickson
  0 siblings, 2 replies; 9+ messages in thread
From: MINOURA Makoto @ 2003-03-20  4:24 UTC (permalink / raw)
  To: Steve Dickson; +Cc: nfs, taka, yamamoto

[-- Attachment #1: Type: text/plain, Size: 2144 bytes --]


|> In <3E78872B.5020702@RedHat.com>
|>  Steve Dickson <SteveD@RedHat.com> wrote:

> There seem to be some issues (probably known) with flow control
> over TCP connections (on an SMP machine) to NFSD. Unfortunately,
> the fstress benchmark brings these issues out fairly nicely :-(
> This is occurring in a 2.4.20 kernel.

We found the same problem a few weeks ago and already
addressed it.


> The problem seems to stem from the fact that the queued memory in
> the TCP send buffer (i.e. sk->wmem_queued) is not being released
> (i.e. tcp_wspace(sk) becomes negative and never recovers).

First, sock_wspace() does not return the free space we actually
expect; it only counts the segments currently in flight.

Second, the TCP write_space callback is called only when the
SOCK_NOSPACE flag is set.
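
For the first point, sock_wspace() computes roughly the following
(paraphrased from 2.4 net/core/sock.c, not verbatim), which is not the
same quantity as the tcp_wspace() Steve mentioned:

/* Paraphrase: sock_wspace() subtracts wmem_alloc (skbs actually
 * allocated, i.e. segments still in flight), not wmem_queued (all
 * data queued for sending), so for TCP it overstates the room left
 * for further nfsd replies. */
int sock_wspace(struct sock *sk)
{
	int amt = 0;

	if (!(sk->shutdown & SEND_SHUTDOWN)) {
		amt = sk->sndbuf - atomic_read(&sk->wmem_alloc);
		if (amt < 0)
			amt = 0;
	}
	return amt;
}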

These two problems should be solved by the attached patch.
Could someone who knows the network stack tell us what impact it
would have on other parts of the networking code if sock_wspace()
itself were changed to behave like svc_sock_wspace() in the patch?


> with -EAGAINs. Initially, this caused an oops because svc_delete_socket()
> was being called twice for the same socket [which was easily fixed by
> checking for the SK_DEAD bit in svsk->sk_flags], but now the tests
> just fail.

The current implementation closes the socket on an EAGAIN/partial
write, but with some delay.  During this delay, queued messages are
unintentionally sent (when the NFS traffic is busy).  We addressed
this problem by removing the delay (closing the socket immediately),
but sending the remaining data correctly would be better.
Since we use a backported version of the 2.5.x zerocopy NFSd, our
patch for this cannot be applied to 2.4.x without some modifications.

We think the socket lock mechanism (sk_sem) should also be backported
to serialize the output.  Now that we have removed the MSG_DONTWAIT
flag, the TCP layer might sleep with the socket lock released, and
messages from other nfsd threads might get interleaved.  Alternatively,
some other locking protocol in the TCP (or socket) layer should be
introduced to prevent the interleaving.
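
As a rough illustration of what we mean (a hypothetical sketch only;
the field sk_sendsem is invented here and does not exist in 2.4, it
stands for whatever backported per-socket lock would be used):

/* Hypothetical: serialize whole-record sends from the nfsd threads
 * that share one TCP socket, so replies cannot be interleaved even
 * if the TCP layer sleeps in the middle of a send. */
static int
svc_tcp_send_record(struct svc_sock *svsk, struct msghdr *msg, int len)
{
	int ret;

	down(&svsk->sk_sendsem);	/* one complete record at a time */
	ret = sock_sendmsg(svsk->sk_sock, msg, len);
	up(&svsk->sk_sendsem);

	return ret;
}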

-- 
Minoura Makoto <minoura@valinux.co.jp>
Engineering Dept., VA Linux Systems Japan

[-- Attachment #2: Type: application/octet-stream, Size: 1646 bytes --]

diff -urpN linux-2.4.20.orig/net/sunrpc/svcsock.c linux-2.4.20/net/sunrpc/svcsock.c
--- linux-2.4.20.orig/net/sunrpc/svcsock.c	Fri Nov 29 08:53:16 2002
+++ linux-2.4.20/net/sunrpc/svcsock.c	Thu Mar 20 11:49:41 2003
@@ -42,6 +42,8 @@
 #include <linux/sunrpc/svcsock.h>
 #include <linux/sunrpc/stats.h>
 
+#include <net/tcp.h>
+
 /* SMP locking strategy:
  *
  * 	svc_serv->sv_lock protects most stuff for that service.
@@ -106,6 +108,19 @@ svc_release_skb(struct svc_rqst *rqstp)
 	skb_free_datagram(rqstp->rq_sock->sk_sk, skb);
 }
 
+static int
+svc_sock_wspace(struct svc_sock *svsk)
+{
+	int wspace;
+
+	if (svsk->sk_sock->type == SOCK_STREAM)
+		wspace = tcp_wspace(svsk->sk_sk);
+	else
+		wspace = sock_wspace(svsk->sk_sk);
+
+	return wspace;
+}
+
 /*
  * Queue up a socket with data pending. If there are idle nfsd
  * processes, wake 'em up.
@@ -134,16 +149,18 @@ svc_sock_enqueue(struct svc_sock *svsk)
 		goto out_unlock;
 	}
 
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 	if (((svsk->sk_reserved + serv->sv_bufsz)*2
-	     > sock_wspace(svsk->sk_sk))
+	     > svc_sock_wspace(svsk))
 	    && !test_bit(SK_CLOSE, &svsk->sk_flags)
 	    && !test_bit(SK_CONN, &svsk->sk_flags)) {
 		/* Don't enqueue while not enough space for reply */
 		dprintk("svc: socket %p  no space, %d*2 > %ld, not enqueued\n",
 			svsk->sk_sk, svsk->sk_reserved+serv->sv_bufsz,
-			sock_wspace(svsk->sk_sk));
+			svc_sock_wspace(svsk));
 		goto out_unlock;
 	}
+	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 
 	/* Mark socket as busy. It will remain in this state until the
 	 * server has processed all pending data and put the socket back

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-20  4:24 ` MINOURA Makoto
@ 2003-03-20  4:43   ` MINOURA Makoto
  2003-03-21 20:38   ` Steve Dickson
  1 sibling, 0 replies; 9+ messages in thread
From: MINOURA Makoto @ 2003-03-20  4:43 UTC (permalink / raw)
  To: Steve Dickson; +Cc: nfs, taka, yamamoto


|> In <20030320042454.496801DA3E6@brer.local.valinux.co.jp>
|>  minoura@valinux.co.jp (MINOURA Makoto) wrote:

> The current implementation closes the socket on an EAGAIN/partial
> write, but with some delay.  During this delay, queued messages are
> unintentionally sent (when the NFS traffic is busy).  We addressed
> this problem by removing the delay (closing the socket immediately),
> but sending the remaining data correctly would be better.

Forgot to note: I suppose the cause of the test failure is that
another nfsd thread sent something on the same socket before the
socket was actually closed.  This breaks the record-mark-separated
message stream, and the peer (client) would be confused.
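
To spell out why that breaks things: RPC over TCP prefixes each record
fragment with a 4-byte big-endian marker (RFC 1831), roughly

/* Record marking: top bit = last-fragment flag, low 31 bits = length. */
u32 marker = htonl(0x80000000 | (reclen & 0x7fffffff));

so if another thread's bytes land between one reply's marker and its
data, the client can no longer find the record boundaries.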

To address this,

> We think the socket lock mechanism (sk_sem) should also be
> backported to serialize the output.

should be enough as well.

-- 
Minoura Makoto <minoura@valinux.co.jp>
Engineering Dept., VA Linux Systems Japan



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-20  4:24 ` MINOURA Makoto
  2003-03-20  4:43   ` MINOURA Makoto
@ 2003-03-21 20:38   ` Steve Dickson
  2003-03-24  2:51     ` MINOURA Makoto
  1 sibling, 1 reply; 9+ messages in thread
From: Steve Dickson @ 2003-03-21 20:38 UTC (permalink / raw)
  To: MINOURA Makoto; +Cc: nfs, taka, yamamoto



MINOURA Makoto wrote:

>The current implementation closes the socket on an EAGAIN/partial
>write, but with some delay.  During this delay, queued messages are
>unintentionally sent (when the NFS traffic is busy).  We addressed
>this problem by removing the delay (closing the socket immediately),
>but sending the remaining data correctly would be better.
>
So if I understand what you're saying, EAGAINs or partial writes are
interpreted as fatal errors. This confuses me. EAGAINs and partial
writes are flow control issues, not fatal errors. Just like on the
client side, shouldn't the thread sleep until there is room? Closing
down the socket seems a bit drastic... Or am I missing something?

SteveD.




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-21 20:38   ` Steve Dickson
@ 2003-03-24  2:51     ` MINOURA Makoto
  2003-03-24 15:07       ` Steve Dickson
  0 siblings, 1 reply; 9+ messages in thread
From: MINOURA Makoto @ 2003-03-24  2:51 UTC (permalink / raw)
  To: Steve Dickson; +Cc: nfs, taka, yamamoto


|> In <3E7B7853.4020605@RedHat.com>
|>  Steve Dickson <SteveD@redhat.com> wrote:

> So if I understand what you're saying, EAGAINs or partial writes are
> interpreted as fatal errors. This confuses me. EAGAINs and partial
> writes are flow control issues, not fatal errors. Just like on the
> client side, shouldn't the thread sleep until there is room? Closing
> down the socket seems a bit drastic... Or am I missing something?

In general you are right, but in the Linux kNFSd something like flow
control is done in the RPC layer.

In kNFSd multiple nfsd threads share a single socket, and to avoid
interleaving their data a sendto call must be atomic.  To guarantee
this, the threads wait in the RPC layer until there is enough TCP
write space.  Under that assumption, an EAGAIN can be interpreted as
a sign that something is genuinely wrong.
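
Concretely, the check in svc_sock_enqueue() (see the patch in my
earlier mail) amounts to:

/* Restated from the svc_sock_enqueue() hunk: only hand the socket to
 * an nfsd thread if the available write space exceeds twice the sum
 * of the space already reserved for pending replies and one more
 * worst-case reply. */
if ((svsk->sk_reserved + serv->sv_bufsz) * 2 > svc_sock_wspace(svsk)) {
	/* Not enough room; leave the socket un-enqueued and let
	 * svc_write_space() re-enqueue it once TCP frees write space
	 * (SOCK_NOSPACE is set so that callback actually fires). */
	goto out_unlock;
}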

But since the free space calculation was wrong, this assumption was
broken.  The patch attached in my previous mail corrects this.

-- 
Minoura Makoto <minoura@valinux.co.jp>
Engineering Dept., VA Linux Systems Japan



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-24  2:51     ` MINOURA Makoto
@ 2003-03-24 15:07       ` Steve Dickson
  2003-03-25  5:32         ` MINOURA Makoto
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Dickson @ 2003-03-24 15:07 UTC (permalink / raw)
  To: MINOURA Makoto; +Cc: nfs, taka, yamamoto

Yes, I understand it's an RPC issue, not an NFS issue.
I guess I should have made it clear that I think
the _RPC_ layer should be able to handle
EAGAINs (using the TCP transport).

The patch you provided does indeed stop the RPC
layer from returning EAGAINs, but it also stops the flow
of traffic, because the sockets are no longer
being enqueued, which in turn means there are no
threads sleeping/waiting (in tcp_sendmsg) when
the svc_write_space routine is called. So once
this flow condition hits, the replies stop, which
means the incoming requests also stop, and all
the nfsds end up sitting in svc_recv() waiting
for something to do....

Obviously, there is another issue in play here
(which I'm currently debugging), since I do think your
patch makes sense....

SteveD.

MINOURA Makoto wrote:

>|> In <3E7B7853.4020605@RedHat.com>
>|>  Steve Dickson <SteveD@redhat.com> wrote:
>
>>So if I understand what you're saying, EAGAINs or partial writes are
>>interpreted as fatal errors. This confuses me. EAGAINs and partial
>>writes are flow control issues, not fatal errors. Just like on the
>>client side, shouldn't the thread sleep until there is room? Closing
>>down the socket seems a bit drastic... Or am I missing something?
>
>In general you are right, but in the Linux kNFSd something like flow
>control is done in the RPC layer.
>
>In kNFSd multiple nfsd threads share a single socket, and to avoid
>interleaving their data a sendto call must be atomic.  To guarantee
>this, the threads wait in the RPC layer until there is enough TCP
>write space.  Under that assumption, an EAGAIN can be interpreted as
>a sign that something is genuinely wrong.
>
>But since the free space calculation was wrong, this assumption was
>broken.  The patch attached in my previous mail corrects this.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-24 15:07       ` Steve Dickson
@ 2003-03-25  5:32         ` MINOURA Makoto
  2003-03-25 11:33           ` Steve Dickson
  0 siblings, 1 reply; 9+ messages in thread
From: MINOURA Makoto @ 2003-03-25  5:32 UTC (permalink / raw)
  To: Steve Dickson; +Cc: nfs, taka, yamamoto


|> In <3E7F1F1D.8090206@RedHat.com>
|>  Steve Dickson <SteveD@redhat.com> wrote:

> The patch you provided does indeed stop the RPC
> layer from returning EAGAINs, but it also stops the flow
> of traffic, because the sockets are no longer
> being enqueued, which in turn means there are no
> threads sleeping/waiting (in tcp_sendmsg) when
> the svc_write_space routine is called. So once
> this flow condition hits, the replies stop, which
> means the incoming requests also stop, and all
> the nfsds end up sitting in svc_recv() waiting
> for something to do....

I'm afraid that I do not understand exactly what you are
saying.

We do not start processing a request (enqueue the socket) until
there is enough write space for a worst-case single reply.  So we
should never have to sleep in tcp_sendmsg.  This is the original
intent of Linux 2.4.20, not behaviour specific to my patch.

-- 
Minoura Makoto <minoura@valinux.co.jp>
Engineering Dept., VA Linux Systems Japan



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-25  5:32         ` MINOURA Makoto
@ 2003-03-25 11:33           ` Steve Dickson
  2003-03-26  5:07             ` MINOURA Makoto
  0 siblings, 1 reply; 9+ messages in thread
From: Steve Dickson @ 2003-03-25 11:33 UTC (permalink / raw)
  To: MINOURA Makoto; +Cc: nfs, taka, yamamoto


MINOURA Makoto wrote:

>We do not start processing a request (enqueue the socket) until
>there is enough write space for a worst-case single reply.
>
Right.... This is exactly what is happening.  Three or four responses
are *not* being enqueued because there is not enough write space.  The
problem I'm seeing is that those sockets never get re-enqueued.

Your patch depends on the TCP layer calling svc_write_space()
[when write space is freed] to re-enqueue the sockets.
For some reason that is not happening in my case.....
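
For reference, that callback is roughly the following (paraphrased
from 2.4 net/sunrpc/svcsock.c, not verbatim):

/* Paraphrase: the write_space callback looks up the svc_sock from
 * sk->user_data and re-enqueues it so an nfsd thread can pick it up,
 * then wakes anyone sleeping on the socket. */
static void svc_write_space(struct sock *sk)
{
	struct svc_sock *svsk = (struct svc_sock *) sk->user_data;

	if (svsk)
		svc_sock_enqueue(svsk);

	if (sk->sleep && waitqueue_active(sk->sleep))
		wake_up_interruptible(sk->sleep);
}

so if TCP never invokes it, nothing ever re-enqueues the socket.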

SteveD.




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: NFSD Flow Control Using the TCP Transport
  2003-03-25 11:33           ` Steve Dickson
@ 2003-03-26  5:07             ` MINOURA Makoto
  0 siblings, 0 replies; 9+ messages in thread
From: MINOURA Makoto @ 2003-03-26  5:07 UTC (permalink / raw)
  To: Steve Dickson; +Cc: nfs, taka, yamamoto


|> In <3E803E7E.7040303@RedHat.com>
|>  Steve Dickson <SteveD@RedHat.com> wrote:

> Right.... This is exactly what is happening.  Three or four responses
> are *not* being enqueued because there is not enough write space.  The
> problem I'm seeing is that those sockets never get re-enqueued.

> Your patch depends on the TCP layer calling svc_write_space()
> [when write space is freed] to re-enqueue the sockets.
> For some reason that is not happening in my case.....

Hmmm, at least one possible reason is addressed by our patch: setting
SOCK_NOSPACE before waiting for more write space.

-- 
Minoura Makoto <minoura@valinux.co.jp>
Engineering Dept., VA Linux Systems Japan





^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2003-03-26  5:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-03-19 15:05 NFSD Flow Control Using the TCP Transport Steve Dickson
2003-03-20  4:24 ` MINOURA Makoto
2003-03-20  4:43   ` MINOURA Makoto
2003-03-21 20:38   ` Steve Dickson
2003-03-24  2:51     ` MINOURA Makoto
2003-03-24 15:07       ` Steve Dickson
2003-03-25  5:32         ` MINOURA Makoto
2003-03-25 11:33           ` Steve Dickson
2003-03-26  5:07             ` MINOURA Makoto
