linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Honoring SO_RCVLOWAT in proto_ops.poll methods
@ 2008-09-20 21:42 lkml
  2008-09-20 22:21 ` David Miller
  0 siblings, 1 reply; 20+ messages in thread
From: lkml @ 2008-09-20 21:42 UTC (permalink / raw)
  To: linux-kernel

Hello lkml,

I have a need for select/poll/epoll_wait to block on sockets which have
unread data sitting in the receive buffer with a quantity less than
specified via setsockopt() w/SO_RCVLOWAT, not less than one like the
current implementation.

Upon looking at the code in net/ipv4/tcp.c it doesn't look like this
would be difficult to support.  Something along the lines of changing
the function tcp_poll() of version 2.6.26.5 from doing this:

 394                 if ((tp->rcv_nxt != tp->copied_seq) &&
 395                     (tp->urg_seq != tp->copied_seq ||
 396                      tp->rcv_nxt != tp->copied_seq + 1 ||
 397                      sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
 398                         mask |= POLLIN | POLLRDNORM;

to this:

 394                 if ((tp->rcv_nxt != tp->copied_seq) &&
 395                     (tp->urg_seq != tp->copied_seq ||
 396                      tp->rcv_nxt > tp->copied_seq + sk->sk_rcvlowat ||
 397                      sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
 398                         mask |= POLLIN | POLLRDNORM;


I imagine it's similarly simple for the other socket types.  Am I
missing something?  Is there a technical reason why Linux only blocks in
read/recv but not the poll with respect to the SO_RCVLOWAT setting?


For those interested the application is basically doing:

1: Wait for input on active sockets with epoll_wait()
2: Recv() data off the eventful sockets with MSG_PEEK flag, parse contents
   looking for specific keyword and value from buffer.
3: If buffer is too short call setsockopt() with SO_RCVLOWAT parameter set
   greater than what recv() with MSG_PEEK returned.

Repeat @ 1 until finding what's needed or shutdown/error/timeout.

Once the value is parsed it's used as an address to locate (with a db
lookup) a path which would be a unix domain socket for passing the
socket descriptor through.  The unrelated process at the other end needs
to accept the descriptor and use it as if no other program mucked with it
hence the MSG_PEEK magic.

As-is I must basically tie up threads in blocked recv() calls after
increasing SO_RCVLOWAT if the first recv() peek falls short.  This
solution obviously doesn't scale well.

Thanks,
Vito Caputo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-09-20 21:42 Honoring SO_RCVLOWAT in proto_ops.poll methods lkml
@ 2008-09-20 22:21 ` David Miller
  2008-09-20 23:00   ` lkml
  0 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-09-20 22:21 UTC (permalink / raw)
  To: lkml; +Cc: linux-kernel

From: lkml@pengaru.com
Date: Sat, 20 Sep 2008 16:42:29 -0500

> I have a need for select/poll/epoll_wait to block on sockets which have
> unread data sitting in the receive buffer with a quantity less than
> specified via setsockopt() w/SO_RCVLOWAT, not less than one like the
> current implementation.

If BSD never provided this behavior, such a change is likely
to break applications.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-09-20 22:21 ` David Miller
@ 2008-09-20 23:00   ` lkml
  2008-09-21  9:24     ` lkml
  0 siblings, 1 reply; 20+ messages in thread
From: lkml @ 2008-09-20 23:00 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel

On Sat, Sep 20, 2008 at 03:21:40PM -0700, David Miller wrote:
> From: lkml@pengaru.com
> Date: Sat, 20 Sep 2008 16:42:29 -0500
> 
> > I have a need for select/poll/epoll_wait to block on sockets which have
> > unread data sitting in the receive buffer with a quantity less than
> > specified via setsockopt() w/SO_RCVLOWAT, not less than one like the
> > current implementation.
> 
> If BSD never provided this behavior, such a change is likely
> to break applications.

I did a quick look through FreeBSD source on fxr and found this macro:
http://fxr.watson.org/fxr/source/sys/socketvar.h#L197

Which is used by the generic socket poll here:
http://fxr.watson.org/fxr/source/kern/uipc_socket.c#L2731

You can look throughout that listing and so_rcv.sb_lowat is always what
is compared against for determining rcv buf readability.

You might also want to look at the socket(7) man page which implies that
what Linux currently does is exceptional & incorrect:

       SO_RCVLOWAT and SO_SNDLOWAT
              Specify the minimum number of bytes in  the  buffer  until
              the  socket  layer  will  pass  the  data  to the protocol
              (SO_SNDLOWAT) or  the  user  on  receiving  (SO_RCVLOWAT).
              These two values are initialised to 1.  SO_SNDLOWAT is not
              changeable on Linux (setsockopt fails with the error  ENO-
              PROTOOPT).   SO_RCVLOWAT  is  changeable  only since Linux
              2.4.  The select(2) and poll(2) system calls currently  do
              not  respect  the SO_RCVLOWAT setting on Linux, and mark a
              socket readable when even a single byte of data is  avail-
              able.   A subsequent read from the socket will block until
              SO_RCVLOWAT bytes are available.


Regards,
Vito Caputo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-09-20 23:00   ` lkml
@ 2008-09-21  9:24     ` lkml
  2008-09-21 14:18       ` Alan Cox
  0 siblings, 1 reply; 20+ messages in thread
From: lkml @ 2008-09-21  9:24 UTC (permalink / raw)
  To: linux-kernel

On Sat, Sep 20, 2008 at 06:00:46PM -0500, lkml@pengaru.com wrote:
> On Sat, Sep 20, 2008 at 03:21:40PM -0700, David Miller wrote:
> > From: lkml@pengaru.com
> > Date: Sat, 20 Sep 2008 16:42:29 -0500
> > 
> > > I have a need for select/poll/epoll_wait to block on sockets which have
> > > unread data sitting in the receive buffer with a quantity less than
> > > specified via setsockopt() w/SO_RCVLOWAT, not less than one like the
> > > current implementation.
> > 
> > If BSD never provided this behavior, such a change is likely
> > to break applications.
> 
> I did a quick look through FreeBSD source on fxr and found this macro:
> http://fxr.watson.org/fxr/source/sys/socketvar.h#L197
> 
> Which is used by the generic socket poll here:
> http://fxr.watson.org/fxr/source/kern/uipc_socket.c#L2731
> 
> You can look throughout that listing and so_rcv.sb_lowat is always what
> is compared against for determining rcv buf readability.
> 
> You might also want to look at the socket(7) man page which implies that
> what Linux currently does is exceptional & incorrect:
> 
>        SO_RCVLOWAT and SO_SNDLOWAT
>               Specify the minimum number of bytes in  the  buffer  until
>               the  socket  layer  will  pass  the  data  to the protocol
>               (SO_SNDLOWAT) or  the  user  on  receiving  (SO_RCVLOWAT).
>               These two values are initialised to 1.  SO_SNDLOWAT is not
>               changeable on Linux (setsockopt fails with the error  ENO-
>               PROTOOPT).   SO_RCVLOWAT  is  changeable  only since Linux
>               2.4.  The select(2) and poll(2) system calls currently  do
>               not  respect  the SO_RCVLOWAT setting on Linux, and mark a
>               socket readable when even a single byte of data is  avail-
>               able.   A subsequent read from the socket will block until
>               SO_RCVLOWAT bytes are available.
> 

I've been working on my application further and finally got around to
testing it with the assumption that poll won't block with regard to
SO_RCVLOWAT, and to my surprise even my recv() calls with MSG_PEEK flags
set are not blocking.  They block without MSG_PEEK, but not with.

Upon further investigation I find in tcp.c tcp_recvmsg() 2.6.26.5:

1306         target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

...snip...

1371                 if (copied >= target && !sk->sk_backlog.tail)
1372                         break;
1373 
1374                 if (copied) {
1375                         if (sk->sk_err ||
1376                             sk->sk_state == TCP_CLOSE ||
1377                             (sk->sk_shutdown & RCV_SHUTDOWN) ||
1378                             !timeo ||
1379                             signal_pending(current) ||
1380                             (flags & MSG_PEEK))
1381                                 break;
1382                 } else {


So line #1380 drops out without satisfying copied >= target if MSG_PEEK is
set, and if you look at the remainder of the function it's assuming that
it needs to cleanup buffers before waiting for more.  So fixing this guy
is likely not as trivial as fixing poll, since the rest of the function
has to be massaged to not try free things be in MSG_PEEK mode.

Once again, this deviates from FreeBSD behavior.

At this point, for my application to work on Linux without burning CPU like
mad... I basically have to sleep and poll the socket regularly to see if
more data has arrived with the tcp socket ioctl SIOCINQ. :(

Regards,
Vito Caputo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-09-21  9:24     ` lkml
@ 2008-09-21 14:18       ` Alan Cox
       [not found]         ` <20080921145134.GT2761@fc6222126.aspadmin.net>
  0 siblings, 1 reply; 20+ messages in thread
From: Alan Cox @ 2008-09-21 14:18 UTC (permalink / raw)
  To: lkml; +Cc: linux-kernel

> > You might also want to look at the socket(7) man page which implies that
> > what Linux currently does is exceptional & incorrect:

Actually we follow 1003.1g draft 6.4 which is about as close as there
ever was a spec for BSD sockets. We go beyond it actually -
SO_RCVLOWAT/SO_SNDLOWAT are in fact optional.

> SO_RCVLOWAT, and to my surprise even my recv() calls with MSG_PEEK flags
> set are not blocking.  They block without MSG_PEEK, but not with.

Correct and we've always done that intentionally.

> At this point, for my application to work on Linux without burning CPU like
> mad... I basically have to sleep and poll the socket regularly to see if
> more data has arrived with the tcp socket ioctl SIOCINQ. :(

What are you actually trying to do ? The usual way to handle urgent data
in those odd cases that use it is to select for an exception event.

Alan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
       [not found]         ` <20080921145134.GT2761@fc6222126.aspadmin.net>
@ 2008-09-21 20:13           ` Alan Cox
  2008-09-21 22:09             ` lkml
  2008-09-22 12:15             ` David Miller
  0 siblings, 2 replies; 20+ messages in thread
From: Alan Cox @ 2008-09-21 20:13 UTC (permalink / raw)
  To: vcaputo, linux-kernel

Thats a gloriously insane way of trying to do HTTP/1.1, and one I'm not
sure is actually viable in the real world because the TCP window may be
smaller than the number of bytes required to find a Host: header - so you
may simply not be able to receive it via MSG_PEEK. In particular mobile
phone gateways have a nasty habit of using very small windows.

> I just don't see a good solution for what I'm doing other than MSG_PEEK
> and SO_RCVLOWAT, any ideas?

I don't either, and while I don't agree that what you are doing for
HTTP/1.1 is remotely sane there are probably other cases this would be
both sane and useful which does suggest fixing it would be beneficial

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-09-21 20:13           ` Alan Cox
@ 2008-09-21 22:09             ` lkml
  2008-10-05 20:27               ` David Miller
  2008-09-22 12:15             ` David Miller
  1 sibling, 1 reply; 20+ messages in thread
From: lkml @ 2008-09-21 22:09 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

On Sun, Sep 21, 2008 at 09:13:37PM +0100, Alan Cox wrote:
> Thats a gloriously insane way of trying to do HTTP/1.1, and one I'm not
> sure is actually viable in the real world because the TCP window may be
> smaller than the number of bytes required to find a Host: header - so you
> may simply not be able to receive it via MSG_PEEK. In particular mobile
> phone gateways have a nasty habit of using very small windows.
> 
> > I just don't see a good solution for what I'm doing other than MSG_PEEK
> > and SO_RCVLOWAT, any ideas?
> 
> I don't either, and while I don't agree that what you are doing for
> HTTP/1.1 is remotely sane there are probably other cases this would be
> both sane and useful which does suggest fixing it would be beneficial

I somewhat agree that it's insane, but it should be possible and what I've
done already works most of the time.  With some additional special handling
of the missing Host: case what I have might be a viable general case
solution.

For now httpx just drops the connection when Host: isn't found as it's a
work in progress.  I wanted to get this out there largely for you guys to
see "Hey maybe there's a practical need for MSG_PEEK & SO_RCVLOWAT behaving
differently than it does".  It seems to have been effective in this
respect and I am hugely appreciative of your attention, time, and input.
 
As a proof of concept httpx already has some people scratching their heads
because name-based vhosting can be a nightmare with the venerable Apache.  A
solution like mine could alleviate alot of that.  With some work and a way
to sleep on peek with a rcvbuf < rcvlowat there might be something quite
interesting for the web hosting world.

Cheers,
Vito Caputo


--------------------------------------------------------------------------------
For those of you on the list:  I had replied to Alan directly but since
Alan CC'd the list in his response I've continued with that format.  For
those of you in the dark, httpx is my insane app using the MSG_PEEK and
SO_RCVLOWAT being referred to above, you can find it here:
http://serverkit.org/modules/httpx/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-09-21 20:13           ` Alan Cox
  2008-09-21 22:09             ` lkml
@ 2008-09-22 12:15             ` David Miller
  1 sibling, 0 replies; 20+ messages in thread
From: David Miller @ 2008-09-22 12:15 UTC (permalink / raw)
  To: alan; +Cc: vcaputo, linux-kernel

From: Alan Cox <alan@lxorguk.ukuu.org.uk>
Date: Sun, 21 Sep 2008 21:13:37 +0100

> Thats a gloriously insane way of trying to do HTTP/1.1, and one I'm not
> sure is actually viable in the real world because the TCP window may be
> smaller than the number of bytes required to find a Host: header - so you
> may simply not be able to receive it via MSG_PEEK. In particular mobile
> phone gateways have a nasty habit of using very small windows.
> 
> > I just don't see a good solution for what I'm doing other than MSG_PEEK
> > and SO_RCVLOWAT, any ideas?
> 
> I don't either, and while I don't agree that what you are doing for
> HTTP/1.1 is remotely sane there are probably other cases this would be
> both sane and useful which does suggest fixing it would be beneficial

I agree and I'll try to move this forward.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-09-21 22:09             ` lkml
@ 2008-10-05 20:27               ` David Miller
  2008-10-05 21:45                 ` swivel
  0 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-10-05 20:27 UTC (permalink / raw)
  To: lkml; +Cc: alan, linux-kernel


Give this patch a try:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ab341e..0e43875 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -384,13 +384,17 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 	/* Connected? */
 	if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+		int target = sock_rcvlowat(sk, 0, INT_MAX);
+
+		if (tp->urg_seq == tp->copied_seq &&
+		    !sock_flag(sk, SOCK_URGINLINE) &&
+		    tp->urg_data)
+			target--;
+
 		/* Potential race condition. If read of tp below will
 		 * escape above sk->sk_state, we can be illegally awaken
 		 * in SYN_* states. */
-		if ((tp->rcv_nxt != tp->copied_seq) &&
-		    (tp->urg_seq != tp->copied_seq ||
-		     tp->rcv_nxt != tp->copied_seq + 1 ||
-		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
+		if (target >= tp->rcv_nxt - tp->copied_seq)
 			mask |= POLLIN | POLLRDNORM;
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-05 20:27               ` David Miller
@ 2008-10-05 21:45                 ` swivel
  2008-10-05 22:30                   ` David Miller
  0 siblings, 1 reply; 20+ messages in thread
From: swivel @ 2008-10-05 21:45 UTC (permalink / raw)
  To: David Miller; +Cc: alan, linux-kernel

On Sun, Oct 05, 2008 at 01:27:22PM -0700, David Miller wrote:
> 
> Give this patch a try:
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1ab341e..0e43875 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -384,13 +384,17 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>  
>  	/* Connected? */
>  	if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> +		int target = sock_rcvlowat(sk, 0, INT_MAX);
> +
> +		if (tp->urg_seq == tp->copied_seq &&
> +		    !sock_flag(sk, SOCK_URGINLINE) &&
> +		    tp->urg_data)
> +			target--;
> +
>  		/* Potential race condition. If read of tp below will
>  		 * escape above sk->sk_state, we can be illegally awaken
>  		 * in SYN_* states. */
> -		if ((tp->rcv_nxt != tp->copied_seq) &&
> -		    (tp->urg_seq != tp->copied_seq ||
> -		     tp->rcv_nxt != tp->copied_seq + 1 ||
> -		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
> +		if (target >= tp->rcv_nxt - tp->copied_seq)
>  			mask |= POLLIN | POLLRDNORM;
>  
>  		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {


I will be testing this patch today.  At a glance it appears with this
patch we're still not taking rcvlowat into consideration in recv()
with MSG_PEEK flag set.  This should probably also be corrected, as
mentioned in the thread previously.

Regards,
Vito Caputo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-05 21:45                 ` swivel
@ 2008-10-05 22:30                   ` David Miller
  2008-10-06  5:17                     ` lkml
  2008-10-13  7:34                     ` David Miller
  0 siblings, 2 replies; 20+ messages in thread
From: David Miller @ 2008-10-05 22:30 UTC (permalink / raw)
  To: swivel; +Cc: alan, linux-kernel

From: swivel@shells.gnugeneration.com
Date: Sun, 5 Oct 2008 16:45:57 -0500

> I will be testing this patch today.  At a glance it appears with this
> patch we're still not taking rcvlowat into consideration in recv()
> with MSG_PEEK flag set.  This should probably also be corrected, as
> mentioned in the thread previously.

Yes, I remember you mentioning that.

I'll look into it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-05 22:30                   ` David Miller
@ 2008-10-06  5:17                     ` lkml
  2008-10-06 17:18                       ` David Miller
  2008-10-13  7:34                     ` David Miller
  1 sibling, 1 reply; 20+ messages in thread
From: lkml @ 2008-10-06  5:17 UTC (permalink / raw)
  To: David Miller; +Cc: alan, linux-kernel

On Sun, Oct 05, 2008 at 03:30:59PM -0700, David Miller wrote:
> From: swivel@shells.gnugeneration.com
> Date: Sun, 5 Oct 2008 16:45:57 -0500
> 
> > I will be testing this patch today.  At a glance it appears with this
> > patch we're still not taking rcvlowat into consideration in recv()
> > with MSG_PEEK flag set.  This should probably also be corrected, as
> > mentioned in the thread previously.
> 
> Yes, I remember you mentioning that.
> 
> I'll look into it.


Looks like the RCVLOWAT patch breaks the tcp poll logic in the normal
case.

I didn't have a chance to scrutinize the changes but with the patch
applied simple things like a telnet to 127.0.0.1:80 exit immediately
with "Connection closed by foreign host".

In strace of the above telnet failure I see recv() returning EAGAIN
before exiting.  Telnet expected a select() to block until data was
available and didn't expect recv() to find nothing available when the
select() reported there was something.  Select() was no longer behaving
properly on ipv4 tcp sockets.

Regards,
Vito Caputo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-06  5:17                     ` lkml
@ 2008-10-06 17:18                       ` David Miller
  2008-10-06 17:45                         ` David Miller
  0 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-10-06 17:18 UTC (permalink / raw)
  To: lkml; +Cc: alan, linux-kernel

From: lkml@pengaru.com
Date: Mon, 6 Oct 2008 00:17:18 -0500

> Looks like the RCVLOWAT patch breaks the tcp poll logic in the normal
> case.

Sorry, the condition was reversed, try this one instead:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ab341e..7d81a1e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -384,13 +384,17 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 	/* Connected? */
 	if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+		int target = sock_rcvlowat(sk, 0, INT_MAX);
+
+		if (tp->urg_seq == tp->copied_seq &&
+		    !sock_flag(sk, SOCK_URGINLINE) &&
+		    tp->urg_data)
+			target--;
+
 		/* Potential race condition. If read of tp below will
 		 * escape above sk->sk_state, we can be illegally awaken
 		 * in SYN_* states. */
-		if ((tp->rcv_nxt != tp->copied_seq) &&
-		    (tp->urg_seq != tp->copied_seq ||
-		     tp->rcv_nxt != tp->copied_seq + 1 ||
-		     sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
+		if (tp->rcv_nxt - tp->copied_seq >= target)
 			mask |= POLLIN | POLLRDNORM;
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-06 17:18                       ` David Miller
@ 2008-10-06 17:45                         ` David Miller
  0 siblings, 0 replies; 20+ messages in thread
From: David Miller @ 2008-10-06 17:45 UTC (permalink / raw)
  To: lkml; +Cc: alan, linux-kernel

From: David Miller <davem@davemloft.net>
Date: Mon, 06 Oct 2008 10:18:14 -0700 (PDT)

> From: lkml@pengaru.com
> Date: Mon, 6 Oct 2008 00:17:18 -0500
> 
> > Looks like the RCVLOWAT patch breaks the tcp poll logic in the normal
> > case.
> 
> Sorry, the condition was reversed, try this one instead:

FWIW, since this passed my own testing, I pushed this into
net-next-2.6

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-05 22:30                   ` David Miller
  2008-10-06  5:17                     ` lkml
@ 2008-10-13  7:34                     ` David Miller
  2008-10-13  8:32                       ` swivel
  1 sibling, 1 reply; 20+ messages in thread
From: David Miller @ 2008-10-13  7:34 UTC (permalink / raw)
  To: swivel; +Cc: alan, linux-kernel

From: David Miller <davem@davemloft.net>
Date: Sun, 05 Oct 2008 15:30:59 -0700 (PDT)

> From: swivel@shells.gnugeneration.com
> Date: Sun, 5 Oct 2008 16:45:57 -0500
> 
> > I will be testing this patch today.  At a glance it appears with this
> > patch we're still not taking rcvlowat into consideration in recv()
> > with MSG_PEEK flag set.  This should probably also be corrected, as
> > mentioned in the thread previously.
> 
> Yes, I remember you mentioning that.
> 
> I'll look into it.

Were you able to test my updated patch?

If it makes your application work, I might want to defer messing around
with MSG_PEEK semantics in recvmsg().

So I've been waiting for your test results before I proceed.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-13  7:34                     ` David Miller
@ 2008-10-13  8:32                       ` swivel
  2008-10-13  9:58                         ` David Miller
  0 siblings, 1 reply; 20+ messages in thread
From: swivel @ 2008-10-13  8:32 UTC (permalink / raw)
  To: David Miller; +Cc: alan, linux-kernel

On Mon, Oct 13, 2008 at 12:34:41AM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 05 Oct 2008 15:30:59 -0700 (PDT)
> 
> > From: swivel@shells.gnugeneration.com
> > Date: Sun, 5 Oct 2008 16:45:57 -0500
> > 
> > > I will be testing this patch today.  At a glance it appears with this
> > > patch we're still not taking rcvlowat into consideration in recv()
> > > with MSG_PEEK flag set.  This should probably also be corrected, as
> > > mentioned in the thread previously.
> > 
> > Yes, I remember you mentioning that.
> > 
> > I'll look into it.
> 
> Were you able to test my updated patch?
> 
> If it makes your application work, I might want to defer messing around
> with MSG_PEEK semantics in recvmsg().
> 
> So I've been waiting for your test results before I proceed.


Didn't test your latest patch after you mentioned that you had tested it
yourself successfully.

The application is still broken having recv() not behave normally with
the MSG_PEEK flag.  The app will spin calling recv() repeatedly once any
data has made it into the rcvbuf, because recv() will always immediately
return with the >0 # of bytes in the buffer.

I'm using the pseudo-blocking recv() behavior achieved with SO_RCVTIMEO.
Thus my app expects recv() to block until SO_RCVLOWAT is met or SO_RCVTIMEO
expired.  Upon timeout expiration recv() is supposed to return -1 with
errno=EAGAIN.  With recv() immediately returning on MSG_PEEK there's no
possibility for timeout expiration.  recv() behaves perfectly in this
regard without MSG_PEEK, it just needs some minor adjustment to behave
the same with MSG_PEEK specified.

There's a simple diagram illustrating the application implementation
in general.  This may help understand the implications of recv() not
blocking once there is data in the rcvbuf without wasting much of your
time:
http://serverkit.org/modules/httpx/httpx.png

Thanks for the follow-up.

Regards,
Vito Caputo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-13  8:32                       ` swivel
@ 2008-10-13  9:58                         ` David Miller
  2008-10-20  3:58                           ` swivel
  0 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-10-13  9:58 UTC (permalink / raw)
  To: swivel; +Cc: alan, linux-kernel

From: swivel@shells.gnugeneration.com
Date: Mon, 13 Oct 2008 03:32:14 -0500

> I'm using the pseudo-blocking recv() behavior achieved with SO_RCVTIMEO.
> Thus my app expects recv() to block until SO_RCVLOWAT is met or SO_RCVTIMEO
> expired.

But if you poll() properly, you'll never call recv() unless the amount
of bytes you want are there.

And since I fixed poll()'s handling of SO_RCVLOWAT it should mostly
work.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-13  9:58                         ` David Miller
@ 2008-10-20  3:58                           ` swivel
  2008-10-20  4:25                             ` David Miller
  2008-11-05 11:36                             ` David Miller
  0 siblings, 2 replies; 20+ messages in thread
From: swivel @ 2008-10-20  3:58 UTC (permalink / raw)
  To: David Miller; +Cc: alan, linux-kernel

On Mon, Oct 13, 2008 at 02:58:16AM -0700, David Miller wrote:
> From: swivel@shells.gnugeneration.com
> Date: Mon, 13 Oct 2008 03:32:14 -0500
> 
> > I'm using the pseudo-blocking recv() behavior achieved with SO_RCVTIMEO.
> > Thus my app expects recv() to block until SO_RCVLOWAT is met or SO_RCVTIMEO
> > expired.
> 
> But if you poll() properly, you'll never call recv() unless the amount
> of bytes you want are there.
> 
> And since I fixed poll()'s handling of SO_RCVLOWAT it should mostly
> work.

The app already has a kludge in place to work around the current kernel
with broken poll() and recv() with regards to SO_RCVLOWAT.

It's less than ideal but 'mostly works', so I'm already at that point...
Doesn't really make sense to me to rewrite the kludge to depend on a
very modern kernel without even having it be able to use recv()
properly.

I just hope we can have recv() block with MSG_PEEK when SO_RCVLOWAT is
>1 in the near future.  My goal was next time I get around to doing a
dist-upgrade the new kernel would have both poll and recv fixed and I
could disable the kludge.

>From what I can see the recv() MSG_PEEK fix is trivial anyways, why not
fix it?

Thanks,
Vito Caputo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-20  3:58                           ` swivel
@ 2008-10-20  4:25                             ` David Miller
  2008-11-05 11:36                             ` David Miller
  1 sibling, 0 replies; 20+ messages in thread
From: David Miller @ 2008-10-20  4:25 UTC (permalink / raw)
  To: swivel; +Cc: alan, linux-kernel

From: swivel@shells.gnugeneration.com
Date: Sun, 19 Oct 2008 22:58:19 -0500

> From what I can see the recv() MSG_PEEK fix is trivial anyways, why not
> fix it?

Patches welcome :-)

I'm just very busy and that's the only reason I haven't tackled it
myself.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Honoring SO_RCVLOWAT in proto_ops.poll methods
  2008-10-20  3:58                           ` swivel
  2008-10-20  4:25                             ` David Miller
@ 2008-11-05 11:36                             ` David Miller
  1 sibling, 0 replies; 20+ messages in thread
From: David Miller @ 2008-11-05 11:36 UTC (permalink / raw)
  To: swivel; +Cc: alan, linux-kernel

From: swivel@shells.gnugeneration.com
Date: Sun, 19 Oct 2008 22:58:19 -0500

> From what I can see the recv() MSG_PEEK fix is trivial anyways, why not
> fix it?

Does this patch work for you?

tcp: Fix recvmsg MSG_PEEK influence of blocking behavior.

Vito Caputo noticed that tcp_recvmsg() returns immediately from
partial reads when MSG_PEEK is used.  In particular, this means that
SO_RCVLOWAT is not respected.

Simply remove the test.  And this matches the behavior of several
other systems, including BSD.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index eccb716..c5aca0b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1374,8 +1374,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			    sk->sk_state == TCP_CLOSE ||
 			    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 			    !timeo ||
-			    signal_pending(current) ||
-			    (flags & MSG_PEEK))
+			    signal_pending(current))
 				break;
 		} else {
 			if (sock_flag(sk, SOCK_DONE))
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2008-11-05 11:37 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-20 21:42 Honoring SO_RCVLOWAT in proto_ops.poll methods lkml
2008-09-20 22:21 ` David Miller
2008-09-20 23:00   ` lkml
2008-09-21  9:24     ` lkml
2008-09-21 14:18       ` Alan Cox
     [not found]         ` <20080921145134.GT2761@fc6222126.aspadmin.net>
2008-09-21 20:13           ` Alan Cox
2008-09-21 22:09             ` lkml
2008-10-05 20:27               ` David Miller
2008-10-05 21:45                 ` swivel
2008-10-05 22:30                   ` David Miller
2008-10-06  5:17                     ` lkml
2008-10-06 17:18                       ` David Miller
2008-10-06 17:45                         ` David Miller
2008-10-13  7:34                     ` David Miller
2008-10-13  8:32                       ` swivel
2008-10-13  9:58                         ` David Miller
2008-10-20  3:58                           ` swivel
2008-10-20  4:25                             ` David Miller
2008-11-05 11:36                             ` David Miller
2008-09-22 12:15             ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).