linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] Regression on behavior of EPOLLET | EPOLLIN for AF_UNIX sockets in 3.2
@ 2012-01-27 17:05 Nick Mathewson
  2012-01-27 17:53 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Mathewson @ 2012-01-27 17:05 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

[1.] One line summary of the problem:

EPOLLET doesn't give edge-triggered behavior for AF_UNIX sockets in 3.2

[2.] Full description of the problem/report:

When epoll is told to listen to a readable socket with the flags
EPOLLIN|EPOLLET, it is supposed to report the event once, and then
not report the event again until the socket has first become
non-readable and then become readable again.  (This behavior is part
of the definition of edge-triggered events, IIUC.)

But with AF_UNIX sockets on Linux 3.2, a call to read() on a socket
that does not drain the socket's buffer completely can apparently
cause epoll to think that the socket has generated another event,
even if no further data has actually arrived at the socket.

This behavior did not occur in 3.1, and does not occur in 3.2 with
AF_INET sockets or with pipes.

[3.] Keywords:

networking, AF_UNIX, epoll, socket

[4.] Kernel version (from /proc/version):

First found in:

Linux version 3.2.1-3.fc16.x86_64
(mockbuild@x86-13.phx2.fedoraproject.org) (gcc version 4.6.2 20111027
(Red Hat 4.6.2-1) (GCC) ) #1 SMP Mon Jan 23 15:36:17 UTC 2012

Another user has reproduced this with:

Linux version 3.2.0-1-686-pae (Debian 3.2.1-1) (ben@decadent.org.uk)
(gcc version 4.6.2 (Debian 4.6.2-11) ) #1 SMP Thu Jan 19 10:56:51 UTC
2012

[6.] A small shell script or example program which triggers the
     problem (if possible)

#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <fcntl.h>

#include <stdio.h>
#include <errno.h>
#include <string.h>

int
main(int argc, const char **argv)
{
        int epfd;
        int pair[2];
        struct epoll_event epev;
        int n, r, n_reads;

        if ((epfd = epoll_create(32)) < 0) {
                perror("epoll_create()");
                return 2;
        }
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair) < 0) {
                perror("socketpair()");
                return 2;
        }

        if (fcntl(pair[0], F_SETFL, O_NONBLOCK) < 0) {
                perror("fcntl()");
                return 2;
        }

        memset(&epev, 0, sizeof(epev));
        epev.events = EPOLLIN | EPOLLET;
        epev.data.fd = pair[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pair[0], &epev) < 0) {
                perror("epoll_ctl()");
                return 2;
        }

        if ((n = write(pair[1], "A 21-character string", 21)) < 0) {
                perror("write()");
                return 2;
        }

        /* pair[0] should now be readable. EPOLLET above has said that we
         * want edge-triggered behavior, so we should only get a single
         * EPOLLIN event on the socket.  But on Linux 3.2, for some reason,
         * reading a single byte from the socket causes us to get another
         * EPOLLIN event.
         */
        n_reads = 0;
        while ((r = epoll_wait(epfd, &epev, 1, 500)) == 1) {
                char byte[1];
                printf("epoll_wait() said: events=%d, fd=%d\n",
                       epev.events, epev.data.fd);
                n = read(pair[0], byte, 1);
                if (n < 0 && errno == EAGAIN) {
                        puts("read() reported EAGAIN.");
                } else if (n < 0) {
                        perror("read()");
                } else if (n == 0) {
                        puts("read() reported EOF.");
                } else {
                        printf("Read %d byte(s)\n", n);
                        ++n_reads;
                }
        }
        if (r == 0) {
                puts("Timeout without event.");
        } else {
                perror("epoll_wait()");
        }

        close(pair[0]);
        close(pair[1]);
        close(epfd);

        if (n_reads == 1) {
                puts("Exactly one read event. Good.");
        } else {
                printf("Got %d read events. That's not right!\n", n_reads);
        }
        return (n_reads == 1) ? 0 : 1;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Regression on behavior of EPOLLET | EPOLLIN for AF_UNIX sockets in 3.2
  2012-01-27 17:05 [BUG] Regression on behavior of EPOLLET | EPOLLIN for AF_UNIX sockets in 3.2 Nick Mathewson
@ 2012-01-27 17:53 ` Eric Dumazet
  2012-01-27 18:17   ` Glauber Costa
  2012-01-29  2:11   ` [PATCH] af_unix: fix EPOLLET regression for stream sockets Eric Dumazet
  0 siblings, 2 replies; 7+ messages in thread
From: Eric Dumazet @ 2012-01-27 17:53 UTC (permalink / raw)
  To: Nick Mathewson; +Cc: netdev, linux-kernel, Alexey Moiseytsev

Le vendredi 27 janvier 2012 à 12:05 -0500, Nick Mathewson a écrit :
> [1.] One line summary of the problem:
> 
> EPOLLET doesn't give edge-triggered behavior for AF_UNIX sockets in 3.2
> 
> [2.] Full description of the problem/report:
> 
> When epoll is told to listen to a readable socket with the flags
> EPOLLIN|EPOLLET, it is supposed to report the event once, and then
> not report the event again until the socket has first become
> non-readable and then become readable again.  (This behavior is part
> of the definition of edge-triggered events, IIUC.)
> 
> But with AF_UNIX sockets on Linux 3.2, a call to read() on a socket
> that does not drain the socket's buffer completely can apparently
> cause epoll to think that the socket has generated another event,
> even if no further data has actually arrived at the socket.
> 
> This behavior did not occur in 3.1, and does not occur in 3.2 with
> AF_INET sockets or with pipes.
> 
> [3.] Keywords:
> 
> networking, AF_UNIX, epoll, socket
> 
> [4.] Kernel version (from /proc/version):
> 
> First found in:
> 
> Linux version 3.2.1-3.fc16.x86_64
> (mockbuild@x86-13.phx2.fedoraproject.org) (gcc version 4.6.2 20111027
> (Red Hat 4.6.2-1) (GCC) ) #1 SMP Mon Jan 23 15:36:17 UTC 2012
> 
> Another user has reproduced this with:
> 
> Linux version 3.2.0-1-686-pae (Debian 3.2.1-1) (ben@decadent.org.uk)
> (gcc version 4.6.2 (Debian 4.6.2-11) ) #1 SMP Thu Jan 19 10:56:51 UTC
> 2012
> 
> [6.] A small shell script or example program which triggers the
>      problem (if possible)
> 
> #include <sys/epoll.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <unistd.h>
> #include <fcntl.h>
> 
> #include <stdio.h>
> #include <errno.h>
> #include <string.h>
> 
> int
> main(int argc, const char **argv)
> {
>         int epfd;
>         int pair[2];
>         struct epoll_event epev;
>         int n, r, n_reads;
> 
>         if ((epfd = epoll_create(32)) < 0) {
>                 perror("epoll_create()");
>                 return 2;
>         }
>         if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair) < 0) {
>                 perror("socketpair()");
>                 return 2;
>         }
> 
>         if (fcntl(pair[0], F_SETFL, O_NONBLOCK) < 0) {
>                 perror("fcntl()");
>                 return 2;
>         }
> 
>         memset(&epev, 0, sizeof(epev));
>         epev.events = EPOLLIN | EPOLLET;
>         epev.data.fd = pair[0];
>         if (epoll_ctl(epfd, EPOLL_CTL_ADD, pair[0], &epev) < 0) {
>                 perror("epoll_ctl()");
>                 return 2;
>         }
> 
>         if ((n = write(pair[1], "A 21-character string", 21)) < 0) {
>                 perror("write()");
>                 return 2;
>         }
> 
>         /* pair[0] should now be readable. EPOLLET above has said that we
>          * want edge-triggered behavior, so we should only get a single
>          * EPOLLIN event on the socket.  But on Linux 3.2, for some reason,
>          * reading a single byte from the socket causes us to get another
>          * EPOLLIN event.
>          */
>         n_reads = 0;
>         while ((r = epoll_wait(epfd, &epev, 1, 500)) == 1) {
>                 char byte[1];
>                 printf("epoll_wait() said: events=%d, fd=%d\n",
>                        epev.events, epev.data.fd);
>                 n = read(pair[0], byte, 1);
>                 if (n < 0 && errno == EAGAIN) {
>                         puts("read() reported EAGAIN.");
>                 } else if (n < 0) {
>                         perror("read()");
>                 } else if (n == 0) {
>                         puts("read() reported EOF.");
>                 } else {
>                         printf("Read %d byte(s)\n", n);
>                         ++n_reads;
>                 }
>         }
>         if (r == 0) {
>                 puts("Timeout without event.");
>         } else {
>                 perror("epoll_wait()");
>         }
> 
>         close(pair[0]);
>         close(pair[1]);
>         close(epfd);
> 
>         if (n_reads == 1) {
>                 puts("Exactly one read event. Good.");
>         } else {
>                 printf("Got %d read events. That's not right!\n", n_reads);
>         }
>         return (n_reads == 1) ? 0 : 1;
> }
> --

Hi

Probably coming from commit 0884d7aa24e15e72b3c07f7da910a13bb7df3592
(AF_UNIX: Fix poll blocking problem when reading from a stream socket)

When we requeue skb because not completely eaten, we call again

sk->sk_data_ready(sk, skb->len);




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Regression on behavior of EPOLLET | EPOLLIN for AF_UNIX sockets in 3.2
  2012-01-27 17:53 ` Eric Dumazet
@ 2012-01-27 18:17   ` Glauber Costa
  2012-01-27 18:55     ` Eric Dumazet
  2012-01-29  2:11   ` [PATCH] af_unix: fix EPOLLET regression for stream sockets Eric Dumazet
  1 sibling, 1 reply; 7+ messages in thread
From: Glauber Costa @ 2012-01-27 18:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Nick Mathewson, netdev, linux-kernel, Alexey Moiseytsev

On 01/27/2012 09:53 PM, Eric Dumazet wrote:
> Le vendredi 27 janvier 2012 à 12:05 -0500, Nick Mathewson a écrit :
>> [1.] One line summary of the problem:
>>
>> EPOLLET doesn't give edge-triggered behavior for AF_UNIX sockets in 3.2
>>
>> [2.] Full description of the problem/report:
>>
>> When epoll is told to listen to a readable socket with the flags
>> EPOLLIN|EPOLLET, it is supposed to report the event once, and then
>> not report the event again until the socket has first become
>> non-readable and then become readable again.  (This behavior is part
>> of the definition of edge-triggered events, IIUC.)
>>
>> But with AF_UNIX sockets on Linux 3.2, a call to read() on a socket
>> that does not drain the socket's buffer completely can apparently
>> cause epoll to think that the socket has generated another event,
>> even if no further data has actually arrived at the socket.
>>
>> This behavior did not occur in 3.1, and does not occur in 3.2 with
>> AF_INET sockets or with pipes.
>>
>> [3.] Keywords:
>>
>> networking, AF_UNIX, epoll, socket
>>
>> [4.] Kernel version (from /proc/version):
>>
>> First found in:
>>
>> Linux version 3.2.1-3.fc16.x86_64
>> (mockbuild@x86-13.phx2.fedoraproject.org) (gcc version 4.6.2 20111027
>> (Red Hat 4.6.2-1) (GCC) ) #1 SMP Mon Jan 23 15:36:17 UTC 2012
>>
>> Another user has reproduced this with:
>>
>> Linux version 3.2.0-1-686-pae (Debian 3.2.1-1) (ben@decadent.org.uk)
>> (gcc version 4.6.2 (Debian 4.6.2-11) ) #1 SMP Thu Jan 19 10:56:51 UTC
>> 2012
>>
>> [6.] A small shell script or example program which triggers the
>>       problem (if possible)
>>
>> #include<sys/epoll.h>
>> #include<sys/types.h>
>> #include<sys/socket.h>
>> #include<unistd.h>
>> #include<fcntl.h>
>>
>> #include<stdio.h>
>> #include<errno.h>
>> #include<string.h>
>>
>> int
>> main(int argc, const char **argv)
>> {
>>          int epfd;
>>          int pair[2];
>>          struct epoll_event epev;
>>          int n, r, n_reads;
>>
>>          if ((epfd = epoll_create(32))<  0) {
>>                  perror("epoll_create()");
>>                  return 2;
>>          }
>>          if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair)<  0) {
>>                  perror("socketpair()");
>>                  return 2;
>>          }
>>
>>          if (fcntl(pair[0], F_SETFL, O_NONBLOCK)<  0) {
>>                  perror("fcntl()");
>>                  return 2;
>>          }
>>
>>          memset(&epev, 0, sizeof(epev));
>>          epev.events = EPOLLIN | EPOLLET;
>>          epev.data.fd = pair[0];
>>          if (epoll_ctl(epfd, EPOLL_CTL_ADD, pair[0],&epev)<  0) {
>>                  perror("epoll_ctl()");
>>                  return 2;
>>          }
>>
>>          if ((n = write(pair[1], "A 21-character string", 21))<  0) {
>>                  perror("write()");
>>                  return 2;
>>          }
>>
>>          /* pair[0] should now be readable. EPOLLET above has said that we
>>           * want edge-triggered behavior, so we should only get a single
>>           * EPOLLIN event on the socket.  But on Linux 3.2, for some reason,
>>           * reading a single byte from the socket causes us to get another
>>           * EPOLLIN event.
>>           */
>>          n_reads = 0;
>>          while ((r = epoll_wait(epfd,&epev, 1, 500)) == 1) {
>>                  char byte[1];
>>                  printf("epoll_wait() said: events=%d, fd=%d\n",
>>                         epev.events, epev.data.fd);
>>                  n = read(pair[0], byte, 1);
>>                  if (n<  0&&  errno == EAGAIN) {
>>                          puts("read() reported EAGAIN.");
>>                  } else if (n<  0) {
>>                          perror("read()");
>>                  } else if (n == 0) {
>>                          puts("read() reported EOF.");
>>                  } else {
>>                          printf("Read %d byte(s)\n", n);
>>                          ++n_reads;
>>                  }
>>          }
>>          if (r == 0) {
>>                  puts("Timeout without event.");
>>          } else {
>>                  perror("epoll_wait()");
>>          }
>>
>>          close(pair[0]);
>>          close(pair[1]);
>>          close(epfd);
>>
>>          if (n_reads == 1) {
>>                  puts("Exactly one read event. Good.");
>>          } else {
>>                  printf("Got %d read events. That's not right!\n", n_reads);
>>          }
>>          return (n_reads == 1) ? 0 : 1;
>> }
>> --
>
> Hi
>
> Probably coming from commit 0884d7aa24e15e72b3c07f7da910a13bb7df3592
> (AF_UNIX: Fix poll blocking problem when reading from a stream socket)
>
> When we requeue skb because not completely eaten, we call again
>
> sk->sk_data_ready(sk, skb->len);
>
For the record, I just confirmed this to be the case.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Regression on behavior of EPOLLET | EPOLLIN for AF_UNIX sockets in 3.2
  2012-01-27 18:17   ` Glauber Costa
@ 2012-01-27 18:55     ` Eric Dumazet
  2012-01-27 19:44       ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-01-27 18:55 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Nick Mathewson, netdev, linux-kernel, Alexey Moiseytsev

Le vendredi 27 janvier 2012 à 22:17 +0400, Glauber Costa a écrit :
> On 01/27/2012 09:53 PM, Eric Dumazet wrote:
> > Le vendredi 27 janvier 2012 à 12:05 -0500, Nick Mathewson a écrit :
> >> [1.] One line summary of the problem:
> >>
> >
> > Hi
> >
> > Probably coming from commit 0884d7aa24e15e72b3c07f7da910a13bb7df3592
> > (AF_UNIX: Fix poll blocking problem when reading from a stream socket)
> >
> > When we requeue skb because not completely eaten, we call again
> >
> > sk->sk_data_ready(sk, skb->len);
> >
> For the record, I just confirmed this to be the case.

A fix would be to change unix_poll() and not call sk_data_ready() when
skb is requeued.

if (!skb_queue_empty(&sk->sk_receive_queue))
	mask |= POLLIN | POLLRDNORM;

Might be tricky if we want to keep unix_poll() lockless, but quite
possible.

Or... not dequeue skb from sk_received_queue unless fully consumed.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG] Regression on behavior of EPOLLET | EPOLLIN for AF_UNIX sockets in 3.2
  2012-01-27 18:55     ` Eric Dumazet
@ 2012-01-27 19:44       ` Eric Dumazet
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Dumazet @ 2012-01-27 19:44 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Nick Mathewson, netdev, linux-kernel, Alexey Moiseytsev

Le vendredi 27 janvier 2012 à 19:55 +0100, Eric Dumazet a écrit :
> Le vendredi 27 janvier 2012 à 22:17 +0400, Glauber Costa a écrit :
> > On 01/27/2012 09:53 PM, Eric Dumazet wrote:
> > > Le vendredi 27 janvier 2012 à 12:05 -0500, Nick Mathewson a écrit :
> > >> [1.] One line summary of the problem:
> > >>
> > >
> > > Hi
> > >
> > > Probably coming from commit 0884d7aa24e15e72b3c07f7da910a13bb7df3592
> > > (AF_UNIX: Fix poll blocking problem when reading from a stream socket)
> > >
> > > When we requeue skb because not completely eaten, we call again
> > >
> > > sk->sk_data_ready(sk, skb->len);
> > >
> > For the record, I just confirmed this to be the case.
> 
> A fix would be to change unix_poll() and not call sk_data_ready() when
> skb is requeued.
> 
> if (!skb_queue_empty(&sk->sk_receive_queue))
> 	mask |= POLLIN | POLLRDNORM;
> 
> Might be tricky if we want to keep unix_poll() lockless, but quite
> possible.
> 
> Or... not dequeue skb from sk_received_queue unless fully consumed.
> 

I am testing following patch :

 net/unix/af_unix.c |   18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index aad8fb6..6eca195 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1918,7 +1918,7 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 		struct sk_buff *skb;
 
 		unix_state_lock(sk);
-		skb = skb_dequeue(&sk->sk_receive_queue);
+		skb = skb_peek(&sk->sk_receive_queue);
 		if (skb == NULL) {
 			unix_sk(sk)->recursion_level = 0;
 			if (copied >= target)
@@ -1959,8 +1959,6 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			/* Never glue messages from different writers */
 			if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
 			    (UNIXCB(skb).cred != siocb->scm->cred)) {
-				skb_queue_head(&sk->sk_receive_queue, skb);
-				sk->sk_data_ready(sk, skb->len);
 				break;
 			}
 		} else {
@@ -1977,8 +1975,6 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 
 		chunk = min_t(unsigned int, skb->len, size);
 		if (memcpy_toiovec(msg->msg_iov, skb->data, chunk)) {
-			skb_queue_head(&sk->sk_receive_queue, skb);
-			sk->sk_data_ready(sk, skb->len);
 			if (copied == 0)
 				copied = -EFAULT;
 			break;
@@ -1993,13 +1989,10 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			if (UNIXCB(skb).fp)
 				unix_detach_fds(siocb->scm, skb);
 
-			/* put the skb back if we didn't use it up.. */
-			if (skb->len) {
-				skb_queue_head(&sk->sk_receive_queue, skb);
-				sk->sk_data_ready(sk, skb->len);
+			if (skb->len)
 				break;
-			}
-
+			
+			skb_unlink(skb, &sk->sk_receive_queue);
 			consume_skb(skb);
 
 			if (siocb->scm->fp)
@@ -2010,9 +2003,6 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			if (UNIXCB(skb).fp)
 				siocb->scm->fp = scm_fp_dup(UNIXCB(skb).fp);
 
-			/* put message back and return */
-			skb_queue_head(&sk->sk_receive_queue, skb);
-			sk->sk_data_ready(sk, skb->len);
 			break;
 		}
 	} while (size);



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] af_unix: fix EPOLLET regression for stream sockets
  2012-01-27 17:53 ` Eric Dumazet
  2012-01-27 18:17   ` Glauber Costa
@ 2012-01-29  2:11   ` Eric Dumazet
  2012-01-30 17:45     ` David Miller
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-01-29  2:11 UTC (permalink / raw)
  To: Nick Mathewson, David Miller; +Cc: netdev, linux-kernel, Alexey Moiseytsev

Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from
a stream socket) added a regression for epoll() in Edge Triggered mode
(EPOLLET)

Appropriate fix is to use skb_peek()/skb_unlink() instead of
skb_dequeue(), and only call skb_unlink() when skb is fully consumed.

This remove the need to requeue a partial skb into sk_receive_queue head
and the extra sk->sk_data_ready() calls that added the regression.

This is safe because once skb is given to sk_receive_queue, it is not
modified by a writer, and readers are serialized by u->readlock mutex.

This also reduce number of spinlock acquisition for small reads or
MSG_PEEK users so should improve overall performance.

Reported-by: Nick Mathewson <nickm@freehaven.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Alexey Moiseytsev <himeraster@gmail.com>
---
 net/unix/af_unix.c |   21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index aad8fb6..afc2ae4 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1918,7 +1918,7 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 		struct sk_buff *skb;
 
 		unix_state_lock(sk);
-		skb = skb_dequeue(&sk->sk_receive_queue);
+		skb = skb_peek(&sk->sk_receive_queue);
 		if (skb == NULL) {
 			unix_sk(sk)->recursion_level = 0;
 			if (copied >= target)
@@ -1958,11 +1958,8 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 		if (check_creds) {
 			/* Never glue messages from different writers */
 			if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
-			    (UNIXCB(skb).cred != siocb->scm->cred)) {
-				skb_queue_head(&sk->sk_receive_queue, skb);
-				sk->sk_data_ready(sk, skb->len);
+			    (UNIXCB(skb).cred != siocb->scm->cred))
 				break;
-			}
 		} else {
 			/* Copy credentials */
 			scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
@@ -1977,8 +1974,6 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 
 		chunk = min_t(unsigned int, skb->len, size);
 		if (memcpy_toiovec(msg->msg_iov, skb->data, chunk)) {
-			skb_queue_head(&sk->sk_receive_queue, skb);
-			sk->sk_data_ready(sk, skb->len);
 			if (copied == 0)
 				copied = -EFAULT;
 			break;
@@ -1993,13 +1988,10 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			if (UNIXCB(skb).fp)
 				unix_detach_fds(siocb->scm, skb);
 
-			/* put the skb back if we didn't use it up.. */
-			if (skb->len) {
-				skb_queue_head(&sk->sk_receive_queue, skb);
-				sk->sk_data_ready(sk, skb->len);
+			if (skb->len)
 				break;
-			}
-
+			
+			skb_unlink(skb, &sk->sk_receive_queue);
 			consume_skb(skb);
 
 			if (siocb->scm->fp)
@@ -2010,9 +2002,6 @@ static int unix_stream_recvmsg(struct kiocb *iocb, struct socket *sock,
 			if (UNIXCB(skb).fp)
 				siocb->scm->fp = scm_fp_dup(UNIXCB(skb).fp);
 
-			/* put message back and return */
-			skb_queue_head(&sk->sk_receive_queue, skb);
-			sk->sk_data_ready(sk, skb->len);
 			break;
 		}
 	} while (size);



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] af_unix: fix EPOLLET regression for stream sockets
  2012-01-29  2:11   ` [PATCH] af_unix: fix EPOLLET regression for stream sockets Eric Dumazet
@ 2012-01-30 17:45     ` David Miller
  0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2012-01-30 17:45 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nickm, netdev, linux-kernel, himeraster

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 29 Jan 2012 03:11:03 +0100

> Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from
> a stream socket) added a regression for epoll() in Edge Triggered mode
> (EPOLLET)
> 
> Appropriate fix is to use skb_peek()/skb_unlink() instead of
> skb_dequeue(), and only call skb_unlink() when skb is fully consumed.
> 
> This remove the need to requeue a partial skb into sk_receive_queue head
> and the extra sk->sk_data_ready() calls that added the regression.
> 
> This is safe because once skb is given to sk_receive_queue, it is not
> modified by a writer, and readers are serialized by u->readlock mutex.
> 
> This also reduce number of spinlock acquisition for small reads or
> MSG_PEEK users so should improve overall performance.
> 
> Reported-by: Nick Mathewson <nickm@freehaven.net>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Alexey Moiseytsev <himeraster@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-01-30 17:45 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-27 17:05 [BUG] Regression on behavior of EPOLLET | EPOLLIN for AF_UNIX sockets in 3.2 Nick Mathewson
2012-01-27 17:53 ` Eric Dumazet
2012-01-27 18:17   ` Glauber Costa
2012-01-27 18:55     ` Eric Dumazet
2012-01-27 19:44       ` Eric Dumazet
2012-01-29  2:11   ` [PATCH] af_unix: fix EPOLLET regression for stream sockets Eric Dumazet
2012-01-30 17:45     ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).