* [PATCH 0/1] RFC: poll/select performance on datagram sockets
@ 2010-10-29 18:18 ` Alban Crequy
  0 siblings, 0 replies; 27+ messages in thread
From: Alban Crequy @ 2010-10-29 18:18 UTC (permalink / raw)
  To: Alban Crequy, David S. Miller, Eric Dumazet, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev, linux-kernel,
	Pauli Nieminen, Rainer Weikusat

Hi,

When a process calls poll or select, the kernel calls (struct
file_operations)->poll on every file descriptor and returns a mask of
events which are ready. If the process is only interested in POLLIN
events, the mask is still computed for POLLOUT too, and that can be
expensive. For example, on Unix datagram sockets, a process running
poll() with POLLIN will wake up when the remote end calls read(). This
is a performance regression introduced by commits
3c73419c09a5ef73d56472dbfdade9e311496e9b and
ec0d215f9420564fc8286dcf93d2d068bb53a07e, which fixed another bug.
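
For reference, the current hook prototype (from include/linux/fs.h, also
visible in the patch below) gives the callee no way to know which events
the caller actually cares about:

	unsigned int (*poll) (struct file *, struct poll_table_struct *);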

The attached program illustrates the problem. It compares the
performance of sending/receiving data on a Unix datagram socket while
select() is running in another process. When the datagram sockets are
not connected, the performance problem is not triggered, but when they
are connected it becomes a lot slower. On my computer, I get the
following times:

Connected datagram sockets: >4 seconds
Non-connected datagram sockets: <1 second

The patch attached in the next email fixes the performance problem: it
becomes <1 second in both cases. I am not suggesting the patch for
inclusion; I would rather change the prototype of (struct
file_operations)->poll than add ->poll2. But there are a lot of poll
functions to change (grep tells me 337 functions).

Any opinions?

Best regards,
Alban Crequy


---------------
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memset, strcmp, strcpy */
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>   /* wait */
#include <time.h>

int
main(int argc, char *argv[])
{
#define NB_CLIENTS 800
  int client_fds[NB_CLIENTS];
  int server_fd;
  int max_fds = 0;
  struct sockaddr_un addr_server;
  struct sockaddr_un addr_client;
  fd_set rfds;
  struct timeval tv;
  int retval;
  int i;
  char buffer[1024];
  int pair[2] = {0,0};
  time_t t1, t2;
  int trigger;

  if (argc != 2)
    exit(1);

  if (strcmp(argv[1], "connected") == 0)
    {
      printf("The performance problem will be triggered\n");
      trigger = 1;
    }
  else
    {
      printf("The performance will be good\n");
      trigger = 0;
    }

  memset(&addr_server, 0, sizeof(addr_server));
  addr_server.sun_family = AF_UNIX;
  addr_server.sun_path[0] = '\0';
  strcpy(addr_server.sun_path + 1, "dgram_perfs_server");

  memset(&addr_client, 0, sizeof(addr_client));
  addr_client.sun_family = AF_UNIX;
  addr_client.sun_path[0] = '\0';
  strcpy(addr_client.sun_path + 1, "dgram_perfs_client");

  server_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
  bind(server_fd, (struct sockaddr*)&addr_server, sizeof(addr_server));

  FD_ZERO(&rfds);
  for (i = 0 ; i < NB_CLIENTS ; i++)
    {
      client_fds[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
      FD_SET(client_fds[i], &rfds);
      if (client_fds[i] > max_fds)
        max_fds = client_fds[i];
    }
  max_fds++;

  /* Do not connect the first one */
  for (i = 1 ; i < NB_CLIENTS ; i++)
    {
      if (trigger)
        connect(client_fds[i], (struct sockaddr*)&addr_server, sizeof(addr_server));
    }
  bind(client_fds[0], (struct sockaddr*)&addr_client, sizeof(addr_client));

  tv.tv_sec = 500;
  tv.tv_usec = 0;

  socketpair(AF_UNIX, SOCK_STREAM, 0, pair);

  if (fork() != 0)
  {
    recv(pair[1], buffer, 1024, 0);
    usleep(1000);
    for (i = 0 ; i < 20000 ; i++)
    {
      sendto(client_fds[0], "S", 1, 0, (struct sockaddr*)&addr_server, sizeof(addr_server));
      recv(server_fd, buffer, 1024, 0);
    }
    sendto(client_fds[0], "Z", 1, 0, (struct sockaddr*)&addr_client, sizeof(addr_client));
    wait(NULL);
    exit(0);
  }

  send(pair[0], "W", 1, 0);
  printf("select: begin\n");
  t1 = time(NULL);
  retval = select(max_fds, &rfds, NULL, NULL, &tv);
  t2 = time(NULL);
  printf("select: end: %d seconds\n", t2 - t1);

  return 0;
}
---------------


^ permalink raw reply	[flat|nested] 27+ messages in thread


* [PATCH] poll/select performance on datagram sockets
  2010-10-29 18:18 ` Alban Crequy
  (?)
@ 2010-10-29 18:21 ` Alban Crequy
  -1 siblings, 0 replies; 27+ messages in thread
From: Alban Crequy @ 2010-10-29 18:21 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Alban Crequy, David S. Miller, Eric Dumazet, netdev,
	linux-kernel, Pauli Nieminen, Rainer Weikusat

---
 fs/select.c         |   27 ++++++++++++++++++++++++---
 include/linux/fs.h  |    1 +
 include/linux/net.h |    3 +++
 net/socket.c        |   15 ++++++++++-----
 net/unix/af_unix.c  |   17 +++++++++++------
 5 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 500a669..ff46afa 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -449,9 +449,21 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 				if (file) {
 					f_op = file->f_op;
 					mask = DEFAULT_POLLMASK;
-					if (f_op && f_op->poll) {
+					if (f_op && f_op->poll2) {
+						unsigned long filter_mask = 0;
+						filter_mask |= (in & bit)
+							? POLLIN_SET : 0;
+						filter_mask |= (out & bit)
+							? POLLOUT_SET : 0;
+						filter_mask |= (ex & bit)
+							? POLLEX_SET : 0;
 						wait_key_set(wait, in, out, bit);
-						mask = (*f_op->poll)(file, wait);
+						mask = (*f_op->poll2)(file,
+							wait, filter_mask);
+					} else if (f_op && f_op->poll) {
+						wait_key_set(wait, in, out, bit);
+						mask = (*f_op->poll)(file,
+							wait);
 					}
 					fput_light(file, fput_needed);
 					if ((mask & POLLIN_SET) && (in & bit)) {
@@ -738,7 +750,16 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
 		mask = POLLNVAL;
 		if (file != NULL) {
 			mask = DEFAULT_POLLMASK;
-			if (file->f_op && file->f_op->poll) {
+			if (file->f_op && file->f_op->poll2) {
+				unsigned long filter_mask;
+				filter_mask = pollfd->events
+						| POLLERR | POLLHUP;
+				if (pwait)
+					pwait->key = pollfd->events |
+							POLLERR | POLLHUP;
+				mask = file->f_op->poll2(file, pwait,
+							filter_mask);
+			} else if (file->f_op && file->f_op->poll) {
 				if (pwait)
 					pwait->key = pollfd->events |
 							POLLERR | POLLHUP;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 509ca14..dd0eb7a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1493,6 +1493,7 @@ struct file_operations {
 	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	int (*readdir) (struct file *, void *, filldir_t);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
+	unsigned int (*poll2) (struct file *, struct poll_table_struct *, unsigned long);
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
diff --git a/include/linux/net.h b/include/linux/net.h
index dee0b11..1f33d1c 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -175,6 +175,9 @@ struct proto_ops {
 				      int *sockaddr_len, int peer);
 	unsigned int	(*poll)	     (struct file *file, struct socket *sock,
 				      struct poll_table_struct *wait);
+	unsigned int	(*poll2)     (struct file *file, struct socket *sock,
+				      struct poll_table_struct *wait,
+				      unsigned long filter_mask);
 	int		(*ioctl)     (struct socket *sock, unsigned int cmd,
 				      unsigned long arg);
 #ifdef CONFIG_COMPAT
diff --git a/net/socket.c b/net/socket.c
index 367d547..38b7786 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -113,8 +113,9 @@ static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov,
 static int sock_mmap(struct file *file, struct vm_area_struct *vma);
 
 static int sock_close(struct inode *inode, struct file *file);
-static unsigned int sock_poll(struct file *file,
-			      struct poll_table_struct *wait);
+static unsigned int sock_poll2(struct file *file,
+			      struct poll_table_struct *wait,
+			      unsigned long filter_mask);
 static long sock_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
 #ifdef CONFIG_COMPAT
 static long compat_sock_ioctl(struct file *file,
@@ -137,7 +138,7 @@ static const struct file_operations socket_file_ops = {
 	.llseek =	no_llseek,
 	.aio_read =	sock_aio_read,
 	.aio_write =	sock_aio_write,
-	.poll =		sock_poll,
+	.poll2 =	sock_poll2,
 	.unlocked_ioctl = sock_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = compat_sock_ioctl,
@@ -1049,7 +1050,8 @@ out_release:
 }
 
 /* No kernel lock held - perfect */
-static unsigned int sock_poll(struct file *file, poll_table *wait)
+static unsigned int sock_poll2(struct file *file, poll_table *wait,
+			       unsigned long filter_mask)
 {
 	struct socket *sock;
 
@@ -1057,7 +1059,10 @@ static unsigned int sock_poll(struct file *file, poll_table *wait)
 	 *      We can't return errors to poll, so it's either yes or no.
 	 */
 	sock = file->private_data;
-	return sock->ops->poll(file, sock, wait);
+	if (sock->ops->poll2)
+		return sock->ops->poll2(file, sock, wait, filter_mask);
+	else
+		return sock->ops->poll(file, sock, wait);
 }
 
 static int sock_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 617bea4..60e30ea 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -489,7 +489,7 @@ static int unix_accept(struct socket *, struct socket *, int);
 static int unix_getname(struct socket *, struct sockaddr *, int *, int);
 static unsigned int unix_poll(struct file *, struct socket *, poll_table *);
 static unsigned int unix_dgram_poll(struct file *, struct socket *,
-				    poll_table *);
+				    poll_table *, unsigned long);
 static int unix_ioctl(struct socket *, unsigned int, unsigned long);
 static int unix_shutdown(struct socket *, int);
 static int unix_stream_sendmsg(struct kiocb *, struct socket *,
@@ -535,7 +535,7 @@ static const struct proto_ops unix_dgram_ops = {
 	.socketpair =	unix_socketpair,
 	.accept =	sock_no_accept,
 	.getname =	unix_getname,
-	.poll =		unix_dgram_poll,
+	.poll2 =	unix_dgram_poll,
 	.ioctl =	unix_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	unix_shutdown,
@@ -556,7 +556,7 @@ static const struct proto_ops unix_seqpacket_ops = {
 	.socketpair =	unix_socketpair,
 	.accept =	unix_accept,
 	.getname =	unix_getname,
-	.poll =		unix_dgram_poll,
+	.poll2 =	unix_dgram_poll,
 	.ioctl =	unix_ioctl,
 	.listen =	unix_listen,
 	.shutdown =	unix_shutdown,
@@ -2031,7 +2031,8 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 }
 
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
-				    poll_table *wait)
+				    poll_table *wait,
+				    unsigned long filter_mask)
 {
 	struct sock *sk = sock->sk, *other;
 	unsigned int mask, writable;
@@ -2061,8 +2062,12 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 			return mask;
 	}
 
-	/* writable? */
-	writable = unix_writable(sk);
+	/* writable?
+	 * sock_poll_wait() is expensive. Don't bother checking if it is not
+	 * requested.
+	 */
+	writable = (filter_mask & (POLLOUT | POLLWRNORM | POLLWRBAND))
+		    && unix_writable(sk);
 	if (writable) {
 		other = unix_peer_get(sk);
 		if (other) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 18:18 ` Alban Crequy
  (?)
  (?)
@ 2010-10-29 19:27 ` Eric Dumazet
  2010-10-29 20:08   ` Davide Libenzi
                     ` (3 more replies)
  -1 siblings, 4 replies; 27+ messages in thread
From: Eric Dumazet @ 2010-10-29 19:27 UTC (permalink / raw)
  To: Alban Crequy
  Cc: David S. Miller, Stephen Hemminger, Cyrill Gorcunov,
	Alexey Dobriyan, netdev, linux-kernel, Pauli Nieminen,
	Rainer Weikusat

On Friday 29 October 2010 at 19:18 +0100, Alban Crequy wrote:
> Hi,
> 
> When a process calls the poll or select, the kernel calls (struct
> file_operations)->poll on every file descriptor and returns a mask of
> events which are ready. If the process is only interested by POLLIN
> events, the mask is still computed for POLLOUT and it can be expensive.
> For example, on Unix datagram sockets, a process running poll() with
> POLLIN will wakes-up when the remote end call read(). This is a
> performance regression introduced when fixing another bug by
> 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> 
> The attached program illustrates the problem. It compares the
> performance of sending/receiving data on an Unix datagram socket and
> select(). When the datagram sockets are not connected, the performance
> problem is not triggered, but when they are connected it becomes a lot
> slower. On my computer, I have the following time:
> 
> Connected datagram sockets: >4 seconds
> Non-connected datagram sockets: <1 second
> 
> The patch attached in the next email fixes the performance problem: it
> becomes <1 second for both cases. I am not suggesting the patch for
> inclusion; I would like to change the prototype of (struct
> file_operations)->poll instead of adding ->poll2. But there is a lot of
> poll functions to change (grep tells me 337 functions).
> 
> Any opinions?

My opinion would be to use epoll() for this kind of workload.

Also, unix_dgram_poll() being slow can probably be
addressed separately.
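
For the test program from the original mail, that would mean registering
only EPOLLIN interest once per descriptor and then blocking in
epoll_wait(). A minimal sketch (untested, assuming the same
client_fds[]/NB_CLIENTS setup as in that program and an extra
#include <sys/epoll.h>):

  int epfd = epoll_create(NB_CLIENTS);
  struct epoll_event ev, events[NB_CLIENTS];
  int i, n;

  for (i = 0; i < NB_CLIENTS; i++)
    {
      ev.events = EPOLLIN;              /* read readiness only */
      ev.data.fd = client_fds[i];
      epoll_ctl(epfd, EPOLL_CTL_ADD, client_fds[i], &ev);
    }
  /* 500 s timeout, matching the select() call it replaces */
  n = epoll_wait(epfd, events, NB_CLIENTS, 500 * 1000);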




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 19:27 ` [PATCH 0/1] RFC: " Eric Dumazet
@ 2010-10-29 20:08   ` Davide Libenzi
  2010-10-29 20:20     ` Eric Dumazet
  2010-10-29 20:46     ` Davide Libenzi
  2010-10-29 20:20   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Jesper Juhl
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Davide Libenzi @ 2010-10-29 20:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

On Fri, 29 Oct 2010, Eric Dumazet wrote:

> Le vendredi 29 octobre 2010 à 19:18 +0100, Alban Crequy a écrit :
> > Hi,
> > 
> > When a process calls the poll or select, the kernel calls (struct
> > file_operations)->poll on every file descriptor and returns a mask of
> > events which are ready. If the process is only interested by POLLIN
> > events, the mask is still computed for POLLOUT and it can be expensive.
> > For example, on Unix datagram sockets, a process running poll() with
> > POLLIN will wakes-up when the remote end call read(). This is a
> > performance regression introduced when fixing another bug by
> > 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> > ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> > 
> > The attached program illustrates the problem. It compares the
> > performance of sending/receiving data on an Unix datagram socket and
> > select(). When the datagram sockets are not connected, the performance
> > problem is not triggered, but when they are connected it becomes a lot
> > slower. On my computer, I have the following time:
> > 
> > Connected datagram sockets: >4 seconds
> > Non-connected datagram sockets: <1 second
> > 
> > The patch attached in the next email fixes the performance problem: it
> > becomes <1 second for both cases. I am not suggesting the patch for
> > inclusion; I would like to change the prototype of (struct
> > file_operations)->poll instead of adding ->poll2. But there is a lot of
> > poll functions to change (grep tells me 337 functions).
> > 
> > Any opinions?
> 
> My opinion would be to use epoll() for this kind of workload.

Yeah, epoll does check for event hints coming with the callback wakeup,
and avoids waking up epoll_wait() waiters for non-matching events.
Most of the devices we care about have been modified to report the event
mask with the wakeup call.
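
Schematically (a simplified sketch, not the actual fs/eventpoll.c code):
the waker passes the mask of events that just became ready as the wakeup
"key", and epoll's wait-queue callback drops the wakeup when that key has
no overlap with the events the watcher registered:

	/* waker side: the second argument becomes the wakeup key */
	wake_up_interruptible_sync_poll(&wq->wait, POLLOUT | POLLWRNORM);

	/* watcher side, inside epoll's callback (simplified): ignore the
	 * wakeup if none of the ready bits were asked for */
	if (key && !((unsigned long) key & epi->event.events))
		return 0;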


- Davide


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 19:27 ` [PATCH 0/1] RFC: " Eric Dumazet
  2010-10-29 20:08   ` Davide Libenzi
@ 2010-10-29 20:20   ` Jesper Juhl
  2010-10-29 20:40     ` David Miller
  2010-10-30  6:44   ` [PATCH] af_unix: unix_write_space() use keyed wakeups Eric Dumazet
  2010-10-30 11:34   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Alban Crequy
  3 siblings, 1 reply; 27+ messages in thread
From: Jesper Juhl @ 2010-10-29 20:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev, linux-kernel,
	Pauli Nieminen, Rainer Weikusat

On Fri, 29 Oct 2010, Eric Dumazet wrote:

> Le vendredi 29 octobre 2010 à 19:18 +0100, Alban Crequy a écrit :
> > Hi,
> > 
> > When a process calls the poll or select, the kernel calls (struct
> > file_operations)->poll on every file descriptor and returns a mask of
> > events which are ready. If the process is only interested by POLLIN
> > events, the mask is still computed for POLLOUT and it can be expensive.
> > For example, on Unix datagram sockets, a process running poll() with
> > POLLIN will wakes-up when the remote end call read(). This is a
> > performance regression introduced when fixing another bug by
> > 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> > ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> > 
> > The attached program illustrates the problem. It compares the
> > performance of sending/receiving data on an Unix datagram socket and
> > select(). When the datagram sockets are not connected, the performance
> > problem is not triggered, but when they are connected it becomes a lot
> > slower. On my computer, I have the following time:
> > 
> > Connected datagram sockets: >4 seconds
> > Non-connected datagram sockets: <1 second
> > 
> > The patch attached in the next email fixes the performance problem: it
> > becomes <1 second for both cases. I am not suggesting the patch for
> > inclusion; I would like to change the prototype of (struct
> > file_operations)->poll instead of adding ->poll2. But there is a lot of
> > poll functions to change (grep tells me 337 functions).
> > 
> > Any opinions?
> 
> My opinion would be to use epoll() for this kind of workload.
> 
Sorry to intrude out of the blue without really understanding the kernel
side of most of the code in question, but if there's a performance
regression for applications using poll(), shouldn't we address that so we
get back to the prior performance level, rather than requiring all
userspace apps to switch to epoll()?

-- 
Jesper Juhl <jj@chaosbits.net>             http://www.chaosbits.net/
Plain text mails only, please      http://www.expita.com/nomime.html
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 20:08   ` Davide Libenzi
@ 2010-10-29 20:20     ` Eric Dumazet
  2010-10-29 20:46     ` Davide Libenzi
  1 sibling, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2010-10-29 20:20 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

On Friday 29 October 2010 at 13:08 -0700, Davide Libenzi wrote:

> Yeah, epoll does check for event hints coming with the callback wakeup, 
> and avoid waking up epoll_wait() waiters, for non matching events.
> Most of the devices we care about, have been modified to report the event 
> mask with the wakeup call.

Alban's test program is _very_ pathological:

All the time is consumed in do_select() because of false sharing between
two tasks.

We can probably rearrange variables in do_select() to make this false
sharing less problematic. I am taking a look at this.

Events: 3K cycles
+     26.14%  uclient  [kernel.kallsyms]  [k] do_raw_spin_lock              
+     21.11%  uclient  [kernel.kallsyms]  [k] do_select                     
+     13.38%  uclient  [kernel.kallsyms]  [k] pollwake                      
+      9.22%  uclient  [kernel.kallsyms]  [k] unix_dgram_poll               
+      5.24%  uclient  [kernel.kallsyms]  [k] unix_peer_get                 
+      3.04%  uclient  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore   
+      3.03%  uclient  [kernel.kallsyms]  [k] task_rq_lock                  
+      2.85%  uclient  [kernel.kallsyms]  [k] do_raw_spin_unlock            
+      1.84%  uclient  [kernel.kallsyms]  [k] try_to_wake_up                
+      1.55%  uclient  [kernel.kallsyms]  [k] fget_light                    
+      1.34%  uclient  [kernel.kallsyms]  [k] core_kernel_text              


annotate :

   5.66 :        410fb342:       85 ff                   test   %edi,%edi                  
    0.00 :        410fb344:       74 1f                   je     410fb365 <do_select+0x3d5> 
    0.13 :        410fb346:       85 b5 6c fd ff ff       test   %esi,-0x294(%ebp)          
    0.00 :        410fb34c:       74 17                   je     410fb365 <do_select+0x3d5> 
         :                                                        res_out |= bit;           
    0.00 :        410fb34e:       09 b5 5c fd ff ff       or     %esi,-0x2a4(%ebp)          
         :                                                        retval++;                 
    0.00 :        410fb354:       83 85 64 fd ff ff 01    addl   $0x1,-0x29c(%ebp)          
         :                                                        wait = NULL;              
    0.00 :        410fb35b:       c7 85 7c fd ff ff 00    movl   $0x0,-0x284(%ebp)          
    0.00 :        410fb362:       00 00 00                                                  
         :                                               }            
         :                                                if ((mask & POLLEX_SET) && (ex & bit)) {
   43.27 :        410fb365:       85 d2                   test   %edx,%edx                 
    0.00 :        410fb367:       0f 84 f3 fe ff ff       je     410fb260 <do_select+0x2d0>
    0.00 :        410fb36d:       85 b5 74 fd ff ff       test   %esi,-0x28c(%ebp)         
    0.00 :        410fb373:       0f 84 e7 fe ff ff       je     410fb260 <do_select+0x2d0>
         :                                                        res_ex |= bit;           
    0.00 :        410fb379:       09 b5 58 fd ff ff       or     %esi,-0x2a8(%ebp)         
         :                                if (all_bits == 0) {   
         :                                        i += __NFDBITS;
         :                                        continue;
         :                                }



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 20:20   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Jesper Juhl
@ 2010-10-29 20:40     ` David Miller
  2010-10-29 20:45       ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2010-10-29 20:40 UTC (permalink / raw)
  To: jj
  Cc: eric.dumazet, alban.crequy, shemminger, gorcunov, adobriyan,
	netdev, linux-kernel, pauli.nieminen, rweikusat

From: Jesper Juhl <jj@chaosbits.net>
Date: Fri, 29 Oct 2010 22:20:12 +0200 (CEST)

> Sorry to intrude out of the blue without really understanding the kernel 
> side of most of the code in question, but if there's a performance 
> regression for applications using poll() shouldn't we address that so we 
> get back to the prior performance level rather than requireing all 
> userspace apps to switch to epoll() ??

For a pathological program like Alban's test case, I say
absolutely not.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 20:40     ` David Miller
@ 2010-10-29 20:45       ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2010-10-29 20:45 UTC (permalink / raw)
  To: David Miller
  Cc: jj, alban.crequy, shemminger, gorcunov, adobriyan, netdev,
	linux-kernel, pauli.nieminen, rweikusat

On Friday 29 October 2010 at 13:40 -0700, David Miller wrote:
> From: Jesper Juhl <jj@chaosbits.net>
> Date: Fri, 29 Oct 2010 22:20:12 +0200 (CEST)
> 
> > Sorry to intrude out of the blue without really understanding the kernel 
> > side of most of the code in question, but if there's a performance 
> > regression for applications using poll() shouldn't we address that so we 
> > get back to the prior performance level rather than requireing all 
> > userspace apps to switch to epoll() ??
> 
> For such a pathological program like Alban's test case, I say
> absolutely not.

Yes, and with some help from the perf tool, we can probably find out how
to speed things up again, with no API change.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 20:08   ` Davide Libenzi
  2010-10-29 20:20     ` Eric Dumazet
@ 2010-10-29 20:46     ` Davide Libenzi
  2010-10-29 21:05       ` Eric Dumazet
  2010-10-30  9:53       ` [PATCH] af_unix: optimize unix_dgram_poll() Eric Dumazet
  1 sibling, 2 replies; 27+ messages in thread
From: Davide Libenzi @ 2010-10-29 20:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

On Fri, 29 Oct 2010, Davide Libenzi wrote:

> On Fri, 29 Oct 2010, Eric Dumazet wrote:
> 
> > Le vendredi 29 octobre 2010 à 19:18 +0100, Alban Crequy a écrit :
> > > Hi,
> > > 
> > > When a process calls the poll or select, the kernel calls (struct
> > > file_operations)->poll on every file descriptor and returns a mask of
> > > events which are ready. If the process is only interested by POLLIN
> > > events, the mask is still computed for POLLOUT and it can be expensive.
> > > For example, on Unix datagram sockets, a process running poll() with
> > > POLLIN will wakes-up when the remote end call read(). This is a
> > > performance regression introduced when fixing another bug by
> > > 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> > > ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> > > 
> > > The attached program illustrates the problem. It compares the
> > > performance of sending/receiving data on an Unix datagram socket and
> > > select(). When the datagram sockets are not connected, the performance
> > > problem is not triggered, but when they are connected it becomes a lot
> > > slower. On my computer, I have the following time:
> > > 
> > > Connected datagram sockets: >4 seconds
> > > Non-connected datagram sockets: <1 second
> > > 
> > > The patch attached in the next email fixes the performance problem: it
> > > becomes <1 second for both cases. I am not suggesting the patch for
> > > inclusion; I would like to change the prototype of (struct
> > > file_operations)->poll instead of adding ->poll2. But there is a lot of
> > > poll functions to change (grep tells me 337 functions).
> > > 
> > > Any opinions?
> > 
> > My opinion would be to use epoll() for this kind of workload.
> 
> Yeah, epoll does check for event hints coming with the callback wakeup, 
> and avoid waking up epoll_wait() waiters, for non matching events.

Also, why not use the existing wait->key instead of adding a poll2()?


- Davide

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 20:46     ` Davide Libenzi
@ 2010-10-29 21:05       ` Eric Dumazet
  2010-10-29 21:57         ` Davide Libenzi
  2010-10-30  9:53       ` [PATCH] af_unix: optimize unix_dgram_poll() Eric Dumazet
  1 sibling, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2010-10-29 21:05 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

The following patch solves the problem for me, and it's only a start; I am
pretty sure we can optimize more than that.

What I did is move "struct poll_wqueues table;" out of do_select() and
into its caller. This structure is heavily modified by another CPU in
Alban's workload, and this slows down do_select() because many accesses
to some of its local variables (normally private ones!) hit false
sharing.

Before :

# time /root/uclient connected
The performance problem will be triggered
select: begin
select: end: 3 seconds

real	0m3.101s
user	0m0.000s
sys	0m6.104s

After :

# time /root/uclient connected
The performance problem will be triggered
select: begin
select: end: 1 seconds

real	0m1.041s
user	0m0.004s
sys	0m2.040s


Impressive, no?

Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 fs/compat.c          |    3 ++-
 fs/select.c          |   19 ++++++++++---------
 include/linux/poll.h |    3 ++-
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index ff66c0d..3c3c3d3 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1704,6 +1704,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
 	int size, max_fds, ret = -EINVAL;
 	struct fdtable *fdt;
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
+	struct poll_wqueues table;
 
 	if (n < 0)
 		goto out_nofds;
@@ -1744,7 +1745,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
 	zero_fd_set(n, fds.res_out);
 	zero_fd_set(n, fds.res_ex);
 
-	ret = do_select(n, &fds, end_time);
+	ret = do_select(n, &fds, end_time, &table);
 
 	if (ret < 0)
 		goto out;
diff --git a/fs/select.c b/fs/select.c
index b7b10aa..ed054f5 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -393,10 +393,10 @@ static inline void wait_key_set(poll_table *wait, unsigned long in,
 	}
 }
 
-int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
+int do_select(int n, fd_set_bits *fds, struct timespec *end_time,
+	      struct poll_wqueues *table)
 {
 	ktime_t expire, *to = NULL;
-	struct poll_wqueues table;
 	poll_table *wait;
 	int retval, i, timed_out = 0;
 	unsigned long slack = 0;
@@ -409,8 +409,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 		return retval;
 	n = retval;
 
-	poll_initwait(&table);
-	wait = &table.pt;
+	poll_initwait(table);
+	wait = &table->pt;
 	if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
 		wait = NULL;
 		timed_out = 1;
@@ -482,8 +482,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 		wait = NULL;
 		if (retval || timed_out || signal_pending(current))
 			break;
-		if (table.error) {
-			retval = table.error;
+		if (table->error) {
+			retval = table->error;
 			break;
 		}
 
@@ -497,12 +497,12 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 			to = &expire;
 		}
 
-		if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
+		if (!poll_schedule_timeout(table, TASK_INTERRUPTIBLE,
 					   to, slack))
 			timed_out = 1;
 	}
 
-	poll_freewait(&table);
+	poll_freewait(table);
 
 	return retval;
 }
@@ -528,6 +528,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	struct fdtable *fdt;
 	/* Allocate small arguments on the stack to save memory and be faster */
 	long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
+	struct poll_wqueues table;
 
 	ret = -EINVAL;
 	if (n < 0)
@@ -570,7 +571,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 	zero_fd_set(n, fds.res_out);
 	zero_fd_set(n, fds.res_ex);
 
-	ret = do_select(n, &fds, end_time);
+	ret = do_select(n, &fds, end_time, &table);
 
 	if (ret < 0)
 		goto out;
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 56e76af..54ba625 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -130,7 +130,8 @@ void zero_fd_set(unsigned long nr, unsigned long *fdset)
 
 #define MAX_INT64_SECONDS (((s64)(~((u64)0)>>1)/HZ)-1)
 
-extern int do_select(int n, fd_set_bits *fds, struct timespec *end_time);
+extern int do_select(int n, fd_set_bits *fds, struct timespec *end_time,
+		     struct poll_wqueues *table);
 extern int do_sys_poll(struct pollfd __user * ufds, unsigned int nfds,
 		       struct timespec *end_time);
 extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 21:05       ` Eric Dumazet
@ 2010-10-29 21:57         ` Davide Libenzi
  2010-10-29 22:08           ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Davide Libenzi @ 2010-10-29 21:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

On Fri, 29 Oct 2010, Eric Dumazet wrote:

> Following patch solves the problem for me, and its only a start, I am
> pretty sure we can optimize more than that.
> 
> What I did is to move "struct poll_wqueues table;" out of do_select()
> in its caller. This structure is highly modified by an other cpu in
> Alban workload, and this slow down do_select() because many accesses to
> some of its local variables (normally, private ones !) hit a false
> sharing.
> 
> Before :
> 
> # time /root/uclient connected
> The performance problem will be triggered
> select: begin
> select: end: 3 seconds
> 
> real	0m3.101s
> user	0m0.000s
> sys	0m6.104s
> 
> After :
> 
> # time /root/uclient connected
> The performance problem will be triggered
> select: begin
> select: end: 1 seconds
> 
> real	0m1.041s
> user	0m0.004s
> sys	0m2.040s
> 
> 
> Impressive no ?

It is, though I am not sure that shuffling code around to chase false
sharing in very specific cases is an approach worth following.
The new position of the table on the stack might share with another hot
line under a different load.


- Davide



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 21:57         ` Davide Libenzi
@ 2010-10-29 22:08           ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2010-10-29 22:08 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

On Friday 29 October 2010 at 14:57 -0700, Davide Libenzi wrote:

> It is, though I am not sure that shuffling code around to chase false 
> sharing on very specific cases, is an approach worth following.
> The new position of the table on the stack might share with another hot 
> line, under a different load.
> 

Sure. It's only a start, as I said. We should identify the shared portion
and make sure it sits in a dedicated cache line.
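
As an illustration of that (a hypothetical layout only, not a real kernel
struct): the members other CPUs write during wakeups -- the 'triggered'
flag and the wait queue entries touched by pollwake() -- could be pushed
onto their own cache line, away from the fields do_select() keeps
re-reading in its loop:

	struct poll_wqueues_sketch {
		poll_table		pt;
		int			error;
		/* written remotely by pollwake(): */
		int			triggered ____cacheline_aligned_in_smp;
		struct poll_table_entry	inline_entries[N_INLINE_POLL_ENTRIES];
	};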




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] af_unix: unix_write_space() use keyed wakeups
  2010-10-29 19:27 ` [PATCH 0/1] RFC: " Eric Dumazet
  2010-10-29 20:08   ` Davide Libenzi
  2010-10-29 20:20   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Jesper Juhl
@ 2010-10-30  6:44   ` Eric Dumazet
  2010-10-30 15:03     ` Davide Libenzi
  2010-10-30 21:36     ` Alban Crequy
  2010-10-30 11:34   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Alban Crequy
  3 siblings, 2 replies; 27+ messages in thread
From: Eric Dumazet @ 2010-10-30  6:44 UTC (permalink / raw)
  To: Alban Crequy, Davide Libenzi
  Cc: David S. Miller, Stephen Hemminger, Cyrill Gorcunov,
	Alexey Dobriyan, netdev, linux-kernel, Pauli Nieminen,
	Rainer Weikusat

On Friday 29 October 2010 at 21:27 +0200, Eric Dumazet wrote:
> Le vendredi 29 octobre 2010 à 19:18 +0100, Alban Crequy a écrit :
> > Hi,
> > 
> > When a process calls the poll or select, the kernel calls (struct
> > file_operations)->poll on every file descriptor and returns a mask of
> > events which are ready. If the process is only interested by POLLIN
> > events, the mask is still computed for POLLOUT and it can be expensive.
> > For example, on Unix datagram sockets, a process running poll() with
> > POLLIN will wakes-up when the remote end call read(). This is a
> > performance regression introduced when fixing another bug by
> > 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> > ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> > 

unix_write_space() doesn't yet use wake_up_interruptible_sync_poll() to
restrict wakeups to sleepers interested in POLLOUT | POLLWRNORM |
POLLWRBAND. The same goes for unix_dgram_recvmsg().

In your pathological case, each time the other process calls
unix_dgram_recvmsg(), it loops over 800 pollwake() /
default_wake_function() / try_to_wake_up() calls, which are obviously
expensive, as you pointed out with your test program, carefully designed
to show the false sharing and O(N^2) effect :)

Once the do_select() thread can _really_ block, the false sharing problem
disappears for good.

We still loop over 800 items on each wake_up_interruptible_sync_poll()
call, so maybe we want to optimize this later by adding a global key
ORing all item keys. I don't think it's worth the added complexity, given
the biased usage of your program (800 'listeners' to one event). Is it a
real-life scenario?
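
As an illustration of that "global key" idea (a purely hypothetical
sketch, not implemented here; keeping all_keys up to date as sleepers
come and go is omitted):

	struct keyed_wait_head {
		wait_queue_head_t	wq;
		unsigned long		all_keys;	/* OR of sleepers' keys */
	};

	static inline void keyed_wake_up_poll(struct keyed_wait_head *kwh,
					      unsigned long events)
	{
		/* skip walking the queue when nobody asked for these events */
		if (kwh->all_keys & events)
			wake_up_interruptible_sync_poll(&kwh->wq, events);
	}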

Thanks

[PATCH] af_unix: use keyed wakeups

Instead of waking up all sleepers, use wake_up_interruptible_sync_poll()
to wake up only the ones interested in writing to the socket.

This patch is a specialization of commit 37e5540b3c9d (epoll keyed
wakeups: make sockets use keyed wakeups).

On a test program provided by Alban Crequy:

Before:
real    0m3.101s
user    0m0.000s
sys     0m6.104s

After:

real	0m0.211s
user	0m0.000s
sys	0m0.208s

Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
---
 net/unix/af_unix.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3c95304..f33c595 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -316,7 +316,8 @@ static void unix_write_space(struct sock *sk)
 	if (unix_writable(sk)) {
 		wq = rcu_dereference(sk->sk_wq);
 		if (wq_has_sleeper(wq))
-			wake_up_interruptible_sync(&wq->wait);
+			wake_up_interruptible_sync_poll(&wq->wait,
+				POLLOUT | POLLWRNORM | POLLWRBAND);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 	rcu_read_unlock();
@@ -1710,7 +1711,8 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
 		goto out_unlock;
 	}
 
-	wake_up_interruptible_sync(&u->peer_wait);
+	wake_up_interruptible_sync_poll(&u->peer_wait,
+					POLLOUT | POLLWRNORM | POLLWRBAND);
 
 	if (msg->msg_name)
 		unix_copy_addr(msg, skb->sk);



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH] af_unix: optimize unix_dgram_poll()
  2010-10-29 20:46     ` Davide Libenzi
  2010-10-29 21:05       ` Eric Dumazet
@ 2010-10-30  9:53       ` Eric Dumazet
  2010-10-30 17:45         ` Davide Libenzi
  1 sibling, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2010-10-30  9:53 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

On Friday 29 October 2010 at 13:46 -0700, Davide Libenzi wrote:

> Also, why not using the existing wait->key instead of adding a poll2()?

Indeed, if wait is not null, we have in wait->key the interest of the
poller. If a particular poll() function is expensive, it can test these
bits.

Thanks !

Note: I chose the 'goto skip_write' to make this patch really obvious.

[PATCH] af_unix: optimize unix_dgram_poll()

Checking POLLOUT status in unix_dgram_poll() is pretty expensive, because
it has to lock the socket to get its peer, take a reference on the peer
to check its receive queue status, and queue another poll_wait on
peer_wait. All of this can be avoided if the process calling
unix_dgram_poll() is not interested in POLLOUT status. This also makes
unix_dgram_recvmsg() faster, by not queueing irrelevant pollers on
peer_wait.

On a test program provided by Alban Crequy:

Before:

real    0m0.211s
user    0m0.000s
sys     0m0.208s

After:

real	0m0.044s
user	0m0.000s
sys	0m0.040s

Suggested-by: Davide Libenzi <davidel@xmailserver.org>
Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/unix/af_unix.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3c95304..dcb84fe 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2090,6 +2090,9 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 			return mask;
 	}
 
+	if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT)))
+		goto skip_write;
+
 	/* writable? */
 	writable = unix_writable(sk);
 	if (writable) {
@@ -2111,6 +2114,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	else
 		set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
+skip_write:
 	return mask;
 }
 




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-29 19:27 ` [PATCH 0/1] RFC: " Eric Dumazet
                     ` (2 preceding siblings ...)
  2010-10-30  6:44   ` [PATCH] af_unix: unix_write_space() use keyed wakeups Eric Dumazet
@ 2010-10-30 11:34   ` Alban Crequy
  2010-10-30 12:53     ` Eric Dumazet
  3 siblings, 1 reply; 27+ messages in thread
From: Alban Crequy @ 2010-10-30 11:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Stephen Hemminger, Cyrill Gorcunov,
	Alexey Dobriyan, netdev, linux-kernel, Pauli Nieminen,
	Rainer Weikusat, Davide Libenzi

On Fri, 29 Oct 2010 21:27:11 +0200,
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 29 octobre 2010 à 19:18 +0100, Alban Crequy a écrit :
> > Hi,
> > 
> > When a process calls the poll or select, the kernel calls (struct
> > file_operations)->poll on every file descriptor and returns a mask
> > of events which are ready. If the process is only interested by
> > POLLIN events, the mask is still computed for POLLOUT and it can be
> > expensive. For example, on Unix datagram sockets, a process running
> > poll() with POLLIN will wakes-up when the remote end call read().
> > This is a performance regression introduced when fixing another bug
> > by 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> > ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> > 
> > The attached program illustrates the problem. It compares the
> > performance of sending/receiving data on an Unix datagram socket and
> > select(). When the datagram sockets are not connected, the
> > performance problem is not triggered, but when they are connected
> > it becomes a lot slower. On my computer, I have the following time:
> > 
> > Connected datagram sockets: >4 seconds
> > Non-connected datagram sockets: <1 second
> > 
> > The patch attached in the next email fixes the performance problem:
> > it becomes <1 second for both cases. I am not suggesting the patch
> > for inclusion; I would like to change the prototype of (struct
> > file_operations)->poll instead of adding ->poll2. But there is a
> > lot of poll functions to change (grep tells me 337 functions).
> > 
> > Any opinions?
> 
> My opinion would be to use epoll() for this kind of workload.

I found a problem with epoll() using the following program. When
several datagram sockets are connected to the same server and the
receive queue is full, epoll(EPOLLOUT) wakes up only the emitter whose
skb was removed from the queue, and not all the emitters. That is
because sock_wfree() runs sk->sk_write_space() only for one emitter.

poll/select do not have this problem.

-----------------------

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* memset, strcpy */
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>   /* wait */
#include <time.h>
#include <sys/epoll.h>

int
main(int argc, char *argv[])
{

// cat /proc/sys/net/unix/max_dgram_qlen
#define NB_CLIENTS (10 + 2)

  int client_fds[NB_CLIENTS];
  int server_fd;
  struct sockaddr_un addr_server;
  int epollfd = epoll_create(10);
#define MAX_EVENTS 10
  struct epoll_event ev, events[MAX_EVENTS];
  int i;
  char buffer[1024];

  if (argc != 2)
    exit(1);

  int trigger = atoi(argv[1]);
  printf("trigger = %d\n", trigger);

  memset(&addr_server, 0, sizeof(addr_server));
  addr_server.sun_family = AF_UNIX;
  addr_server.sun_path[0] = '\0';
  strcpy(addr_server.sun_path + 1, "dgram_perfs_server");

  server_fd = socket(AF_UNIX, SOCK_DGRAM, 0);
  bind(server_fd, (struct sockaddr*)&addr_server, sizeof(addr_server));

  for (i = 0 ; i < NB_CLIENTS ; i++)
    {
      client_fds[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
    }

  ev.events = EPOLLOUT;
  ev.data.fd = client_fds[NB_CLIENTS-1];
  if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[NB_CLIENTS-1], &ev) == -1) {
      perror("epoll_ctl: client_fds max");
      exit(EXIT_FAILURE);
  }
  if (trigger == 0)
    {
      ev.events = EPOLLOUT;
      ev.data.fd = client_fds[0];
      if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[0], &ev) == -1) {
        perror("epoll_ctl: client_fds 0");
        exit(EXIT_FAILURE);
      }
    }

  for (i = 0 ; i < NB_CLIENTS ; i++)
    {
      connect(client_fds[i], (struct sockaddr*)&addr_server, sizeof(addr_server));
    }

  if (fork() > 0)
    {
      for (i = 0 ; i < NB_CLIENTS - 1 ; i++)
        sendto(client_fds[i], "S", 1, 0, (struct sockaddr*)&addr_server, sizeof(addr_server));
      printf("Everything sent successfully. Now epoll_wait...\n");

      epoll_wait(epollfd, events, MAX_EVENTS, -1);
      printf("epoll_wait works fine :-)\n");
      wait(NULL);
      exit(0);
    }

  sleep(1);

  printf("Receiving one buffer...\n");
  recv(server_fd, buffer, 1024, 0);
  printf("One buffer received\n");

  return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-30 11:34   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Alban Crequy
@ 2010-10-30 12:53     ` Eric Dumazet
  2010-10-30 13:17       ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2010-10-30 12:53 UTC (permalink / raw)
  To: Alban Crequy
  Cc: David S. Miller, Stephen Hemminger, Cyrill Gorcunov,
	Alexey Dobriyan, netdev, linux-kernel, Pauli Nieminen,
	Rainer Weikusat, Davide Libenzi

On Saturday 30 October 2010 at 12:34 +0100, Alban Crequy wrote:
> Le Fri, 29 Oct 2010 21:27:11 +0200,
> Eric Dumazet <eric.dumazet@gmail.com> a écrit :
> 
> > Le vendredi 29 octobre 2010 à 19:18 +0100, Alban Crequy a écrit :
> > > Hi,
> > > 
> > > When a process calls the poll or select, the kernel calls (struct
> > > file_operations)->poll on every file descriptor and returns a mask
> > > of events which are ready. If the process is only interested by
> > > POLLIN events, the mask is still computed for POLLOUT and it can be
> > > expensive. For example, on Unix datagram sockets, a process running
> > > poll() with POLLIN will wakes-up when the remote end call read().
> > > This is a performance regression introduced when fixing another bug
> > > by 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> > > ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> > > 
> > > The attached program illustrates the problem. It compares the
> > > performance of sending/receiving data on an Unix datagram socket and
> > > select(). When the datagram sockets are not connected, the
> > > performance problem is not triggered, but when they are connected
> > > it becomes a lot slower. On my computer, I have the following time:
> > > 
> > > Connected datagram sockets: >4 seconds
> > > Non-connected datagram sockets: <1 second
> > > 
> > > The patch attached in the next email fixes the performance problem:
> > > it becomes <1 second for both cases. I am not suggesting the patch
> > > for inclusion; I would like to change the prototype of (struct
> > > file_operations)->poll instead of adding ->poll2. But there is a
> > > lot of poll functions to change (grep tells me 337 functions).
> > > 
> > > Any opinions?
> > 
> > My opinion would be to use epoll() for this kind of workload.
> 
> I found a problem with epoll() with the following program. When there
> is several datagram sockets connected to the same server and the
> receiving queue is full, epoll(EPOLLOUT) wakes up only the emitter who
> has its skb removed from the queue, and not all the emitters. It is
> because sock_wfree() runs sk->sk_write_space() only for one emitter.
> 

I don't think this is the reason.

sock_wfree() really is good here, since it copes with one socket (the
one that sent the message).

The problem is the peer_wait queue, which epoll doesn't seem to be
plugged into.

The bug is in unix_dgram_poll().

It calls sock_poll_wait( ... &unix_sk(other)->peer_wait,) only if the
socket is 'writable'. It's a clear bug.

Could you try this patch?

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0ebc777..315716c 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2092,7 +2092,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 
 	/* writable? */
 	writable = unix_writable(sk);
-	if (writable) {
+	if (1 /*writable*/) {
 		other = unix_peer_get(sk);
 		if (other) {
 			if (unix_peer(other) != sk) {





^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/1] RFC: poll/select performance on datagram sockets
  2010-10-30 12:53     ` Eric Dumazet
@ 2010-10-30 13:17       ` Eric Dumazet
       [not found]         ` <20101030224703.065e70f6@chocolatine.cbg.collabora.co.uk>
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2010-10-30 13:17 UTC (permalink / raw)
  To: Alban Crequy
  Cc: David S. Miller, Stephen Hemminger, Cyrill Gorcunov,
	Alexey Dobriyan, netdev, linux-kernel, Pauli Nieminen,
	Rainer Weikusat, Davide Libenzi


> Problem is the peer_wait, that epoll doesnt seem to be plugged into.
> 
> Bug is in unix_dgram_poll()
> 
> It calls sock_poll_wait( ... &unix_sk(other)->peer_wait,) only if socket
> is 'writable'. Its a clear bug
> 
> Try this patch please ?
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 0ebc777..315716c 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2092,7 +2092,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  
>  	/* writable? */
>  	writable = unix_writable(sk);
> -	if (writable) {
> +	if (1 /*writable*/) {
>  		other = unix_peer_get(sk);
>  		if (other) {
>  			if (unix_peer(other) != sk) {
> 
> 
> 

Also, you'll need to change your program to make the epoll registrations
_after_ the sockets connect(), or else epoll() won't know about the other
peer's wait queue.

for (i = 0 ; i < NB_CLIENTS ; i++) {
	client_fds[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
}


for (i = 0 ; i < NB_CLIENTS ; i++) {
	connect(client_fds[i], (struct sockaddr*)&addr_server, sizeof(addr_server));
}

ev.events = EPOLLOUT;
ev.data.fd = client_fds[NB_CLIENTS-1];
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[NB_CLIENTS-1], &ev) == -1) {
      perror("epoll_ctl: client_fds max");
      exit(EXIT_FAILURE);
}
if (trigger == 0) {
      ev.events = EPOLLOUT;
      ev.data.fd = client_fds[0];
      if (epoll_ctl(epollfd, EPOLL_CTL_ADD, client_fds[0], &ev) == -1) {
        perror("epoll_ctl: client_fds 0");
        exit(EXIT_FAILURE);
      }
}




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] af_unix: unix_write_space() use keyed wakeups
  2010-10-30  6:44   ` [PATCH] af_unix: unix_write_space() use keyed wakeups Eric Dumazet
@ 2010-10-30 15:03     ` Davide Libenzi
  2010-11-08 21:44       ` David Miller
  2010-10-30 21:36     ` Alban Crequy
  1 sibling, 1 reply; 27+ messages in thread
From: Davide Libenzi @ 2010-10-30 15:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev, linux-kernel,
	Pauli Nieminen, Rainer Weikusat

On Sat, 30 Oct 2010, Eric Dumazet wrote:

> This patch is a specialization of commit 37e5540b3c9d (epoll keyed
> wakeups: make sockets use keyed wakeups).

Whoops, I must have missed AF_UNIX :)
Looks good to me.


- Davide



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] af_unix: optimize unix_dgram_poll()
  2010-10-30  9:53       ` [PATCH] af_unix: optimize unix_dgram_poll() Eric Dumazet
@ 2010-10-30 17:45         ` Davide Libenzi
  0 siblings, 0 replies; 27+ messages in thread
From: Davide Libenzi @ 2010-10-30 17:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Alban Crequy, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev,
	Linux Kernel Mailing List, Pauli Nieminen, Rainer Weikusat

[-- Attachment #1: Type: TEXT/PLAIN, Size: 740 bytes --]

On Sat, 30 Oct 2010, Eric Dumazet wrote:

> On Friday, 29 October 2010, at 13:46 -0700, Davide Libenzi wrote:
> 
> > Also, why not use the existing wait->key instead of adding a poll2()?
> 
> Indeed, if wait is not NULL, wait->key holds the events the poller is
> interested in. If a particular poll() function is expensive, it can test
> these bits.
> 
> Thanks !
> 
> Note: I chose the 'goto skip_write' to make this patch really obvious.

Plain agreement on the patch, and I understand the indentation-overflow
concerns, but why not ...

	/*
	 * No write status requested, avoid expensive OUT tests.
	 */
	if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT)))
		return mask;

The write-test code is the last one we do anyway.
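
For illustration, roughly where such a check would sit in unix_dgram_poll();
the surrounding checks are elided, and this is only a sketch, not the patch
that was applied:

static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
				    poll_table *wait)
{
	struct sock *sk = sock->sk;
	unsigned int mask = 0;

	sock_poll_wait(file, sk_sleep(sk), wait);

	/* ... POLLERR / POLLHUP / POLLIN checks as in the current code ... */

	/*
	 * The poller asked for no write events (wait->key carries the
	 * requested event bits): return before the expensive write-side
	 * and peer-queue checks.
	 */
	if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT)))
		return mask;

	/* ... unix_writable() / peer_wait / unix_recvq_full() checks ... */

	return mask;
}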


- Davide


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] af_unix: unix_write_space() use keyed wakeups
  2010-10-30  6:44   ` [PATCH] af_unix: unix_write_space() use keyed wakeups Eric Dumazet
  2010-10-30 15:03     ` Davide Libenzi
@ 2010-10-30 21:36     ` Alban Crequy
       [not found]       ` <1290554876.2158.5.camel@Nokia-N900-51-1>
  1 sibling, 1 reply; 27+ messages in thread
From: Alban Crequy @ 2010-10-30 21:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev, linux-kernel,
	Pauli Nieminen, Rainer Weikusat

On Sat, 30 Oct 2010 08:44:44 +0200,
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> We still loop on 800 items, on each wake_up_interruptible_sync_poll()
> call, so maybe we want to optimize this later, adding a global key
> ORing all items' keys. I don't think it's worth the added complexity,
> given the biased usage of your program (800 'listeners' to one
> event). Is it a real-life scenario?

Pauli Nieminen told me about his performance problem with select(), so I
wrote the test program, but I don't know exactly what the real-life
scenario is.

> [PATCH] af_unix: use keyed wakeups
> [PATCH] af_unix: optimize unix_dgram_poll()

Your two patches are good for me. In my opinion the performance improvement
they bring is good enough, so there is no need to add more complexity
unless we discover new problems.

I am preparing patches to implement multicast on Unix datagram and
seqpacket sockets, and they could potentially make things worse in
unix_dgram_poll() because it would need to check the receive queues of
all multicast members. So I want unix_dgram_poll() to be fast in the
first place before proposing further changes for multicast.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event
       [not found]         ` <20101030224703.065e70f6@chocolatine.cbg.collabora.co.uk>
@ 2010-10-31 15:36           ` Eric Dumazet
  2010-10-31 19:07             ` Davide Libenzi
  2010-11-08 21:44             ` David Miller
  0 siblings, 2 replies; 27+ messages in thread
From: Eric Dumazet @ 2010-10-31 15:36 UTC (permalink / raw)
  To: Alban Crequy, David Miller; +Cc: netdev, Davide Libenzi

On Saturday, 30 October 2010, at 22:47 +0100, Alban Crequy wrote:
> On Sat, 30 Oct 2010 15:17:58 +0200,
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > > The problem is the peer_wait queue, which epoll doesn't seem to be
> > > plugged into.
> > > 
> > > The bug is in unix_dgram_poll().
> > > 
> > > It calls sock_poll_wait(..., &unix_sk(other)->peer_wait, ...) only if
> > > the socket is 'writable'. It's a clear bug.
> 
> Yes, it looks weird...
> 
> > > Could you try this patch, please?
> 
> I will be away from my computer and will continue to work on this from
> the 20th of November.

OK, no problem. I tried it and it solves the problem. Here is an
official patch submission.

David, for your convenience, I cooked a patch series for the two patches
against af_unix's unix_dgram_poll().

The third patch (af_unix: unix_write_space() use keyed wakeups) can be
applied as is.

Thanks !

[PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event

Alban Crequy reported a problem with connected dgram af_unix sockets and
provided a test program. epoll() would fail to send an EPOLLOUT event
when a thread dequeues a packet from the other peer, making its receive
queue no longer full.

This is because unix_dgram_poll() fails to call sock_poll_wait(file,
&unix_sk(other)->peer_wait, wait)
if the socket is not writable at the time epoll_ctl(ADD) is called.

We must call sock_poll_wait() regardless of the 'writable' status, so that
epoll can be notified later of state changes.

Misc: avoid testing (sk->sk_shutdown & RCV_SHUTDOWN) twice.

Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/unix/af_unix.c |   24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0ebc777..7375131 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2072,13 +2072,12 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
 		mask |= POLLERR;
 	if (sk->sk_shutdown & RCV_SHUTDOWN)
-		mask |= POLLRDHUP;
+		mask |= POLLRDHUP | POLLIN | POLLRDNORM;
 	if (sk->sk_shutdown == SHUTDOWN_MASK)
 		mask |= POLLHUP;
 
 	/* readable? */
-	if (!skb_queue_empty(&sk->sk_receive_queue) ||
-	    (sk->sk_shutdown & RCV_SHUTDOWN))
+	if (!skb_queue_empty(&sk->sk_receive_queue))
 		mask |= POLLIN | POLLRDNORM;
 
 	/* Connection-based need to check for termination and startup */
@@ -2090,20 +2089,15 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 			return mask;
 	}
 
-	/* writable? */
 	writable = unix_writable(sk);
-	if (writable) {
-		other = unix_peer_get(sk);
-		if (other) {
-			if (unix_peer(other) != sk) {
-				sock_poll_wait(file, &unix_sk(other)->peer_wait,
-					  wait);
-				if (unix_recvq_full(other))
-					writable = 0;
-			}
-
-			sock_put(other);
+	other = unix_peer_get(sk);
+	if (other) {
+		if (unix_peer(other) != sk) {
+			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
+			if (unix_recvq_full(other))
+				writable = 0;
 		}
+		sock_put(other);
 	}
 
 	if (writable)



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event
  2010-10-31 15:36           ` [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event Eric Dumazet
@ 2010-10-31 19:07             ` Davide Libenzi
  2010-11-08 21:44             ` David Miller
  1 sibling, 0 replies; 27+ messages in thread
From: Davide Libenzi @ 2010-10-31 19:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alban Crequy, David Miller, netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3498 bytes --]

On Sun, 31 Oct 2010, Eric Dumazet wrote:

> On Saturday, 30 October 2010, at 22:47 +0100, Alban Crequy wrote:
> > On Sat, 30 Oct 2010 15:17:58 +0200,
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > > The problem is the peer_wait queue, which epoll doesn't seem to be
> > > > plugged into.
> > > > 
> > > > The bug is in unix_dgram_poll().
> > > > 
> > > > It calls sock_poll_wait(..., &unix_sk(other)->peer_wait, ...) only if
> > > > the socket is 'writable'. It's a clear bug.
> > 
> > Yes, it looks weird...
> > 
> > > > Could you try this patch, please?
> > 
> > I will be away from my computer and will continue to work on this from
> > the 20th of November.
> 
> OK, no problem. I tried it and it solves the problem. Here is an
> official patch submission.
> 
> David, for your convenience, I cooked a patch series for the two patches
> against af_unix's unix_dgram_poll().

Looks good to me...

Acked-by: Davide Libenzi <davidel@xmailserver.org>




> The third patch (af_unix: unix_write_space() use keyed wakeups) can be
> applied as is.
> 
> Thanks !
> 
> [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event
> 
> Alban Crequy reported a problem with connected dgram af_unix sockets and
> provided a test program. epoll() would fail to send an EPOLLOUT event
> when a thread dequeues a packet from the other peer, making its receive
> queue no longer full.
> 
> This is because unix_dgram_poll() fails to call sock_poll_wait(file,
> &unix_sk(other)->peer_wait, wait)
> if the socket is not writable at the time epoll_ctl(ADD) is called.
> 
> We must call sock_poll_wait() regardless of the 'writable' status, so that
> epoll can be notified later of state changes.
> 
> Misc: avoid testing (sk->sk_shutdown & RCV_SHUTDOWN) twice.
> 
> Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/unix/af_unix.c |   24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 0ebc777..7375131 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2072,13 +2072,12 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  	if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
>  		mask |= POLLERR;
>  	if (sk->sk_shutdown & RCV_SHUTDOWN)
> -		mask |= POLLRDHUP;
> +		mask |= POLLRDHUP | POLLIN | POLLRDNORM;
>  	if (sk->sk_shutdown == SHUTDOWN_MASK)
>  		mask |= POLLHUP;
>  
>  	/* readable? */
> -	if (!skb_queue_empty(&sk->sk_receive_queue) ||
> -	    (sk->sk_shutdown & RCV_SHUTDOWN))
> +	if (!skb_queue_empty(&sk->sk_receive_queue))
>  		mask |= POLLIN | POLLRDNORM;
>  
>  	/* Connection-based need to check for termination and startup */
> @@ -2090,20 +2089,15 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  			return mask;
>  	}
>  
> -	/* writable? */
>  	writable = unix_writable(sk);
> -	if (writable) {
> -		other = unix_peer_get(sk);
> -		if (other) {
> -			if (unix_peer(other) != sk) {
> -				sock_poll_wait(file, &unix_sk(other)->peer_wait,
> -					  wait);
> -				if (unix_recvq_full(other))
> -					writable = 0;
> -			}
> -
> -			sock_put(other);
> +	other = unix_peer_get(sk);
> +	if (other) {
> +		if (unix_peer(other) != sk) {
> +			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
> +			if (unix_recvq_full(other))
> +				writable = 0;
>  		}
> +		sock_put(other);
>  	}
>  
>  	if (writable)
> 


- Davide


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] af_unix: unix_write_space() use keyed wakeups
  2010-10-30 15:03     ` Davide Libenzi
@ 2010-11-08 21:44       ` David Miller
  0 siblings, 0 replies; 27+ messages in thread
From: David Miller @ 2010-11-08 21:44 UTC (permalink / raw)
  To: davidel
  Cc: eric.dumazet, alban.crequy, shemminger, gorcunov, adobriyan,
	netdev, linux-kernel, pauli.nieminen, rweikusat

From: Davide Libenzi <davidel@xmailserver.org>
Date: Sat, 30 Oct 2010 08:03:15 -0700 (PDT)

> On Sat, 30 Oct 2010, Eric Dumazet wrote:
> 
>> This patch is a specialization of commit 37e5540b3c9d (epoll keyed
>> wakeups: make sockets use keyed wakeups).
> 
> Whoops, I must have missed AF_UNIX :)
> Looks good to me.

Applied.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event
  2010-10-31 15:36           ` [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event Eric Dumazet
  2010-10-31 19:07             ` Davide Libenzi
@ 2010-11-08 21:44             ` David Miller
  1 sibling, 0 replies; 27+ messages in thread
From: David Miller @ 2010-11-08 21:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alban.crequy, netdev, davidel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 31 Oct 2010 16:36:23 +0100

> [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event
> 
> Alban Crequy reported a problem with connected dgram af_unix sockets and
> provided a test program. epoll() would fail to send an EPOLLOUT event
> when a thread dequeues a packet from the other peer, making its receive
> queue no longer full.
> 
> This is because unix_dgram_poll() fails to call sock_poll_wait(file,
> &unix_sk(other)->peer_wait, wait)
> if the socket is not writable at the time epoll_ctl(ADD) is called.
> 
> We must call sock_poll_wait() regardless of the 'writable' status, so that
> epoll can be notified later of state changes.
> 
> Misc: avoid testing (sk->sk_shutdown & RCV_SHUTDOWN) twice.
> 
> Reported-by: Alban Crequy <alban.crequy@collabora.co.uk>
> Cc: Davide Libenzi <davidel@xmailserver.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] af_unix: unix_write_space() use keyed wakeups
       [not found]       ` <1290554876.2158.5.camel@Nokia-N900-51-1>
@ 2010-11-24  0:20         ` Alban Crequy
  2010-11-24  0:28           ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Alban Crequy @ 2010-11-24  0:20 UTC (permalink / raw)
  To: Pauli Nieminen
  Cc: Eric Dumazet, Davide Libenzi, David S. Miller, Stephen Hemminger,
	Cyrill Gorcunov, Alexey Dobriyan, netdev, linux-kernel,
	Rainer Weikusat

On Wed, 24 Nov 2010 01:27:56 +0200,
Pauli Nieminen <pauli.nieminen@collabora.co.uk> wrote:

> ----- Original message -----
> > On Sat, 30 Oct 2010 08:44:44 +0200,
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > We still loop on 800 items, on each
> > > wake_up_interruptible_sync_poll() call, so maybe we want to
> > > optimize this later, adding a global key ORing all items' keys. I
> > > don't think it's worth the added complexity, given the biased usage
> > > of your program (800 'listeners' to one event). Is it a real-life
> > > scenario?
> > 
> > Pauli Nieminen told me about his performance problem with select(), so
> > I wrote the test program, but I don't know exactly what the real-life
> > scenario is.
> > 
> 
> The real-world scenario is an X server that has tens of client connections
> to manage. When the X server is happily sleeping in a select() call, a
> client reading events/replies from the server triggers an in-kernel loop
> over all of the X server's fds. The X server isn't usually waiting for a
> socket to become writable, so the kernel just schedules back to the client.

But are they SOCK_STREAM or SOCK_DGRAM sockets? The patches fix the
performance problem for SOCK_DGRAM. If the X server scenario uses
SOCK_STREAM sockets, your problem is probably still unfixed.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] af_unix: unix_write_space() use keyed wakeups
  2010-11-24  0:20         ` Alban Crequy
@ 2010-11-24  0:28           ` Eric Dumazet
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2010-11-24  0:28 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Pauli Nieminen, Davide Libenzi, David S. Miller,
	Stephen Hemminger, Cyrill Gorcunov, Alexey Dobriyan, netdev,
	linux-kernel, Rainer Weikusat

On Wednesday, 24 November 2010, at 00:20 +0000, Alban Crequy wrote:
> On Wed, 24 Nov 2010 01:27:56 +0200,
> Pauli Nieminen <pauli.nieminen@collabora.co.uk> wrote:
> 
> > ----- Original message -----
> > > On Sat, 30 Oct 2010 08:44:44 +0200,
> > > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > 
> > > > We still loop on 800 items, on each
> > > > wake_up_interruptible_sync_poll() call, so maybe we want to
> > > > optimize this later, adding a global key ORing all items' keys. I
> > > > don't think it's worth the added complexity, given the biased usage
> > > > of your program (800 'listeners' to one event). Is it a real-life
> > > > scenario?
> > > 
> > > Pauli Nieminen told me about his performance problem with select(), so
> > > I wrote the test program, but I don't know exactly what the real-life
> > > scenario is.
> > > 
> > 
> > The real-world scenario is an X server that has tens of client connections
> > to manage. When the X server is happily sleeping in a select() call, a
> > client reading events/replies from the server triggers an in-kernel loop
> > over all of the X server's fds. The X server isn't usually waiting for a
> > socket to become writable, so the kernel just schedules back to the client.
> 
> But are they SOCK_STREAM or SOCK_DGRAM sockets? The patches fix the
> performance problem for SOCK_DGRAM. If the X server scenario uses
> SOCK_STREAM sockets, your problem is probably still unfixed.
> 


It shouldn't matter, should it?

commit 67426b756c4d52c51 (af_unix: use keyed wakeups) makes
unix_write_space() call wake_up_interruptible_sync_poll() instead of
wake_up_interruptible_sync().

So it should be fixed for both STREAM and DGRAM sockets?
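
For reference, roughly what unix_write_space() looks like with that change;
this is a sketch reconstructed from memory, not copied from the commit:

static void unix_write_space(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	if (unix_writable(sk)) {
		wq = rcu_dereference(sk->sk_wq);
		if (wq_has_sleeper(wq))
			/* keyed wakeup: only wake waiters interested in
			 * write events, instead of every waiter */
			wake_up_interruptible_sync_poll(&wq->wait,
				POLLOUT | POLLWRNORM | POLLWRBAND);
		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
	}
	rcu_read_unlock();
}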




^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2010-11-24  0:28 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-29 18:18 [PATCH 0/1] RFC: poll/select performance on datagram sockets Alban Crequy
2010-10-29 18:18 ` Alban Crequy
2010-10-29 18:21 ` [PATCH] " Alban Crequy
2010-10-29 19:27 ` [PATCH 0/1] RFC: " Eric Dumazet
2010-10-29 20:08   ` Davide Libenzi
2010-10-29 20:20     ` Eric Dumazet
2010-10-29 20:46     ` Davide Libenzi
2010-10-29 21:05       ` Eric Dumazet
2010-10-29 21:57         ` Davide Libenzi
2010-10-29 22:08           ` Eric Dumazet
2010-10-30  9:53       ` [PATCH] af_unix: optimize unix_dgram_poll() Eric Dumazet
2010-10-30 17:45         ` Davide Libenzi
2010-10-29 20:20   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Jesper Juhl
2010-10-29 20:40     ` David Miller
2010-10-29 20:45       ` Eric Dumazet
2010-10-30  6:44   ` [PATCH] af_unix: unix_write_space() use keyed wakeups Eric Dumazet
2010-10-30 15:03     ` Davide Libenzi
2010-11-08 21:44       ` David Miller
2010-10-30 21:36     ` Alban Crequy
     [not found]       ` <1290554876.2158.5.camel@Nokia-N900-51-1>
2010-11-24  0:20         ` Alban Crequy
2010-11-24  0:28           ` Eric Dumazet
2010-10-30 11:34   ` [PATCH 0/1] RFC: poll/select performance on datagram sockets Alban Crequy
2010-10-30 12:53     ` Eric Dumazet
2010-10-30 13:17       ` Eric Dumazet
     [not found]         ` <20101030224703.065e70f6@chocolatine.cbg.collabora.co.uk>
2010-10-31 15:36           ` [PATCH 1/2] af_unix: fix unix_dgram_poll() behavior for EPOLLOUT event Eric Dumazet
2010-10-31 19:07             ` Davide Libenzi
2010-11-08 21:44             ` David Miller
