linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 net-next 0/1] net: lls select poll support
@ 2013-06-19 10:04 Eliezer Tamir
  2013-06-19 10:04 ` [PATCH v3 net-next] net: poll/select low latency socket support Eliezer Tamir
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Eliezer Tamir @ 2013-06-19 10:04 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, netdev, Jesse Brandeburg, Don Skidmore,
	e1000-devel, Willem de Bruijn, Eric Dumazet, Ben Hutchings,
	Andi Kleen, HPA, Eilon Greenstien, Or Gerlitz, Amir Vadai,
	Alex Rosenbaum, Eliezer Tamir

David,

Here is a rework of the select/poll patch.

One question: do we need in sock_poll() to test that sock->sk is not null?
(Thanks to Willem de Bruijn for pointing this out.) 

When select or poll is used on a lot of sockets, the sysctl value
needs to be set higher than 50. For 300 sockets a setting of 100 works
for me. For 1000 sockets a setting of 200 works well, but the gain is
very small and probably not worth it.

I should mention that unlike the version we had in v9, with this version
of the patch, LLS always performs better than no LLS. 

In this version I split the sysctl entries into two.
sysctl.net.core.low_latency_read for blocking socket reads.
sysctl.net.core.low_latency_poll for select/poll.

The main reason we need two sysctl entries is that the values are different.
For socket reads 50us is plenty of time but when polling on 300 sockets
we need about 100us.

This also allows poll to busy poll only on some sockets.
You set sysctl.net.core.low_latency_poll, but leave the read value 0.
Then you use the socket option to enable SO_LL only on the sockets
you would like to busy poll on.
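
For illustration, a minimal userspace sketch of that setup (the SO_LL
value and the fallback define below are assumptions taken from the
earlier LLS socket-option patches, not something this patch adds):

	#include <sys/socket.h>

	#ifndef SO_LL
	#define SO_LL 46	/* assumed value from the LLS socket-option patch */
	#endif

	/* Enable busy polling on this socket only, for roughly "usecs" us.
	 * sysctl.net.core.low_latency_read can stay 0 so other sockets are
	 * not affected.
	 */
	static int enable_busy_poll(int fd, unsigned int usecs)
	{
		return setsockopt(fd, SOL_SOCKET, SO_LL, &usecs, sizeof(usecs));
	}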

Maybe what we really want is a cgroup that would allow settings
per-process group.

-Eliezer

Change log:
v3
- split sysctl value into two separate ones, one for read and one for poll.
- updated Documentation/sysctl/net.txt
- fixed possible overflow (again) reported by Eric Dumazet.
- poll/select loop optimizations by Eric Dumazet and Willem de Bruijn.

v2
- added POLL_LL flag, used to signal between the syscalls and sock_poll().
- add a separate ll_end_time() function to be used from poll.
- slight reorder of sk_poll_ll so the timing primitives are not used
  when called from sock_poll().
- select/poll stop busy polling as soon as there is something to return
  to the user.

Change log from the original LLS patch series:
v9
- better mask testing in sock_poll(), reported by Eric Dumazet.

v8
- split out udp and select/poll into separate patches.
  what used to be patch 2/5 is now three patches.

v5
- added simple poll/select support



* [PATCH v3 net-next] net: poll/select low latency socket support
  2013-06-19 10:04 [PATCH v3 net-next 0/1] net: lls select poll support Eliezer Tamir
@ 2013-06-19 10:04 ` Eliezer Tamir
  2013-06-19 11:42 ` [PATCH RFC] net: lls epoll support Eliezer Tamir
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Eliezer Tamir @ 2013-06-19 10:04 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, netdev, Jesse Brandeburg, Don Skidmore,
	e1000-devel, Willem de Bruijn, Eric Dumazet, Ben Hutchings,
	Andi Kleen, HPA, Eilon Greenstien, Or Gerlitz, Amir Vadai,
	Alex Rosenbaum, Eliezer Tamir

select/poll busy-poll support.

Split the sysctl value into two separate ones, one for read and one for poll.
Updated Documentation/sysctl/net.txt.

Add a new poll flag POLL_LL. When this flag is set, sock_poll() will call
sk_poll_ll() if possible. sock_poll() sets this flag in its return value
to indicate to select/poll when a socket that can busy poll is found.

When poll/select have nothing to report, call the low-level
sock_poll() again until we are out of time or we find something.

Once the system call finds something, it stops setting POLL_LL, so it can
return the result to the user ASAP.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
---

 Documentation/sysctl/net.txt    |   18 ++++++++++++++++--
 fs/select.c                     |   34 +++++++++++++++++++++++++++++-----
 include/net/ll_poll.h           |   34 ++++++++++++++++++++++------------
 include/uapi/asm-generic/poll.h |    2 ++
 net/core/sock.c                 |    2 +-
 net/core/sysctl_net_core.c      |    8 ++++++++
 net/socket.c                    |   15 ++++++++++++++-
 7 files changed, 92 insertions(+), 21 deletions(-)

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 5369879..e658bbf 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -50,13 +50,27 @@ The maximum number of packets that kernel can handle on a NAPI interrupt,
 it's a Per-CPU variable.
 Default: 64
 
-low_latency_poll
+low_latency_read
 ----------------
-Low latency busy poll timeout. (needs CONFIG_NET_LL_RX_POLL)
+Low latency busy poll timeout for socket reads. (needs CONFIG_NET_LL_RX_POLL)
 Approximate time in us to spin waiting for packets on the device queue.
+This sets the default value of the SO_LL socket option.
+Can be set or overridden per socket by setting socket option SO_LL.
 Recommended value is 50. May increase power usage.
 Default: 0 (off)
 
+low_latency_poll
+----------------
+Low latency busy poll timeout for poll and select. (needs CONFIG_NET_LL_RX_POLL)
+Approximate time in us to spin waiting for packets on the device queue.
+Recommended value depends on the number of sockets you poll on.
+For several sockets 50, for several hundred 100.
+For more than that you probably want to use epoll.
+Note that only sockets with SO_LL set will be busy polled, so you want to either
+selectively set SO_LL on those sockets or set sysctl.net.core.low_latency_read globally.
+May increase power usage.
+Default: 0 (off)
+
 rmem_default
 ------------
 
diff --git a/fs/select.c b/fs/select.c
index 8c1c96c..4698b0a 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -27,6 +27,7 @@
 #include <linux/rcupdate.h>
 #include <linux/hrtimer.h>
 #include <linux/sched/rt.h>
+#include <net/ll_poll.h>
 
 #include <asm/uaccess.h>
 
@@ -384,9 +385,10 @@ get_max:
 #define POLLEX_SET (POLLPRI)
 
 static inline void wait_key_set(poll_table *wait, unsigned long in,
-				unsigned long out, unsigned long bit)
+				unsigned long out, unsigned long bit,
+				unsigned int ll_flag)
 {
-	wait->_key = POLLEX_SET;
+	wait->_key = POLLEX_SET | ll_flag;
 	if (in & bit)
 		wait->_key |= POLLIN_SET;
 	if (out & bit)
@@ -400,6 +402,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 	poll_table *wait;
 	int retval, i, timed_out = 0;
 	unsigned long slack = 0;
+	unsigned int ll_flag = POLL_LL;
+	u64 ll_time = ll_end_time();
 
 	rcu_read_lock();
 	retval = max_select_fd(n, fds);
@@ -422,6 +426,7 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 	retval = 0;
 	for (;;) {
 		unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
+		bool can_ll = false;
 
 		inp = fds->in; outp = fds->out; exp = fds->ex;
 		rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
@@ -449,7 +454,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 					f_op = f.file->f_op;
 					mask = DEFAULT_POLLMASK;
 					if (f_op && f_op->poll) {
-						wait_key_set(wait, in, out, bit);
+						wait_key_set(wait, in, out,
+							     bit, ll_flag);
 						mask = (*f_op->poll)(f.file, wait);
 					}
 					fdput(f);
@@ -468,6 +474,11 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 						retval++;
 						wait->_qproc = NULL;
 					}
+					if (mask & POLL_LL)
+						can_ll = true;
+					/* got something, stop busy polling */
+					if (retval)
+						ll_flag = 0;
 				}
 			}
 			if (res_in)
@@ -486,6 +497,9 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 			break;
 		}
 
+		if (can_ll && can_poll_ll(ll_time))
+			continue;
+
 		/*
 		 * If this is the first loop and we have a timeout
 		 * given, then we convert to ktime_t and set the to
@@ -717,7 +731,8 @@ struct poll_list {
  * pwait poll_table will be used by the fd-provided poll handler for waiting,
  * if pwait->_qproc is non-NULL.
  */
-static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
+static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait,
+					bool *can_ll, unsigned int ll_flag)
 {
 	unsigned int mask;
 	int fd;
@@ -731,7 +746,10 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
 			mask = DEFAULT_POLLMASK;
 			if (f.file->f_op && f.file->f_op->poll) {
 				pwait->_key = pollfd->events|POLLERR|POLLHUP;
+				pwait->_key |= ll_flag;
 				mask = f.file->f_op->poll(f.file, pwait);
+				if (mask & POLL_LL)
+					*can_ll = true;
 			}
 			/* Mask out unneeded events. */
 			mask &= pollfd->events | POLLERR | POLLHUP;
@@ -750,6 +768,8 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 	ktime_t expire, *to = NULL;
 	int timed_out = 0, count = 0;
 	unsigned long slack = 0;
+	unsigned int ll_flag = POLL_LL;
+	u64 ll_time = ll_end_time();
 
 	/* Optimise the no-wait case */
 	if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
@@ -762,6 +782,7 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 
 	for (;;) {
 		struct poll_list *walk;
+		bool can_ll = false;
 
 		for (walk = list; walk != NULL; walk = walk->next) {
 			struct pollfd * pfd, * pfd_end;
@@ -776,9 +797,10 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 				 * this. They'll get immediately deregistered
 				 * when we break out and return.
 				 */
-				if (do_pollfd(pfd, pt)) {
+				if (do_pollfd(pfd, pt, &can_ll, ll_flag)) {
 					count++;
 					pt->_qproc = NULL;
+					ll_flag = 0;
 				}
 			}
 		}
@@ -795,6 +817,8 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 		if (count || timed_out)
 			break;
 
+		if (can_ll && can_poll_ll(ll_time))
+			continue;
 		/*
 		 * If this is the first loop and we have a timeout
 		 * given, then we convert to ktime_t and set the to
diff --git a/include/net/ll_poll.h b/include/net/ll_poll.h
index fcc7c36..c4f96bf 100644
--- a/include/net/ll_poll.h
+++ b/include/net/ll_poll.h
@@ -30,6 +30,7 @@
 #ifdef CONFIG_NET_LL_RX_POLL
 
 struct napi_struct;
+extern unsigned int sysctl_net_ll_read __read_mostly;
 extern unsigned int sysctl_net_ll_poll __read_mostly;
 
 /* return values from ndo_ll_poll */
@@ -38,17 +39,18 @@ extern unsigned int sysctl_net_ll_poll __read_mostly;
 
 /* we can use sched_clock() because we don't care much about precision
  * we only care that the average is bounded
+ * we don't mind a ~2.5% imprecision so <<10 instead of *1000
+ * sk->sk_ll_usec is a u_int so this can't overflow
  */
-static inline u64 ll_end_time(struct sock *sk)
+static inline u64 ll_sk_end_time(struct sock *sk)
 {
-	u64 end_time = ACCESS_ONCE(sk->sk_ll_usec);
-
-	/* we don't mind a ~2.5% imprecision
-	 * sk->sk_ll_usec is a u_int so this can't overflow
-	 */
-	end_time = (end_time << 10) + sched_clock();
+	return ((u64)ACCESS_ONCE(sk->sk_ll_usec) << 10) + sched_clock();
+}
 
-	return end_time;
+/* in poll/select we use the global sysctl_net_ll_poll value */
+static inline u64 ll_end_time(void)
+{
+	return ((u64)ACCESS_ONCE(sysctl_net_ll_poll) << 10) + sched_clock();
 }
 
 static inline bool sk_valid_ll(struct sock *sk)
@@ -62,10 +64,13 @@ static inline bool can_poll_ll(u64 end_time)
 	return !time_after64(sched_clock(), end_time);
 }
 
+/* when used in sock_poll() nonblock is known at compile time to be true
+ * so the loop and end_time will be optimized out
+ */
 static inline bool sk_poll_ll(struct sock *sk, int nonblock)
 {
+	u64 end_time = nonblock ? 0 : ll_sk_end_time(sk);
 	const struct net_device_ops *ops;
-	u64 end_time = ll_end_time(sk);
 	struct napi_struct *napi;
 	int rc = false;
 
@@ -95,8 +100,8 @@ static inline bool sk_poll_ll(struct sock *sk, int nonblock)
 			NET_ADD_STATS_BH(sock_net(sk),
 					 LINUX_MIB_LOWLATENCYRXPACKETS, rc);
 
-	} while (skb_queue_empty(&sk->sk_receive_queue)
-			&& can_poll_ll(end_time) && !nonblock);
+	} while (!nonblock && skb_queue_empty(&sk->sk_receive_queue)
+			&& can_poll_ll(end_time));
 
 	rc = !skb_queue_empty(&sk->sk_receive_queue);
 out:
@@ -118,7 +123,12 @@ static inline void sk_mark_ll(struct sock *sk, struct sk_buff *skb)
 
 #else /* CONFIG_NET_LL_RX_POLL */
 
-static inline u64 ll_end_time(struct sock *sk)
+static inline u64 ll_sk_end_time(struct sock *sk)
+{
+	return 0;
+}
+
+static inline u64 ll_end_time(void)
 {
 	return 0;
 }
diff --git a/include/uapi/asm-generic/poll.h b/include/uapi/asm-generic/poll.h
index 9ce7f44..4aee586 100644
--- a/include/uapi/asm-generic/poll.h
+++ b/include/uapi/asm-generic/poll.h
@@ -30,6 +30,8 @@
 
 #define POLLFREE	0x4000	/* currently only for epoll */
 
+#define POLL_LL		0x8000
+
 struct pollfd {
 	int fd;
 	short events;
diff --git a/net/core/sock.c b/net/core/sock.c
index 1e744b1..b6c619f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2307,7 +2307,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 #ifdef CONFIG_NET_LL_RX_POLL
 	sk->sk_napi_id		=	0;
-	sk->sk_ll_usec		=	sysctl_net_ll_poll;
+	sk->sk_ll_usec		=	sysctl_net_ll_read;
 #endif
 
 	/*
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 62702c2..afc677e 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -306,6 +306,13 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "low_latency_read",
+		.data		= &sysctl_net_ll_read,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 #endif
 #endif /* CONFIG_NET */
 	{
diff --git a/net/socket.c b/net/socket.c
index 3eec3f7..1eed3c0 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -107,6 +107,7 @@
 #include <net/ll_poll.h>
 
 #ifdef CONFIG_NET_LL_RX_POLL
+unsigned int sysctl_net_ll_read __read_mostly;
 unsigned int sysctl_net_ll_poll __read_mostly;
 #endif
 
@@ -1147,13 +1148,25 @@ EXPORT_SYMBOL(sock_create_lite);
 /* No kernel lock held - perfect */
 static unsigned int sock_poll(struct file *file, poll_table *wait)
 {
+	unsigned int ll_flag = 0;
 	struct socket *sock;
 
 	/*
 	 *      We can't return errors to poll, so it's either yes or no.
 	 */
 	sock = file->private_data;
-	return sock->ops->poll(file, sock, wait);
+
+	if (sk_valid_ll(sock->sk)) {
+
+		/* this socket can poll_ll so tell the system call */
+		ll_flag = POLL_LL;
+
+		/* only if requested by syscall */
+		if (wait && (wait->_key & POLL_LL))
+			sk_poll_ll(sock->sk, 1);
+	}
+
+	return ll_flag | sock->ops->poll(file, sock, wait);
 }
 
 static int sock_mmap(struct file *file, struct vm_area_struct *vma)



* [PATCH RFC] net: lls epoll support
  2013-06-19 10:04 [PATCH v3 net-next 0/1] net: lls select poll support Eliezer Tamir
  2013-06-19 10:04 ` [PATCH v3 net-next] net: poll/select low latency socket support Eliezer Tamir
@ 2013-06-19 11:42 ` Eliezer Tamir
  2013-06-25 14:26   ` yaniv saar
  2013-06-19 12:13 ` [PATCH v3 net-next 0/1] net: lls select poll support Eric Dumazet
  2013-06-24  1:44 ` David Miller
  3 siblings, 1 reply; 8+ messages in thread
From: Eliezer Tamir @ 2013-06-19 11:42 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, netdev, Jesse Brandeburg, Don Skidmore,
	e1000-devel, Willem de Bruijn, Eric Dumazet, Ben Hutchings,
	Andi Kleen, HPA, Eilon Greenstien, Or Gerlitz, Amir Vadai,
	Alex Rosenbaum, Eliezer Tamir

This is a wild hack, just as a POC to show the power of LLS with epoll.

We assume that we only ever need to poll on one device queue,
so the first FD that reports POLL_LL gets saved aside so we can poll on it.

While this assumption is wrong in so many ways, it's very easy to 
satisfy with a micro-benchmark.

[this patch needs the poll patch to be applied first]
with sockperf doing epoll on 1000 sockets I see an avg latency of 6us

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
---

  fs/eventpoll.c |   39 +++++++++++++++++++++++++++++++++------
  1 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index deecc72..3c7562b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -41,6 +41,7 @@
  #include <linux/proc_fs.h>
  #include <linux/seq_file.h>
  #include <linux/compat.h>
+#include <net/ll_poll.h>

  /*
   * LOCKING:
@@ -214,6 +215,7 @@ struct eventpoll {
  	/* used to optimize loop detection check */
  	int visited;
  	struct list_head visited_list_link;
+	struct epitem *ll_epi;
  };

  /* Wait structure used by the poll hooks */
@@ -773,13 +775,30 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file)
  	return 0;
  }

-static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
+static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt, struct eventpoll *ep)
  {
+	unsigned int events = epi->ffd.file->f_op->poll(epi->ffd.file, pt);
  	pt->_key = epi->event.events;

-	return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
+	if (events & POLL_LL) {
+		events &= ~POLL_LL;
+		ep->ll_epi = epi;
+	}
+
+	return events & epi->event.events;
+}
+
+static inline bool ep_item_poll_ll(struct epitem *epi)
+{
+	poll_table wait;
+
+	wait._key = POLL_LL;
+	wait._qproc = NULL;
+
+	return epi->ffd.file->f_op->poll(epi->ffd.file, &wait);
  }

+
  static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
  			       void *priv)
  {
@@ -789,7 +808,7 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
  	init_poll_funcptr(&pt, NULL);

  	list_for_each_entry_safe(epi, tmp, head, rdllink) {
-		if (ep_item_poll(epi, &pt))
+		if (ep_item_poll(epi, &pt, ep))
  			return POLLIN | POLLRDNORM;
  		else {
  			/*
@@ -1271,7 +1290,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
  	 * this operation completes, the poll callback can start hitting
  	 * the new item.
  	 */
-	revents = ep_item_poll(epi, &epq.pt);
+	revents = ep_item_poll(epi, &epq.pt, ep);

  	/*
  	 * We have to check if something went wrong during the poll wait queue
@@ -1403,7 +1422,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_event *event)
  	 * Get current event bits. We can safely use the file* here because
  	 * its usage count has been increased by the caller of this function.
  	 */
-	revents = ep_item_poll(epi, &pt);
+	revents = ep_item_poll(epi, &pt, ep);

  	/*
  	 * If the item is "hot" and it is not registered inside the ready
@@ -1471,7 +1490,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,

  		list_del_init(&epi->rdllink);

-		revents = ep_item_poll(epi, &pt);
+		revents = ep_item_poll(epi, &pt, ep);

  		/*
  		 * If the event mask intersect the caller-requested one,
@@ -1558,6 +1577,10 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
  	long slack = 0;
  	wait_queue_t wait;
  	ktime_t expires, *to = NULL;
+	cycles_t ll_time = ll_end_time();
+	//bool try_ll = true;
+	bool can_ll = !!ep->ll_epi;
+

  	if (timeout > 0) {
  		struct timespec end_time = ep_set_mstimeout(timeout);
@@ -1601,6 +1624,10 @@ fetch_events:
  				break;
  			}

+			while (can_ll && can_poll_ll(ll_time)
+					&& !ep_events_available(ep))
+				ep_item_poll_ll(ep->ll_epi);
+
  			spin_unlock_irqrestore(&ep->lock, flags);
  			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
  				timed_out = 1;




* Re: [PATCH v3 net-next 0/1] net: lls select poll support
  2013-06-19 10:04 [PATCH v3 net-next 0/1] net: lls select poll support Eliezer Tamir
  2013-06-19 10:04 ` [PATCH v3 net-next] net: poll/select low latency socket support Eliezer Tamir
  2013-06-19 11:42 ` [PATCH RFC] net: lls epoll support Eliezer Tamir
@ 2013-06-19 12:13 ` Eric Dumazet
  2013-06-24  1:44 ` David Miller
  3 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2013-06-19 12:13 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: David Miller, linux-kernel, netdev, Jesse Brandeburg,
	Don Skidmore, e1000-devel, Willem de Bruijn, Ben Hutchings,
	Andi Kleen, HPA, Eilon Greenstien, Or Gerlitz, Amir Vadai,
	Alex Rosenbaum, Eliezer Tamir

On Wed, 2013-06-19 at 13:04 +0300, Eliezer Tamir wrote:

> One question: do we need in sock_poll() to test that sock->sk is not null?
> (Thanks to Willem de Bruijn for pointing this out.) 

Why would sock->sk be NULL in sock_poll()?

This is the thing I am not sure about [1]

Normally, sock_poll() should be called only if the current thread owns a
reference on the file. And we should not dismantle sock->sk while a
reference is owned.

Apparently, epoll() might break this assumption, but I have not been able
to track it down yet. epoll() doesn't hold a file refcount, because it
relies on eventpoll_release_file() being called from __fput().


[1] We do have some crash occurrences here on some Google machines
(Google-Bug-id: 8940884), because tcp_poll() hits a NULL sock->sk

<4>[206994.855824] RIP: 0010:[<ffffffff8026e2e8>]  [<ffffffff8026e2e8>] tcp_poll+0x18/0x1b0
<4>[206994.863652] RSP: 0018:ffff8806c7921e48  EFLAGS: 00010246
<4>[206994.869038] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8806c7921f10
<4>[206994.876241] RDX: 0000000000000000 RSI: ffff880292d52300 RDI: ffff88034a8837c0
<4>[206994.883443] RBP: ffff8806c7921e58 R08: 00007f7cadc3c000 R09: ffff88034a8837c0
<4>[206994.890648] R10: 00000000fffffff2 R11: ffffffff8088ffc0 R12: ffff880699960d98
<4>[206994.897850] R13: 0000000000000000 R14: ffff8804924dab58 R15: 0000000000002000
<4>[206994.905055] FS:  00007f7c98726700(0000) GS:ffff88027fc80000(0000) knlGS:00000000f4cf8b40
<4>[206994.913210] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[206994.919028] CR2: 00000000000000c0 CR3: 0000000494ab2000 CR4: 00000000000006e0
<4>[206994.926233] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[206994.933437] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[206994.940640] Process java (pid: 8736, threadinfo ffff8806c7920000, task ffff8806990d0400)
<4>[206994.948795] Stack:
<4>[206994.950890]  0000000000000000 ffff880699960d40 ffff8806c7921e68 ffffffff802604da
<4>[206994.958422]  ffff8806c7921f78 ffffffff8024f528 ffff88069b6e4a80 00007f7cadc3c000
<4>[206994.965956]  ffff8806fffffff2 ffff8806c7921f10 ffffffff80aaa6f0 ffffffff80aaa6f0
<4>[206994.973488] Call Trace:
<4>[206994.976023]  [<ffffffff802604da>] sock_poll+0x1a/0x20
<4>[206994.981148]  [<ffffffff8024f528>] sys_epoll_wait+0x788/0x940
<4>[206994.986881]  [<ffffffff802d81b0>] ? ep_send_events_proc+0x120/0x120
<4>[206994.993226]  [<ffffffff8072e032>] system_call_fastpath+0x16/0x1b




* Re: [PATCH v3 net-next 0/1] net: lls select poll support
  2013-06-19 10:04 [PATCH v3 net-next 0/1] net: lls select poll support Eliezer Tamir
                   ` (2 preceding siblings ...)
  2013-06-19 12:13 ` [PATCH v3 net-next 0/1] net: lls select poll support Eric Dumazet
@ 2013-06-24  1:44 ` David Miller
  2013-06-24  4:23   ` Eliezer Tamir
  3 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2013-06-24  1:44 UTC (permalink / raw)
  To: eliezer.tamir
  Cc: linux-kernel, netdev, jesse.brandeburg, donald.c.skidmore,
	e1000-devel, willemb, erdnetdev, bhutchings, andi, hpa, eilong,
	or.gerlitz, amirv, alexr, eliezer

From: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Date: Wed, 19 Jun 2013 13:04:21 +0300

> One question: do we need in sock_poll() to test that sock->sk is not null?
> (Thanks to Willem de Bruijn for pointing this out.) 

We should not have to.

Please clean up various things in this patch:

1) You have cases where you add tons of unnecessary empty lines, of the
   form:

	if (foo) {
	/* EMPTY LINE */
		code...

    please get rid of them.

2) Some of the LL loops have conditions that are not properly formatted,
   I see things like:

	do {
		...
	} while (x && y && z
			&& whatever)

   That "&& whatever" is not properly indented.  You must always indent
   multi-line conditions like this:

	xxx (x && y &&
	     z && ...)

   That is, operators always end the line, they never begin the line.
   And on the second and subsequent lines you must line up the first
   character with the first column after the opening parenthesis of
   the first line.  You should not just use a series of only TAB
   characters to indent these expressions.
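
   For instance, the receive-queue loop in sk_poll_ll() from this series,
   laid out according to the above (illustration only), would read:

	} while (!nonblock &&
		 skb_queue_empty(&sk->sk_receive_queue) &&
		 can_poll_ll(end_time));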

Thanks.


* Re: [PATCH v3 net-next 0/1] net: lls select poll support
  2013-06-24  1:44 ` David Miller
@ 2013-06-24  4:23   ` Eliezer Tamir
  0 siblings, 0 replies; 8+ messages in thread
From: Eliezer Tamir @ 2013-06-24  4:23 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, netdev, jesse.brandeburg, donald.c.skidmore,
	e1000-devel, willemb, erdnetdev, bhutchings, andi, hpa, eilong,
	or.gerlitz, amirv, alexr, eliezer

On 24/06/2013 04:44, David Miller wrote:
> From: Eliezer Tamir <eliezer.tamir@linux.intel.com>
> Date: Wed, 19 Jun 2013 13:04:21 +0300
>
>> One question: do we need in sock_poll() to test that sock->sk is not null?
>> (Thanks to Willem de Bruijn for pointing this out.)
>
> We should not have to.
>
> Please clean up various things in this patch:

Ok,
Thanks!


* Re: [PATCH RFC] net: lls epoll support
  2013-06-19 11:42 ` [PATCH RFC] net: lls epoll support Eliezer Tamir
@ 2013-06-25 14:26   ` yaniv saar
  2013-06-25 15:34     ` Eliezer Tamir
  0 siblings, 1 reply; 8+ messages in thread
From: yaniv saar @ 2013-06-25 14:26 UTC (permalink / raw)
  To: Eliezer Tamir
  Cc: David Miller, linux-kernel, netdev, Jesse Brandeburg,
	Don Skidmore, e1000-devel, Willem de Bruijn, Eric Dumazet,
	Ben Hutchings, Andi Kleen, HPA, Eilon Greenstien, Or Gerlitz,
	Amir Vadai, Alex Rosenbaum, Eliezer Tamir

On Wed, Jun 19, 2013 at 2:42 PM, Eliezer Tamir
<eliezer.tamir@linux.intel.com> wrote:
> This is a wild hack, just as a POC to show the power of LLS with epoll.
>
> We assume that we only ever need to poll on one device queue,
> so the first FD that reports POLL_LL gets saved aside so we can poll on it.
>
> While this assumption is wrong in so many ways, it's very easy to satisfy
> with a micro-benchmark.
>
> [this patch needs the poll patch to be applied first]
> with sockperf doing epoll on 1000 sockets I see an avg latency of 6us
>

hi eliezer,

please consider the following solution for epoll that is based on
polling dev+queue.
instead of looping over the sockets as in LLS, maintain in the eventpoll
struct a list of device+queues (qdlist).
the qdlist must be unique w.r.t. device+queue (no two identical
device+queue items in the qdlist).
each device+queue item (qditem) holds (see the sketch at the end of this mail):
* device (id)
* queue (id)
* list of epi (epilist) that created this qditem
- I think it won't be possible to extend epitem (it would break the
128-byte cache alignment)... instead you can have a simple ll_usec list.
* ll_usec, the maximum time to poll from all the referring epi items.
finally, polling should iterate over the qdlist once, and then check for events.

----
as far as coding goes, this sketch involves:
1) adjust eventpoll struct.
2) initialize on creation (epoll_create)
3) update the list on modification (epoll_ctl)
3.1) ep_insert->add this epi/ll_usec to the relevant qditem (or create a
new one), and update qditem->ll_usec
3.2) ep_remove->remove this epi/ll_usec from the relevant qditem (it MUST
already exist -- sort of ref counting), and update qditem->ll_usec
3.3) ep_modify->...
4) on polling event (epoll_wait)
ep_poll->if qdlist is not empty, then find the maximum ll_usec (could
be done while maintaining...)
... and just before going into wait ...
if max ll_usec!=0
    poll once on all device+queues in the qdlist.
    continue to the next iteration (check events).
5) to support this flow we also need to implement API for
5.1) given a file/fd/epi, if it is a sock then get the device+queue.
5.2) poll over a given device+queue (dq_poll_ll) once.
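
a rough sketch of the qditem layout described above (illustrative only;
the field names are made up here and not taken from any posted patch):

	/* needs <linux/list.h> for struct list_head */
	struct qditem {
		int			dev_id;		/* device (id) */
		int			queue_id;	/* queue (id) */
		struct list_head	epilist;	/* epi items referring to this qditem */
		unsigned long		ll_usec;	/* max busy-poll time among them */
		struct list_head	qdlink;		/* link into the eventpoll qdlist */
	};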


* Re: [PATCH RFC] net: lls epoll support
  2013-06-25 14:26   ` yaniv saar
@ 2013-06-25 15:34     ` Eliezer Tamir
  0 siblings, 0 replies; 8+ messages in thread
From: Eliezer Tamir @ 2013-06-25 15:34 UTC (permalink / raw)
  To: yaniv saar
  Cc: David Miller, linux-kernel, netdev, Jesse Brandeburg,
	Don Skidmore, e1000-devel, Willem de Bruijn, Eric Dumazet,
	Ben Hutchings, Andi Kleen, HPA, Eilon Greenstien, Or Gerlitz,
	Amir Vadai, Alex Rosenbaum, Eliezer Tamir

On 25/06/2013 17:26, yaniv saar wrote:
> On Wed, Jun 19, 2013 at 2:42 PM, Eliezer Tamir
> <eliezer.tamir@linux.intel.com> wrote:
>>
>> [this patch needs the poll patch to be applied first]
>> with sockperf doing epoll on 1000 sockets I see an avg latency of 6us
>>
>
> hi eliezer,
>
> please consider the following solution for epoll that is based on
> polling dev+queue.
> instead of looping over the socket as in LLS, maintain in eventpool
> struct a list of device+queues (qdlist).

Thanks for looking into this.
I'm currently working on a solution that has a lot in common with what
you are proposing.

We don't need a new id mechanism, we already have the napi_id.
The nice thing about the napi_id is that the only locking it needs
is an rcu_read_lock when dereferencing.

we don't need to remember the ll_usec value of each socket because
the patch for select/poll (currently waiting for review) added
a separate sysctl value for poll.

I would like to find a way for the user to specify how long to busy
wait, directly from the system call, but I was not able to find
a simple way of adding this without a change to the system call
prototype.

we do however need to track when a socket's napi_id changes.
But for that we can hook into sk_mark_ll().

so here is a list of proposed changes:

1. add a linked list of unique napi_id's to struct eventpoll
(a rough sketch follows this list).
each id will have a collision list of sockets that have the same id.
- a hash is gratuitous; we expect the unique list to have 0 to 2
  elements in most cases.

2. when a new socket is added, if its id is new it gets added to the 
unique list, otherwise to the collision list of that id.

3. when a socket is removed, if it's on the unique list, replace it
with the first on its former collision list.

4. add a callback mechanism to sk_mark_ll() which will be activated when
  the mark changes, and update the lists.
(a socket may be polled by more than one epoll so be careful)

5. add to and remove from the lists in ep_insert and ep_remove
  respectively. check if we need to do something for ep_modify().

6. add an ep_poll helper that will round robin polling on the
files in the unique list.

7. init everything from epoll_create.
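
As a rough sketch (names here are illustrative only, not from any posted
patch), the per-id entry from (1) and (2) could look like:

	/* needs <linux/list.h> for struct list_head */
	struct ep_unique_napi_id {
		unsigned int		napi_id;	/* unique NAPI id to busy poll on */
		struct list_head	link;		/* entry in eventpoll's unique list */
		struct list_head	collisions;	/* sockets sharing this napi_id */
	};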

locking:

napi_id's are great since they don't need locking except for an
rcu_read_lock when polling on one.

the lists need a spinlock for adding/removing, maybe they
can use ep->lock.

callback registration/removal needs to use the same mechanism that
ep_add / ep_remove use to protect themselves from the rest of epoll.



