netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/5] net: introduce noref sk
@ 2017-09-20 16:54 Paolo Abeni
  2017-09-20 16:54 ` [PATCH net-next 1/5] net: add support for noref skb->sk Paolo Abeni
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

This series introduces the infrastructure to store inside the skb a socket
pointer without carrying a refcount to the socket.

Such infrastructure is then used in the network receive path - and
specifically the early demux operation.

This allows the UDP early demux to perform a full lookup for UDP sockets,
with many benefits:

- the UDP early demux code is now much simpler
- the early demux does not hit any performance penalty in case of UDP hash
  table collisions - previously the early demux performed a partial,
  unsuccessful lookup
- early demux now also works for unconnected sockets.

This infrastructure will be used in a follow-up series to allow dst caching
for unconnected UDP sockets, and then to extend the same features to TCP
listening sockets.
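
As a rough illustration of the noref idea described above, here is a minimal
userspace sketch. The struct layouts are simplified stand-ins (an assumption,
not the kernel definitions), and the assert() in the setter is a sketch-only
precondition that the actual patch does not have. The point it shows is that
the address of a do-nothing destructor doubles as the "this sk is noref"
marker, so no extra skb field is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins: only the fields relevant to the noref idea. */
struct sock {
	int refcnt;
};

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
};

/* Sentinel destructor: its address marks skb->sk as "noref". */
static void sock_dummyfree(struct sk_buff *skb)
{
	(void)skb;
}

static void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
{
	/* sketch-only check: the skb must not already carry a socket */
	assert(skb->sk == NULL);

	skb->sk = sk;			/* note: no refcount is taken */
	skb->destructor = sock_dummyfree;
}

static bool skb_has_noref_sk(const struct sk_buff *skb)
{
	return skb->destructor == sock_dummyfree;
}

/* Returns the detached socket, or NULL if the skb had no noref sk. */
static struct sock *skb_clear_noref_sk(struct sk_buff *skb)
{
	struct sock *ret;

	if (!skb_has_noref_sk(skb))
		return NULL;

	ret = skb->sk;
	skb->sk = NULL;
	skb->destructor = NULL;
	return ret;
}
```

Attaching and detaching a socket this way never changes its refcount, which
is why the pointer must not outlive the RCU section in which it was looked up.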

Paolo Abeni (5):
  net: add support for noref skb->sk
  net: allow early demux to fetch noref socket
  udp: do not touch socket refcount in early demux
  net: add simple socket-like dst cache helpers
  udp: perform full socket lookup in early demux

 include/linux/skbuff.h           | 30 +++++++++++++++
 include/linux/udp.h              |  2 +
 include/net/dst.h                | 12 ++++++
 net/core/dst.c                   | 16 ++++++++
 net/core/sock.c                  |  6 +++
 net/ipv4/ip_input.c              | 12 ++++++
 net/ipv4/ipmr.c                  | 18 +++++++--
 net/ipv4/netfilter/nf_dup_ipv4.c |  3 ++
 net/ipv4/udp.c                   | 80 ++++++++++++++++------------------------
 net/ipv6/ip6_input.c             |  7 +++-
 net/ipv6/netfilter/nf_dup_ipv6.c |  3 ++
 net/ipv6/udp.c                   | 67 ++++++++++-----------------------
 net/netfilter/nf_queue.c         |  3 ++
 13 files changed, 159 insertions(+), 100 deletions(-)

-- 
2.13.5


* [PATCH net-next 1/5] net: add support for noref skb->sk
  2017-09-20 16:54 [PATCH net-next 0/5] net: introduce noref sk Paolo Abeni
@ 2017-09-20 16:54 ` Paolo Abeni
  2017-09-20 17:41   ` Eric Dumazet
  2017-09-20 16:54 ` [PATCH net-next 2/5] net: allow early demux to fetch noref socket Paolo Abeni
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Noref sk do not carry a socket refcount, are valid
only inside the current RCU section and must be
explicitly cleared before exiting such section.

They will be used in a later patch to allow early demux
without sock refcounting.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 30 ++++++++++++++++++++++++++++++
 net/core/sock.c        |  6 ++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 72299ef00061..459a5672811d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -922,6 +922,36 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 	return (struct rtable *)skb_dst(skb);
 }
 
+void sock_dummyfree(struct sk_buff *skb);
+
+/* only early demux can set noref socks
+ * noref socks do not carry any refcount and must be
+ * cleared before exiting the current RCU section
+ */
+static inline void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
+{
+	skb->sk = sk;
+	skb->destructor = sock_dummyfree;
+}
+
+static inline bool skb_has_noref_sk(struct sk_buff *skb)
+{
+	return skb->destructor == sock_dummyfree;
+}
+
+static inline struct sock *skb_clear_noref_sk(struct sk_buff *skb)
+{
+	struct sock *ret;
+
+	if (!skb_has_noref_sk(skb))
+		return NULL;
+
+	ret = skb->sk;
+	skb->sk = NULL;
+	skb->destructor = NULL;
+	return ret;
+}
+
 /* For mangling skb->pkt_type from user space side from applications
  * such as nft, tc, etc, we only allow a conservative subset of
  * possible pkt_types to be set.
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..3aa4950639bb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1893,6 +1893,12 @@ void sock_efree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_efree);
 
+/* dummy destructor used by noref sockets */
+void sock_dummyfree(struct sk_buff *skb)
+{
+}
+EXPORT_SYMBOL(sock_dummyfree);
+
 kuid_t sock_i_uid(struct sock *sk)
 {
 	kuid_t uid;
-- 
2.13.5


* [PATCH net-next 2/5] net: allow early demux to fetch noref socket
  2017-09-20 16:54 [PATCH net-next 0/5] net: introduce noref sk Paolo Abeni
  2017-09-20 16:54 ` [PATCH net-next 1/5] net: add support for noref skb->sk Paolo Abeni
@ 2017-09-20 16:54 ` Paolo Abeni
  2017-09-21  9:13   ` Paolo Abeni
  2017-09-20 16:54 ` [PATCH net-next 3/5] udp: do not touch socket refcount in early demux Paolo Abeni
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

We must be careful to avoid leaking such sockets outside
the RCU section containing the early demux call; we clear
them on nonlocal delivery.

For ipv4 we must take care of local mcast delivery, too,
since udp early demux works also for mcast addresses.

Also update all iptables/nftables extensions that can
run in the input chain and can transmit the skb outside
such path, namely TEE, nft_dup and nfqueue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/ip_input.c              | 12 ++++++++++++
 net/ipv4/ipmr.c                  | 18 ++++++++++++++----
 net/ipv4/netfilter/nf_dup_ipv4.c |  3 +++
 net/ipv6/ip6_input.c             |  7 ++++++-
 net/ipv6/netfilter/nf_dup_ipv6.c |  3 +++
 net/netfilter/nf_queue.c         |  3 +++
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index fa2dc8f692c6..e71abc8b698c 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -349,6 +349,18 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 				__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
 			goto drop;
 		}
+
+		/* Since the sk has no reference to the socket, we must
+		 * clear it before escaping this RCU section.
+		 * The sk is just a hint and we know we are not going to use
+		 * it outside the input path.
+		 */
+		if (skb_dst(skb)->input != ip_local_deliver
+#ifdef CONFIG_IP_MROUTE
+		    && skb_dst(skb)->input != ip_mr_input
+#endif
+		    )
+			skb_clear_noref_sk(skb);
 	}
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index c9b3e6e069ae..76642af79038 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1978,11 +1978,12 @@ static struct mr_table *ipmr_rt_fib_lookup(struct net *net, struct sk_buff *skb)
  */
 int ip_mr_input(struct sk_buff *skb)
 {
-	struct mfc_cache *cache;
-	struct net *net = dev_net(skb->dev);
 	int local = skb_rtable(skb)->rt_flags & RTCF_LOCAL;
-	struct mr_table *mrt;
+	struct net *net = dev_net(skb->dev);
+	struct mfc_cache *cache;
 	struct net_device *dev;
+	struct mr_table *mrt;
+	struct sock *sk;
 
 	/* skb->dev passed in is the loX master dev for vrfs.
 	 * As there are no vifs associated with loopback devices,
@@ -2052,6 +2053,9 @@ int ip_mr_input(struct sk_buff *skb)
 			skb = skb2;
 		}
 
+		/* avoid leaking the noref sk on forward path */
+		skb_clear_noref_sk(skb);
+
 		read_lock(&mrt_lock);
 		vif = ipmr_find_vif(mrt, dev);
 		if (vif >= 0) {
@@ -2065,12 +2069,18 @@ int ip_mr_input(struct sk_buff *skb)
 		return -ENODEV;
 	}
 
+	/* avoid leaking the noref sk on forward path... */
+	sk = skb_clear_noref_sk(skb);
 	read_lock(&mrt_lock);
 	ip_mr_forward(net, mrt, dev, skb, cache, local);
 	read_unlock(&mrt_lock);
 
-	if (local)
+	if (local) {
+		/* ... but preserve it for local delivery */
+		if (sk)
+			skb_set_noref_sk(skb, sk);
 		return ip_local_deliver(skb);
+	}
 
 	return 0;
 
diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index 39895b9ddeb9..bf8b78492fc8 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -71,6 +71,9 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 	nf_reset(skb);
 	nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 #endif
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	/*
 	 * If we are in PREROUTING/INPUT, decrease the TTL to mitigate potential
 	 * loops between two hosts.
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 9ee208a348f5..9aa6baffd4b9 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -65,9 +65,14 @@ int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 		if (ipprot && (edemux = READ_ONCE(ipprot->early_demux)))
 			edemux(skb);
 	}
-	if (!skb_valid_dst(skb))
+	if (!skb_valid_dst(skb)) {
 		ip6_route_input(skb);
 
+		/* see comment on ipv4 edemux */
+		if (skb_dst(skb)->input != ip6_input)
+			skb_clear_noref_sk(skb);
+	}
+
 	return dst_input(skb);
 }
 
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index 4a7ddeddbaab..939f6a2238f9 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -60,6 +60,9 @@ void nf_dup_ipv6(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 	nf_reset(skb);
 	nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 #endif
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	if (hooknum == NF_INET_PRE_ROUTING ||
 	    hooknum == NF_INET_LOCAL_IN) {
 		struct ipv6hdr *iph = ipv6_hdr(skb);
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index f7e21953b1de..100eff08cb51 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -145,6 +145,9 @@ static int __nf_queue(struct sk_buff *skb, const struct nf_hook_state *state,
 		.size	= sizeof(*entry) + afinfo->route_key_size,
 	};
 
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	nf_queue_entry_get_refs(entry);
 	skb_dst_force(skb);
 	afinfo->saveroute(skb, entry);
-- 
2.13.5


* [PATCH net-next 3/5] udp: do not touch socket refcount in early demux
  2017-09-20 16:54 [PATCH net-next 0/5] net: introduce noref sk Paolo Abeni
  2017-09-20 16:54 ` [PATCH net-next 1/5] net: add support for noref skb->sk Paolo Abeni
  2017-09-20 16:54 ` [PATCH net-next 2/5] net: allow early demux to fetch noref socket Paolo Abeni
@ 2017-09-20 16:54 ` Paolo Abeni
  2017-09-20 16:54 ` [PATCH net-next 4/5] net: add simple socket-like dst cache helpers Paolo Abeni
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Use noref sockets in the UDP early demux instead of taking a socket
reference. This gives a small performance improvement and will allow
efficient early demux for unconnected sockets in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 18 ++++++++++--------
 net/ipv6/udp.c | 10 ++++++----
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 784ced0b9150..ba49d5aa9f09 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2050,12 +2050,13 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
 int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		   int proto)
 {
-	struct sock *sk;
-	struct udphdr *uh;
-	unsigned short ulen;
+	struct net *net = dev_net(skb->dev);
 	struct rtable *rt = skb_rtable(skb);
+	unsigned short ulen;
 	__be32 saddr, daddr;
-	struct net *net = dev_net(skb->dev);
+	struct udphdr *uh;
+	struct sock *sk;
+	bool noref_sk;
 
 	/*
 	 *  Validate the packet.
@@ -2081,6 +2082,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
+	noref_sk = skb_has_noref_sk(skb);
 	sk = skb_steal_sock(skb);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
@@ -2090,7 +2092,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp_sk_rx_dst_set(sk, dst);
 
 		ret = udp_queue_rcv_skb(sk, skb);
-		sock_put(sk);
+		if (!noref_sk)
+			sock_put(sk);
 		/* a return value > 0 means to resubmit the input, but
 		 * it wants the return to be -protocol, or 0
 		 */
@@ -2261,11 +2264,10 @@ void udp_v4_early_demux(struct sk_buff *skb)
 					     uh->source, iph->saddr, dif, sdif);
 	}
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return;
 
-	skb->sk = sk;
-	skb->destructor = sock_efree;
+	skb_set_noref_sk(skb, sk);
 	dst = READ_ONCE(sk->sk_rx_dst);
 
 	if (dst)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e2ecfb137297..8f62392c4c35 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -787,6 +787,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct net *net = dev_net(skb->dev);
 	struct udphdr *uh;
 	struct sock *sk;
+	bool noref_sk;
 	u32 ulen = 0;
 
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -823,6 +824,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		goto csum_error;
 
 	/* Check if the socket is already available, e.g. due to early demux */
+	noref_sk = skb_has_noref_sk(skb);
 	sk = skb_steal_sock(skb);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
@@ -832,7 +834,8 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp6_sk_rx_dst_set(sk, dst);
 
 		ret = udpv6_queue_rcv_skb(sk, skb);
-		sock_put(sk);
+		if (!noref_sk)
+			sock_put(sk);
 
 		/* a return value > 0 means to resubmit the input */
 		if (ret > 0)
@@ -948,11 +951,10 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	else
 		return;
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return;
 
-	skb->sk = sk;
-	skb->destructor = sock_efree;
+	skb_set_noref_sk(skb, sk);
 	dst = READ_ONCE(sk->sk_rx_dst);
 
 	if (dst)
-- 
2.13.5


* [PATCH net-next 4/5] net: add simple socket-like dst cache helpers
  2017-09-20 16:54 [PATCH net-next 0/5] net: introduce noref sk Paolo Abeni
                   ` (2 preceding siblings ...)
  2017-09-20 16:54 ` [PATCH net-next 3/5] udp: do not touch socket refcount in early demux Paolo Abeni
@ 2017-09-20 16:54 ` Paolo Abeni
  2017-09-20 16:54 ` [PATCH net-next 5/5] udp: perform full socket lookup in early demux Paolo Abeni
  2017-09-21  3:20 ` [PATCH net-next 0/5] net: introduce noref sk David Miller
  5 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

These helpers will be used by later patches to reduce code duplication.
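
To make the refcount transitions of the cache update easier to follow, here
is a hedged userspace model of it. It uses C11 atomics in place of the kernel
primitives and a one-field stand-in for struct dst_entry; both are
assumptions made for illustration only. The function names mirror the patch,
but the bodies are a sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel structure: only the refcount is modelled. */
struct dst_entry {
	atomic_int refcnt;
};

/* Take a reference unless the entry is already dead (refcount == 0). */
static bool dst_hold_safe(struct dst_entry *dst)
{
	int old = atomic_load(&dst->refcnt);

	while (old != 0) {
		/* on failure 'old' is reloaded with the current value */
		if (atomic_compare_exchange_weak(&dst->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void dst_release(struct dst_entry *dst)
{
	if (dst) {
		int old = atomic_fetch_sub(&dst->refcnt, 1);

		assert(old > 0);	/* releasing a dead entry is a bug */
	}
}

/* 'dst' is not refcounted by the caller; the cache takes its own ref.
 * Returns true only if the cached pointer actually changed.
 */
static bool dst_update(_Atomic(struct dst_entry *) *cache,
		       struct dst_entry *dst)
{
	if (atomic_load(cache) == dst)
		return false;

	if (dst_hold_safe(dst)) {
		struct dst_entry *old = atomic_exchange(cache, dst);

		/* drops the old entry's ref; if a concurrent updater
		 * already stored 'dst', this drops our extra ref instead
		 */
		dst_release(old);
		return old != dst;
	}
	return false;
}
```

In this model the cache always owns exactly one reference on whatever entry
it points to, and an entry whose refcount already reached zero is never
installed.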

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/dst.h | 12 ++++++++++++
 net/core/dst.c    | 16 ++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..a6a39357f19a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -485,6 +485,18 @@ static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 	return dst;
 }
 
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst);
+static inline struct dst_entry *dst_access(struct dst_entry **cache,
+					      u32 cookie)
+{
+	struct dst_entry *dst = READ_ONCE(*cache);
+
+	if (!dst)
+		return NULL;
+
+	return dst_check(dst, cookie);
+}
+
 /* Flags for xfrm_lookup flags argument. */
 enum {
 	XFRM_LOOKUP_ICMP = 1 << 0,
diff --git a/net/core/dst.c b/net/core/dst.c
index a6c47da7d0f8..6aff0a3e7ba3 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -205,6 +205,22 @@ void dst_release_immediate(struct dst_entry *dst)
 }
 EXPORT_SYMBOL(dst_release_immediate);
 
+/* 'dst' is not refcounted */
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst)
+{
+	if (likely(*cache == dst))
+		return false;
+
+	if (dst_hold_safe(dst)) {
+		struct dst_entry *old = xchg(cache, dst);
+
+		dst_release(old);
+		return old != dst;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(dst_update);
+
 u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 {
 	struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);
-- 
2.13.5


* [PATCH net-next 5/5] udp: perform full socket lookup in early demux
  2017-09-20 16:54 [PATCH net-next 0/5] net: introduce noref sk Paolo Abeni
                   ` (3 preceding siblings ...)
  2017-09-20 16:54 ` [PATCH net-next 4/5] net: add simple socket-like dst cache helpers Paolo Abeni
@ 2017-09-20 16:54 ` Paolo Abeni
  2017-09-21  3:20 ` [PATCH net-next 0/5] net: introduce noref sk David Miller
  5 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2017-09-20 16:54 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Since the UDP early demux lookup fetches noref socket references,
we can safely be optimistic about it and set the sk reference
even if the skb is not going to land on such socket, avoiding
the rx dst cache usage for unconnected unicast sockets.

This avoids a second lookup for unconnected sockets, and cleans
up the whole UDP early demux code a bit.

After this change, on hosts not acting as routers, the UDP
early demux never negatively affects receive performance,
while before this change UDP early demux caused a measurable
performance impact for unconnected sockets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/udp.h |  2 ++
 net/ipv4/udp.c      | 62 +++++++++++++++++++----------------------------------
 net/ipv6/udp.c      | 57 ++++++++++++------------------------------------
 3 files changed, 38 insertions(+), 83 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..9c68b57543cc 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -92,6 +92,8 @@ static inline struct udp_sock *udp_sk(const struct sock *sk)
 	return (struct udp_sock *)sk;
 }
 
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie);
+
 static inline void udp_set_no_check6_tx(struct sock *sk, bool val)
 {
 	udp_sk(sk)->no_check6_tx = val;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ba49d5aa9f09..5cbbd78024dc 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2043,6 +2043,11 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
 							 inet_compute_pseudo);
 }
 
+static bool udp_use_rx_dst_cache(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_state == TCP_ESTABLISHED || skb->pkt_type != PACKET_HOST;
+}
+
 /*
  *	All we need to do is get the socket, and then do a checksum.
  */
@@ -2088,8 +2093,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
-			udp_sk_rx_dst_set(sk, dst);
+		if (udp_use_rx_dst_cache(sk, skb))
+			dst_update(&sk->sk_rx_dst, dst);
 
 		ret = udp_queue_rcv_skb(sk, skb);
 		if (!noref_sk)
@@ -2196,42 +2201,28 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 	return result;
 }
 
-/* For unicast we should only early demux connected sockets or we can
- * break forwarding setups.  The chains here can be long so only check
- * if the first socket is an exact match and if not move on.
- */
-static struct sock *__udp4_lib_demux_lookup(struct net *net,
-					    __be16 loc_port, __be32 loc_addr,
-					    __be16 rmt_port, __be32 rmt_addr,
-					    int dif, int sdif)
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie)
 {
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
+	struct dst_entry *dst = dst_access(&sk->sk_rx_dst, cookie);
 
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (INET_MATCH(sk, net, acookie, rmt_addr,
-			       loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
+	if (dst) {
+		/* set noref for now.
+		 * any place which wants to hold dst has to call
+		 * dst_hold_safe()
+		 */
+		skb_dst_set_noref(skb, dst);
 	}
-	return NULL;
 }
+EXPORT_SYMBOL_GPL(udp_set_skb_rx_dst);
 
 void udp_v4_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
+	int dif = skb->dev->ifindex;
+	int sdif = inet_sdif(skb);
 	const struct iphdr *iph;
 	const struct udphdr *uh;
 	struct sock *sk = NULL;
-	struct dst_entry *dst;
-	int dif = skb->dev->ifindex;
-	int sdif = inet_sdif(skb);
 	int ours;
 
 	/* validate the packet */
@@ -2260,25 +2251,16 @@ void udp_v4_early_demux(struct sk_buff *skb)
 						   uh->source, iph->saddr,
 						   dif, sdif);
 	} else if (skb->pkt_type == PACKET_HOST) {
-		sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
-					     uh->source, iph->saddr, dif, sdif);
+		sk = __udp4_lib_lookup(net, iph->saddr, uh->source, iph->daddr,
+				       uh->dest, dif, sdif, &udp_table, skb);
 	}
 
 	if (!sk)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, 0);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp_use_rx_dst_cache(sk, skb))
+		udp_set_skb_rx_dst(sk, skb, 0);
 }
 
 int udp_rcv(struct sk_buff *skb)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8f62392c4c35..67d340679c3a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -773,13 +773,18 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 
 static void udp6_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	if (udp_sk_rx_dst_set(sk, dst)) {
+	if (unlikely(dst_update(&sk->sk_rx_dst, dst))) {
 		const struct rt6_info *rt = (const struct rt6_info *)dst;
 
 		inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
 	}
 }
 
+static bool udp6_use_rx_dst_cache(struct sock *sk)
+{
+	return sk->sk_state == TCP_ESTABLISHED;
+}
+
 int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		   int proto)
 {
@@ -830,7 +835,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
+		if (udp6_use_rx_dst_cache(sk))
 			udp6_sk_rx_dst_set(sk, dst);
 
 		ret = udpv6_queue_rcv_skb(sk, skb);
@@ -905,37 +910,13 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	return 0;
 }
 
-
-static struct sock *__udp6_lib_demux_lookup(struct net *net,
-			__be16 loc_port, const struct in6_addr *loc_addr,
-			__be16 rmt_port, const struct in6_addr *rmt_addr,
-			int dif, int sdif)
-{
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp6_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
-
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (sk->sk_state == TCP_ESTABLISHED &&
-		    INET6_MATCH(sk, net, rmt_addr, loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
-	}
-	return NULL;
-}
-
 static void udp_v6_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
-	const struct udphdr *uh;
-	struct sock *sk;
-	struct dst_entry *dst;
 	int dif = skb->dev->ifindex;
 	int sdif = inet6_sdif(skb);
+	const struct udphdr *uh;
+	struct sock *sk;
 
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) +
 	    sizeof(struct udphdr)))
@@ -944,10 +925,9 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	uh = udp_hdr(skb);
 
 	if (skb->pkt_type == PACKET_HOST)
-		sk = __udp6_lib_demux_lookup(net, uh->dest,
-					     &ipv6_hdr(skb)->daddr,
-					     uh->source, &ipv6_hdr(skb)->saddr,
-					     dif, sdif);
+		sk = __udp6_lib_lookup(net, &ipv6_hdr(skb)->saddr, uh->source,
+				       &ipv6_hdr(skb)->daddr, uh->dest, dif,
+				       sdif, &udp_table, skb);
 	else
 		return;
 
@@ -955,17 +935,8 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp6_use_rx_dst_cache(sk))
+		udp_set_skb_rx_dst(sk, skb, inet6_sk(sk)->rx_dst_cookie);
 }
 
 static __inline__ int udpv6_rcv(struct sk_buff *skb)
-- 
2.13.5


* Re: [PATCH net-next 1/5] net: add support for noref skb->sk
  2017-09-20 16:54 ` [PATCH net-next 1/5] net: add support for noref skb->sk Paolo Abeni
@ 2017-09-20 17:41   ` Eric Dumazet
  2017-09-21  9:14     ` Paolo Abeni
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2017-09-20 17:41 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> Noref sk do not carry a socket refcount, are valid
> only inside the current RCU section and must be
> explicitly cleared before exiting such section.
> 
> They will be used in a later patch to allow early demux
> without sock refcounting.




> +/* dummy destructor used by noref sockets */
> +void sock_dummyfree(struct sk_buff *skb)
> +{

BUG();

> +}
> +EXPORT_SYMBOL(sock_dummyfree);
> +


I do not see how you ensure we do not leave RCU section with an skb
destructor pointing to this sock_dummyfree()

This patch series looks quite dangerous to me.

Do we really have real applications using connected UDP sockets and
wanting very high pps throughput ?

I am pretty sure the bottleneck is the sender part.


* Re: [PATCH net-next 0/5] net: introduce noref sk
  2017-09-20 16:54 [PATCH net-next 0/5] net: introduce noref sk Paolo Abeni
                   ` (4 preceding siblings ...)
  2017-09-20 16:54 ` [PATCH net-next 5/5] udp: perform full socket lookup in early demux Paolo Abeni
@ 2017-09-21  3:20 ` David Miller
  2017-09-21  9:42   ` Paolo Abeni
  5 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2017-09-21  3:20 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, pablo, fw, edumazet, hannes

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 20 Sep 2017 18:54:00 +0200

> This series introduces the infrastructure to store inside the skb a socket
> pointer without carrying a refcount to the socket.
> 
> Such infrastructure is then used in the network receive path - and
> specifically the early demux operation.
> 
> This allows the UDP early demux to perform a full lookup for UDP sockets,
> with many benefits:
> 
> - the UDP early demux code is now much simpler
> - the early demux does not hit any performance penalty in case of UDP hash
>   table collisions - previously the early demux performed a partial,
>   unsuccessful lookup
> - early demux now also works for unconnected sockets.
> 
> This infrastructure will be used in a follow-up series to allow dst caching
> for unconnected UDP sockets, and then to extend the same features to TCP
> listening sockets.

Like Eric, I find this series (while exciting) quite scary :-)

You really have to post some kind of performance numbers in your
header posting in order to justify something with these ramifications
and scale.

Thank you.


* Re: [PATCH net-next 2/5] net: allow early demux to fetch noref socket
  2017-09-20 16:54 ` [PATCH net-next 2/5] net: allow early demux to fetch noref socket Paolo Abeni
@ 2017-09-21  9:13   ` Paolo Abeni
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Abeni @ 2017-09-21  9:13 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> We must be careful to avoid leaking such sockets outside
> the RCU section containing the early demux call; we clear
> them on nonlocal delivery.
> 
> For ipv4 we must take care of local mcast delivery, too,
> since udp early demux works also for mcast addresses.
> 
> Also update all iptables/nftables extensions that can
> run in the input chain and can transmit the skb outside
> such path, namely TEE, nft_dup and nfqueue.
> 
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/ipv4/ip_input.c              | 12 ++++++++++++
>  net/ipv4/ipmr.c                  | 18 ++++++++++++++----
>  net/ipv4/netfilter/nf_dup_ipv4.c |  3 +++
>  net/ipv6/ip6_input.c             |  7 ++++++-
>  net/ipv6/netfilter/nf_dup_ipv6.c |  3 +++
>  net/netfilter/nf_queue.c         |  3 +++
>  6 files changed, 41 insertions(+), 5 deletions(-)
> 
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index fa2dc8f692c6..e71abc8b698c 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -349,6 +349,18 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
>  				__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
>  			goto drop;
>  		}
> +
> +		/* Since the sk has no reference to the socket, we must
> +		 * clear it before escaping this RCU section.
> +		 * The sk is just a hint and we know we are not going to use
> +		 * it outside the input path.
> +		 */
> +		if (skb_dst(skb)->input != ip_local_deliver
> +#ifdef CONFIG_IP_MROUTE
> +		    && skb_dst(skb)->input != ip_mr_input
> +#endif
> +		    )
> +			skb_clear_noref_sk(skb);
>  	}

The above is to allow early demux for multicast sockets even on hosts
acting as multicast routers. This is probably overkill: a host will
probably either act as a multicast router or receive a large amount of
locally terminated mcast traffic, but not both.

We can instead preserve the sknoref only for ip_local_deliver(),
dropping the early demux optimization in the above scenario, which
should not be very relevant. I will simplify the above chunk and drop the
need for the ipmr.c changes below; overall this patch will become much
simpler.
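
A minimal userspace sketch of the simplified rule proposed above: keep the
noref sk only when the route's input handler is ip_local_deliver(), and clear
it in every other case, including the multicast-router one. The structs are
reduced stand-ins, the three input handlers are dummies whose only relevant
property is their address, and edemux_keep_sk_if_local() is a hypothetical
helper name used for illustration only.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff;
typedef int (*input_fn)(struct sk_buff *skb);

/* Reduced stand-ins for the kernel structures. */
struct sock {
	int unused;
};

struct dst_entry {
	input_fn input;
};

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
	struct dst_entry *dst;
};

static void sock_dummyfree(struct sk_buff *skb)
{
	(void)skb;
}

/* Dummy handlers: only their identity matters in this sketch. */
static int ip_local_deliver(struct sk_buff *skb) { (void)skb; return 0; }
static int ip_forward(struct sk_buff *skb) { (void)skb; return 1; }
static int ip_mr_input(struct sk_buff *skb) { (void)skb; return 2; }

static void skb_clear_noref_sk(struct sk_buff *skb)
{
	if (skb->destructor != sock_dummyfree)
		return;

	skb->sk = NULL;
	skb->destructor = NULL;
}

/* Hypothetical helper: run after the route lookup, before dst_input(). */
static void edemux_keep_sk_if_local(struct sk_buff *skb)
{
	assert(skb->dst != NULL);	/* routing must already have run */

	if (skb->dst->input != ip_local_deliver)
		skb_clear_noref_sk(skb);
}
```

With this rule the ip_mr_input() path never sees a noref sk, so it needs no
save/restore logic around the forwarding step.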

Paolo


* Re: [PATCH net-next 1/5] net: add support for noref skb->sk
  2017-09-20 17:41   ` Eric Dumazet
@ 2017-09-21  9:14     ` Paolo Abeni
  2017-09-21 10:35       ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2017-09-21  9:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Hi,

Thank you for looking at it!

On Wed, 2017-09-20 at 10:41 -0700, Eric Dumazet wrote:
> On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> > Noref sk do not carry a socket refcount, are valid
> > only inside the current RCU section and must be
> > explicitly cleared before exiting such section.
> > 
> > They will be used in a later patch to allow early demux
> > without sock refcounting.
> 
> 
> 
> 
> > +/* dummy destructor used by noref sockets */
> > +void sock_dummyfree(struct sk_buff *skb)
> > +{
> 
> BUG();
> 
> > +}
> > +EXPORT_SYMBOL(sock_dummyfree);
> > +

sock_dummyfree() can be called in legitimate paths (see below), but we
can add a:

WARN_ON_ONCE(!rcu_read_lock_held());

here and in skb_clear_noref_sk(). That should help a lot in catching
possible bugs.

> I do not see how you ensure we do not leave RCU section with an skb
> destructor pointing to this sock_dummyfree()
> 
> This patch series looks quite dangerous to me.

The idea is to explicitly clear the sknoref references before leaving
the RCU section. This is much like what we currently do for dst noref,
but here the only place where we get a noref socket is the socket early
demux, so the scope of this change is more limited than what we have
with noref dst_entries.

The relevant code is in the next 2 patches; after the demux we preserve
the sknoref only if the skb has a local destination. UDP will then set
the noref sk in its early demux lookup, and the skb will either:

* land on the corresponding UDP socket; the receive function will steal
the sknoref
* be dropped by some nft/iptables target; the dummy destructor is
called
* be forwarded by some nft/iptables target outside the input path; we
clear the sknoref explicitly in such targets.
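
The three outcomes above can be sketched as a small userspace model.
This is only an illustration under assumptions: sock_dummyfree() and
skb_clear_noref_sk() are the names used in the series, while the
*_model() helpers, the struct layouts and the call counter are
hypothetical and do not mirror the actual patches.

```c
#include <assert.h>
#include <stddef.h>

struct sock {
	int unused;
};

struct sk_buff {
	struct sock *sk;
	void (*destructor)(struct sk_buff *skb);
};

static int dummy_destructor_calls;

/* Dummy destructor: no reference was taken, so nothing is released. */
void sock_dummyfree(struct sk_buff *skb)
{
	(void)skb;
	dummy_destructor_calls++;
}

/* Early demux (hypothetical helper): store the socket without a refcount. */
void skb_set_noref_sk_model(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_dummyfree;
}

/* Outcome 1: the receive function steals the hint and owns it from here. */
struct sock *skb_steal_noref_sk_model(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	assert(skb->destructor == sock_dummyfree);
	skb->sk = NULL;
	skb->destructor = NULL;
	return sk;
}

/* Outcome 2: the skb is dropped; the free path runs the installed destructor. */
void kfree_skb_model(struct sk_buff *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->sk = NULL;
	skb->destructor = NULL;
}

/* Outcome 3: the skb leaves the input path; the hint is cleared explicitly. */
void skb_clear_noref_sk(struct sk_buff *skb)
{
	if (skb->destructor == sock_dummyfree) {
		skb->sk = NULL;
		skb->destructor = NULL;
	}
}
```

In all three cases the skb no longer points at the socket once the
handler returns; the model says nothing about paths that queue the skb
without going through one of these helpers.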

Currently there are a handful of places affected, and we can simplify
the code by dropping the early demux result for locally terminated
multicast sockets on a host acting as a multicast router; please see
the comment on the next patch.

> Do we really have real applications using connected UDP sockets and
> wanting very high pps throughput ?

The ultimate goal is to improve the unconnected UDP sockets scenario;
we do actually have use cases for that - DNS servers and VoIP SBCs.

Thanks,

Paolo

* Re: [PATCH net-next 0/5] net: introduce noref sk
  2017-09-21  3:20 ` [PATCH net-next 0/5] net: introduce noref sk David Miller
@ 2017-09-21  9:42   ` Paolo Abeni
  2017-09-21 10:37     ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Abeni @ 2017-09-21  9:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, pablo, fw, edumazet, hannes

Hi,

Thanks for the feedback!

On Wed, 2017-09-20 at 20:20 -0700, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Wed, 20 Sep 2017 18:54:00 +0200
> 
> > This series introduce the infrastructure to store inside the skb a socket
> > pointer without carrying a refcount to the socket.
> > 
> > Such infrastructure is then used in the network receive path - and
> > specifically the early demux operation.
> > 
> > This allows the UDP early demux to perform a full lookup for UDP sockets,
> > with many benefits:
> > 
> > - the UDP early demux code is now much simpler
> > - the early demux does not hit any performance penalties in case of UDP hash
> >   table collision - previously the early demux performed a partial, unsuccesful,
> >   lookup
> > - early demux is now operational also for unconnected sockets.
> > 
> > This infrastrcture will be used in follow-up series to allow dst caching for
> > unconnected UDP sockets, and than to extend the same features to TCP listening
> > sockets.
> 
> Like Eric, I find this series (while exciting) quite scary :-)
> 
> You really have to post some kind of performance numbers in your
> header posting in order to justify something with these ramifications
> and scale.

This is actually preparatory work for the next series, which will
bring in the real gain. The next patches are still to be polished, so we
posted this separately to get some early feedback.

If that would help, I can post the follow-up soon as RFC. Overall -
with the follow-up applied, too - when using a single rx ingress
queue, I measured ~20% throughput gain for unconnected ipv4 sockets -
with rp_filter disabled - and ~30% for ipv6 sockets. In case of multiple
ingress queues, the gain is smaller but still measurable (roughly 5%).

Please let me know if you prefer to see the full work early.

Thanks,

Paolo

* Re: [PATCH net-next 1/5] net: add support for noref skb->sk
  2017-09-21  9:14     ` Paolo Abeni
@ 2017-09-21 10:35       ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2017-09-21 10:35 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

On Thu, 2017-09-21 at 11:14 +0200, Paolo Abeni wrote:
> Hi,
> 
> Thank you for looking at it!
> 
> On Wed, 2017-09-20 at 10:41 -0700, Eric Dumazet wrote:
> > On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> > > Noref sk do not carry a socket refcount, are valid
> > > only inside the current RCU section and must be
> > > explicitly cleared before exiting such section.
> > > 
> > > They will be used in a later patch to allow early demux
> > > without sock refcounting.
> > 
> > 
> > 
> > 
> > > +/* dummy destructor used by noref sockets */
> > > +void sock_dummyfree(struct sk_buff *skb)
> > > +{
> > 
> > BUG();
> > 
> > > +}
> > > +EXPORT_SYMBOL(sock_dummyfree);
> > > +
> 
> We can call sock_dummyfree() in legitimate paths, see below, but we can
> add a:
> 
> WARN_ON_ONCE(!rcu_read_lock_held());

This won't be enough, see below.

> 
> here and in  skb_clear_noref_sk(). That should help much to catch
> possible bugs.
> 
> > I do not see how you ensure we do not leave RCU section with an skb
> > destructor pointing to this sock_dummyfree()
> > 
> > This patch series looks quite dangerous to me.
> 
> The idea is to explicitly clear the sknoref references before leaving
> the RCU section. Quite alike what we currently do for dst noref, but
> here the only place where we get a noref socket is the socket early
> demux, thus the scope of this change is more limited to what we have
> with noref dst_entries.
> 
> The relevant code is in the next 2 patches; after the demux we preserve
> the sknoref only if the skb has a local destination. The UDP socket
> will then set the noref on early demux lookup, and the skb will either:
> 
> * land on the corresponding UDP socket, the receive function will steal
> the sknoref
> * be dropped by some nft/iptables target - the dummy destructor is
> called
> * forwarded by some nft/iptables target outside the input path; we
> clear the skref explicitly in such targets. 
> 
> Currently there are an handful of places affected, and we can simplify
> the code dropping the early demux result for locally terminated
> multicast sockets on a host acting as a multicast router, please see
> the comment on the next patch.
> 
> > Do we really have real applications using connected UDP sockets and
> > wanting very high pps throughput ?
> 
> The ultimate goal is to improve the unconnected UDP sockets scenario,
> we do actually have use cases for that - DNS servers and VoIP SBCs.

Unconnected UDP traffic does not use refcounting on sk _already_.

And SO_REUSEPORT already allows us to handle all the traffic we want
_already_.


Please take a look at 71563f3414e917c62acd8e0fb0edf8ed6af63e4b

This might tell you why I am so nervous about your changes.

Checking WARN_ON_ONCE(!rcu_read_lock_held());
is not enough.

rcu_read_lock();
skb->destructor = sock_dummyfree;

/* queue the packet into an intermediate queue */
rcu_read_unlock();

....

rcu_read_lock();
...
if (skb->sk && skb->sk->sk_state == ...) // crash
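
To make the sequence above concrete, here is a minimal userspace model.
It is a sketch under assumptions, not kernel code: the *_model()
functions, the section counter and the noref_section field are all
hypothetical (the kernel has no such "section id"), and exist only to
show that "some RCU read lock is held" differs from "the read-side
section that produced the pointer is still open".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int rcu_depth;			/* models rcu_read_lock() nesting */
static unsigned long rcu_section_id;	/* bumped on each outermost lock */

void rcu_read_lock_model(void)
{
	if (rcu_depth++ == 0)
		rcu_section_id++;
}

void rcu_read_unlock_model(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

bool rcu_read_lock_held_model(void)
{
	return rcu_depth > 0;
}

struct sk_buff {
	void *sk;
	unsigned long noref_section;	/* section in which sk was stored */
};

void skb_set_noref_sk_model(struct sk_buff *skb, void *sk)
{
	assert(rcu_read_lock_held_model());
	skb->sk = sk;
	skb->noref_section = rcu_section_id;
}

/* What the proposed WARN_ON_ONCE() tests: is *any* read section open? */
bool warn_check_passes(void)
{
	return rcu_read_lock_held_model();
}

/* What actually matters: is it still the section that produced the pointer? */
bool noref_sk_still_valid(const struct sk_buff *skb)
{
	return skb->sk != NULL && rcu_read_lock_held_model() &&
	       skb->noref_section == rcu_section_id;
}
```

If the skb is queued in one section and dequeued in a later one,
warn_check_passes() is true while noref_sk_still_valid() is false; in
the real kernel the socket may have been freed in between.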

Also you covered IPv4, but really we need to forget about IPv4 and focus
on IPv6 only. And _then_ take care of IPv4 compat.

* Re: [PATCH net-next 0/5] net: introduce noref sk
  2017-09-21  9:42   ` Paolo Abeni
@ 2017-09-21 10:37     ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2017-09-21 10:37 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, pablo, fw, edumazet, hannes

On Thu, 2017-09-21 at 11:42 +0200, Paolo Abeni wrote:
> Hi,
> 
> Thanks for the feedback!
> 
> On Wed, 2017-09-20 at 20:20 -0700, David Miller wrote:
> > From: Paolo Abeni <pabeni@redhat.com>
> > Date: Wed, 20 Sep 2017 18:54:00 +0200
> > 
> > > This series introduce the infrastructure to store inside the skb a socket
> > > pointer without carrying a refcount to the socket.
> > > 
> > > Such infrastructure is then used in the network receive path - and
> > > specifically the early demux operation.
> > > 
> > > This allows the UDP early demux to perform a full lookup for UDP sockets,
> > > with many benefits:
> > > 
> > > - the UDP early demux code is now much simpler
> > > - the early demux does not hit any performance penalties in case of UDP hash
> > >   table collision - previously the early demux performed a partial, unsuccesful,
> > >   lookup
> > > - early demux is now operational also for unconnected sockets.
> > > 
> > > This infrastrcture will be used in follow-up series to allow dst caching for
> > > unconnected UDP sockets, and than to extend the same features to TCP listening
> > > sockets.
> > 
> > Like Eric, I find this series (while exciting) quite scary :-)
> > 
> > You really have to post some kind of performance numbers in your
> > header posting in order to justify something with these ramifications
> > and scale.
> 
> This is actually a preparatory work for the next series which will
> bring in the real gain. The next patches are still to be polished so we
>  posted this separately to get some early feedback. 
> 
> If that would help, I can post the follow-up soon as RFC. Overall -
> with the follow-up appplied, too - when using a single rx ingress
> queue, I measured ~20% tput gain for unconnected ipv4 sockets - with
> rp_filter disabled - and ~30% for ipv6 sockets. In case of multiple
> ingress queues, the gain is smaller but still measurable (roughly 5%). 
> 
> Please let me know if you prefer the see the full work early. 

I want to see the full work, yes. IPv6, and everything.

I do not want ~1000 lines of changed code in the stack for some corner
cases, where people do not properly use existing infra, like proper
SO_REUSEPORT with a proper BPF filter to have many clean silos (proper
CPU/NUMA affinities to avoid QPI traffic).

The complexity of your patches reached a point where I am extremely
nervous.

Thanks.

end of thread

Thread overview: 13+ messages
2017-09-20 16:54 [PATCH net-next 0/5] net: introduce noref sk Paolo Abeni
2017-09-20 16:54 ` [PATCH net-next 1/5] net: add support for noref skb->sk Paolo Abeni
2017-09-20 17:41   ` Eric Dumazet
2017-09-21  9:14     ` Paolo Abeni
2017-09-21 10:35       ` Eric Dumazet
2017-09-20 16:54 ` [PATCH net-next 2/5] net: allow early demux to fetch noref socket Paolo Abeni
2017-09-21  9:13   ` Paolo Abeni
2017-09-20 16:54 ` [PATCH net-next 3/5] udp: do not touch socket refcount in early demux Paolo Abeni
2017-09-20 16:54 ` [PATCH net-next 4/5] net: add simple socket-like dst cache helpers Paolo Abeni
2017-09-20 16:54 ` [PATCH net-next 5/5] udp: perform full socket lookup in early demux Paolo Abeni
2017-09-21  3:20 ` [PATCH net-next 0/5] net: introduce noref sk David Miller
2017-09-21  9:42   ` Paolo Abeni
2017-09-21 10:37     ` Eric Dumazet
