All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 00/11] udp: full early demux for unconnected sockets
@ 2017-09-22 21:06 Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 01/11] net: add support for noref skb->sk Paolo Abeni
                   ` (11 more replies)
  0 siblings, 12 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

This series refactor the UDP early demux code so that:

* full socket lookup is performed for unicast packets
* a sk is grabbed even for unconnected socket match
* a dst cache is used even in such scenario

To perform this tasks a couple of facilities are added:

* noref socket references, scoped inside the current RCU section, to be
  explicitly cleared before leaving such section
* a dst cache inside the inet and inet6 local addresses tables, caching the
  related local dst entry

The measured performance gain under small packet UDP flood is as follow:

ingress NIC	vanilla		patched		delta
rx queues	(kpps)		(kpps)		(%)
[ipv4]
1		2177		2414		10
2		2527		2892		14
3		3050		3733		22
4		3918		4643		18
5		5074		5699		12
6		5654		6869		21

[ipv6]
1		2002		2821		40
2		2087		3148		50
3		2583		4008		55
4		3072		4963		61
5		3719		5992		61
6		4314		6910		60

The number of user space process in use is equal to the number of
NIC rx queue; when multiple user space processes the SO_REUSEPORT 
options is used, as described below:

ethtool  -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
        udp_sink  --reuse-port --recvfrom --count 1000000000 --port 9 $1 &
        taskset -p $((MASK << ($I + $n) )) $!
done

Paolo Abeni (11):
  net: add support for noref skb->sk
  net: allow early demux to fetch noref socket
  udp: do not touch socket refcount in early demux
  net: add simple socket-like dst cache helpers
  udp: perform full socket lookup in early demux
  ip/route: factor out helper for local route creation
  ipv6/addrconf: add an helper for inet6 address lookup
  net: implement local route cache inside ifaddr
  route: add ipv4/6 helpers to do partial route lookup vs local dst
  IP: early demux can return an error code
  udp: dst lookup in early demux for unconnected sockets

 include/linux/inetdevice.h       |   4 ++
 include/linux/skbuff.h           |  31 +++++++++++
 include/linux/udp.h              |   2 +
 include/net/addrconf.h           |   3 ++
 include/net/dst.h                |  20 +++++++
 include/net/if_inet6.h           |   4 ++
 include/net/ip6_route.h          |   1 +
 include/net/protocol.h           |   4 +-
 include/net/route.h              |   4 ++
 include/net/tcp.h                |   2 +-
 include/net/udp.h                |   2 +-
 net/core/dst.c                   |  12 +++++
 net/core/sock.c                  |   7 +++
 net/ipv4/devinet.c               |  29 ++++++++++-
 net/ipv4/ip_input.c              |  33 ++++++++----
 net/ipv4/netfilter/nf_dup_ipv4.c |   3 ++
 net/ipv4/route.c                 |  73 +++++++++++++++++++++++---
 net/ipv4/tcp_ipv4.c              |   9 ++--
 net/ipv4/udp.c                   |  95 +++++++++++++++-------------------
 net/ipv6/addrconf.c              | 109 +++++++++++++++++++++++++++------------
 net/ipv6/ip6_input.c             |   4 ++
 net/ipv6/netfilter/nf_dup_ipv6.c |   3 ++
 net/ipv6/route.c                 |  13 +++++
 net/ipv6/udp.c                   |  72 ++++++++++----------------
 net/netfilter/nf_queue.c         |   3 ++
 25 files changed, 383 insertions(+), 159 deletions(-)

-- 
2.13.5

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 01/11] net: add support for noref skb->sk
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 02/11] net: allow early demux to fetch noref socket Paolo Abeni
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Noref sk do not carry a socket refcount, are valid
only inside the current RCU section and must be
explicitly cleared before exiting such section.

They will be used in a later patch to allow early demux
without sock refcounting.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 31 +++++++++++++++++++++++++++++++
 net/core/sock.c        |  7 +++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 492828801acb..c3fc32636690 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -922,6 +922,37 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 	return (struct rtable *)skb_dst(skb);
 }
 
+void sock_dummyfree(struct sk_buff *skb);
+
+/* only early demux can set noref socks
+ * noref socks do not carry any refcount and must be
+ * cleared before exiting the current RCU section
+ */
+static inline void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
+{
+	skb->sk = sk;
+	skb->destructor = sock_dummyfree;
+}
+
+static inline bool skb_has_noref_sk(struct sk_buff *skb)
+{
+	return skb->destructor == sock_dummyfree;
+}
+
+static inline struct sock *skb_clear_noref_sk(struct sk_buff *skb)
+{
+	struct sock *ret;
+
+	if (!skb_has_noref_sk(skb))
+		return NULL;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	ret = skb->sk;
+	skb->sk = NULL;
+	skb->destructor = NULL;
+	return ret;
+}
+
 /* For mangling skb->pkt_type from user space side from applications
  * such as nft, tc, etc, we only allow a conservative subset of
  * possible pkt_types to be set.
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..33da8e7e58a0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1893,6 +1893,13 @@ void sock_efree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_efree);
 
+/* dummy destructor used by noref sockets */
+void sock_dummyfree(struct sk_buff *skb)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+}
+EXPORT_SYMBOL(sock_dummyfree);
+
 kuid_t sock_i_uid(struct sock *sk)
 {
 	kuid_t uid;
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 02/11] net: allow early demux to fetch noref socket
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 01/11] net: add support for noref skb->sk Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 03/11] udp: do not touch socket refcount in early demux Paolo Abeni
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

We must be careful to avoid leaking such sockets outside
the RCU section containing the early demux call; we clear
them on nonlocal delivery.

For ipv4 we clear sknoref even for multicast traffic entering
the ip_mr_input() path; we will lose the mcast early demux
optimization when the host is acting as multicast router, but
that will help to keep to code simple.

Also update all iptables/nftables extension that can
happen in the input chain and can transmit the skb outside
such patch, namely TEE, nft_dup and nfqueue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/ip_input.c              | 8 ++++++++
 net/ipv4/netfilter/nf_dup_ipv4.c | 3 +++
 net/ipv6/ip6_input.c             | 4 ++++
 net/ipv6/netfilter/nf_dup_ipv6.c | 3 +++
 net/netfilter/nf_queue.c         | 3 +++
 5 files changed, 21 insertions(+)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index fa2dc8f692c6..5690ef09da28 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -351,6 +351,14 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 		}
 	}
 
+	/* Since the sk has no reference to the socket, we must
+	 * clear it before escaping this RCU section.
+	 * The sk is just an hint and we know we are not going to use
+	 * it outside the input path.
+	 */
+	if (skb_dst(skb)->input != ip_local_deliver)
+		skb_clear_noref_sk(skb);
+
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	if (unlikely(skb_dst(skb)->tclassid)) {
 		struct ip_rt_acct *st = this_cpu_ptr(ip_rt_acct);
diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index 39895b9ddeb9..bf8b78492fc8 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -71,6 +71,9 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 	nf_reset(skb);
 	nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 #endif
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	/*
 	 * If we are in PREROUTING/INPUT, decrease the TTL to mitigate potential
 	 * loops between two hosts.
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 9ee208a348f5..e15ec2d36b9e 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -68,6 +68,10 @@ int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	if (!skb_valid_dst(skb))
 		ip6_route_input(skb);
 
+	/* see comment on ipv4 edmux */
+	if (skb_dst(skb)->input != ip6_input)
+		skb_clear_noref_sk(skb);
+
 	return dst_input(skb);
 }
 
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index 4a7ddeddbaab..939f6a2238f9 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -60,6 +60,9 @@ void nf_dup_ipv6(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 	nf_reset(skb);
 	nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 #endif
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	if (hooknum == NF_INET_PRE_ROUTING ||
 	    hooknum == NF_INET_LOCAL_IN) {
 		struct ipv6hdr *iph = ipv6_hdr(skb);
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index f7e21953b1de..100eff08cb51 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -145,6 +145,9 @@ static int __nf_queue(struct sk_buff *skb, const struct nf_hook_state *state,
 		.size	= sizeof(*entry) + afinfo->route_key_size,
 	};
 
+	/* Avoid leaking noref sk outside the input path */
+	skb_clear_noref_sk(skb);
+
 	nf_queue_entry_get_refs(entry);
 	skb_dst_force(skb);
 	afinfo->saveroute(skb, entry);
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 03/11] udp: do not touch socket refcount in early demux
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 01/11] net: add support for noref skb->sk Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 02/11] net: allow early demux to fetch noref socket Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 04/11] net: add simple socket-like dst cache helpers Paolo Abeni
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

use noref sockets instead. This gives some small performance
improvements and will allow efficient early demux for unconnected
sockets in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 18 ++++++++++--------
 net/ipv6/udp.c | 10 ++++++----
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 784ced0b9150..ba49d5aa9f09 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2050,12 +2050,13 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
 int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		   int proto)
 {
-	struct sock *sk;
-	struct udphdr *uh;
-	unsigned short ulen;
+	struct net *net = dev_net(skb->dev);
 	struct rtable *rt = skb_rtable(skb);
+	unsigned short ulen;
 	__be32 saddr, daddr;
-	struct net *net = dev_net(skb->dev);
+	struct udphdr *uh;
+	struct sock *sk;
+	bool noref_sk;
 
 	/*
 	 *  Validate the packet.
@@ -2081,6 +2082,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	if (udp4_csum_init(skb, uh, proto))
 		goto csum_error;
 
+	noref_sk = skb_has_noref_sk(skb);
 	sk = skb_steal_sock(skb);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
@@ -2090,7 +2092,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp_sk_rx_dst_set(sk, dst);
 
 		ret = udp_queue_rcv_skb(sk, skb);
-		sock_put(sk);
+		if (!noref_sk)
+			sock_put(sk);
 		/* a return value > 0 means to resubmit the input, but
 		 * it wants the return to be -protocol, or 0
 		 */
@@ -2261,11 +2264,10 @@ void udp_v4_early_demux(struct sk_buff *skb)
 					     uh->source, iph->saddr, dif, sdif);
 	}
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return;
 
-	skb->sk = sk;
-	skb->destructor = sock_efree;
+	skb_set_noref_sk(skb, sk);
 	dst = READ_ONCE(sk->sk_rx_dst);
 
 	if (dst)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e2ecfb137297..8f62392c4c35 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -787,6 +787,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	struct net *net = dev_net(skb->dev);
 	struct udphdr *uh;
 	struct sock *sk;
+	bool noref_sk;
 	u32 ulen = 0;
 
 	if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -823,6 +824,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		goto csum_error;
 
 	/* Check if the socket is already available, e.g. due to early demux */
+	noref_sk = skb_has_noref_sk(skb);
 	sk = skb_steal_sock(skb);
 	if (sk) {
 		struct dst_entry *dst = skb_dst(skb);
@@ -832,7 +834,8 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 			udp6_sk_rx_dst_set(sk, dst);
 
 		ret = udpv6_queue_rcv_skb(sk, skb);
-		sock_put(sk);
+		if (!noref_sk)
+			sock_put(sk);
 
 		/* a return value > 0 means to resubmit the input */
 		if (ret > 0)
@@ -948,11 +951,10 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	else
 		return;
 
-	if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+	if (!sk)
 		return;
 
-	skb->sk = sk;
-	skb->destructor = sock_efree;
+	skb_set_noref_sk(skb, sk);
 	dst = READ_ONCE(sk->sk_rx_dst);
 
 	if (dst)
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 04/11] net: add simple socket-like dst cache helpers
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (2 preceding siblings ...)
  2017-09-22 21:06 ` [RFC PATCH 03/11] udp: do not touch socket refcount in early demux Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 05/11] udp: perform full socket lookup in early demux Paolo Abeni
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

It will be used by later patches to reduce code duplication.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/dst.h | 20 ++++++++++++++++++++
 net/core/dst.c    | 12 ++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..4fcca0e368c6 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -485,6 +485,26 @@ static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 	return dst;
 }
 
+/* update the cache with dst, assuming the latter already carries a refcount */
+static inline bool __dst_update(struct dst_entry **cache, struct dst_entry *dst)
+{
+	struct dst_entry *old = xchg(cache, dst);
+
+	dst_release(old);
+	return old != dst;
+}
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst);
+static inline struct dst_entry *dst_access(struct dst_entry **cache,
+					      u32 cookie)
+{
+	struct dst_entry *dst = READ_ONCE(*cache);
+
+	if (!dst)
+		return NULL;
+
+	return dst_check(dst, cookie);
+}
+
 /* Flags for xfrm_lookup flags argument. */
 enum {
 	XFRM_LOOKUP_ICMP = 1 << 0,
diff --git a/net/core/dst.c b/net/core/dst.c
index a6c47da7d0f8..4076f9af45d7 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -205,6 +205,18 @@ void dst_release_immediate(struct dst_entry *dst)
 }
 EXPORT_SYMBOL(dst_release_immediate);
 
+/* update the cache with dst, assuming the latter does not carry a refcount */
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst)
+{
+	if (likely(*cache == dst))
+		return false;
+
+	if (dst_hold_safe(dst))
+		return __dst_update(cache, dst);
+	return false;
+}
+EXPORT_SYMBOL_GPL(dst_update);
+
 u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 {
 	struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 05/11] udp: perform full socket lookup in early demux
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (3 preceding siblings ...)
  2017-09-22 21:06 ` [RFC PATCH 04/11] net: add simple socket-like dst cache helpers Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 06/11] ip/route: factor out helper for local route creation Paolo Abeni
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Since UDP early demux lookup fetches noref socket references,
we can safely be optimistic about it and set the sk reference
even if the skb is not going to land on such socket, avoiding
the rx dst cache usage for unconnected unicast sockets.

This avoids a second lookup for unconnected sockets, and clean
up a bit the whole udp early demux code.

After this change, on hosts not acting as routers, the UDP
early demux never affect negatively the receive performances,
while before this change UDP early demux caused measurable
performance impact for unconnected sockets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/udp.h |  2 ++
 net/ipv4/udp.c      | 62 +++++++++++++++++++----------------------------------
 net/ipv6/udp.c      | 57 ++++++++++++------------------------------------
 3 files changed, 38 insertions(+), 83 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..9c68b57543cc 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -92,6 +92,8 @@ static inline struct udp_sock *udp_sk(const struct sock *sk)
 	return (struct udp_sock *)sk;
 }
 
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie);
+
 static inline void udp_set_no_check6_tx(struct sock *sk, bool val)
 {
 	udp_sk(sk)->no_check6_tx = val;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ba49d5aa9f09..5cbbd78024dc 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2043,6 +2043,11 @@ static inline int udp4_csum_init(struct sk_buff *skb, struct udphdr *uh,
 							 inet_compute_pseudo);
 }
 
+static bool udp_use_rx_dst_cache(struct sock *sk, struct sk_buff *skb)
+{
+	return sk->sk_state == TCP_ESTABLISHED || skb->pkt_type != PACKET_HOST;
+}
+
 /*
  *	All we need to do is get the socket, and then do a checksum.
  */
@@ -2088,8 +2093,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
-			udp_sk_rx_dst_set(sk, dst);
+		if (udp_use_rx_dst_cache(sk, skb))
+			dst_update(&sk->sk_rx_dst, dst);
 
 		ret = udp_queue_rcv_skb(sk, skb);
 		if (!noref_sk)
@@ -2196,42 +2201,28 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 	return result;
 }
 
-/* For unicast we should only early demux connected sockets or we can
- * break forwarding setups.  The chains here can be long so only check
- * if the first socket is an exact match and if not move on.
- */
-static struct sock *__udp4_lib_demux_lookup(struct net *net,
-					    __be16 loc_port, __be32 loc_addr,
-					    __be16 rmt_port, __be32 rmt_addr,
-					    int dif, int sdif)
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie)
 {
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
+	struct dst_entry *dst = dst_access(&sk->sk_rx_dst, cookie);
 
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (INET_MATCH(sk, net, acookie, rmt_addr,
-			       loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
+	if (dst) {
+		/* set noref for now.
+		 * any place which wants to hold dst has to call
+		 * dst_hold_safe()
+		 */
+		skb_dst_set_noref(skb, dst);
 	}
-	return NULL;
 }
+EXPORT_SYMBOL_GPL(udp_set_skb_rx_dst);
 
 void udp_v4_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
+	int dif = skb->dev->ifindex;
+	int sdif = inet_sdif(skb);
 	const struct iphdr *iph;
 	const struct udphdr *uh;
 	struct sock *sk = NULL;
-	struct dst_entry *dst;
-	int dif = skb->dev->ifindex;
-	int sdif = inet_sdif(skb);
 	int ours;
 
 	/* validate the packet */
@@ -2260,25 +2251,16 @@ void udp_v4_early_demux(struct sk_buff *skb)
 						   uh->source, iph->saddr,
 						   dif, sdif);
 	} else if (skb->pkt_type == PACKET_HOST) {
-		sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
-					     uh->source, iph->saddr, dif, sdif);
+		sk = __udp4_lib_lookup(net, iph->saddr, uh->source, iph->daddr,
+				       uh->dest, dif, sdif, &udp_table, skb);
 	}
 
 	if (!sk)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, 0);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp_use_rx_dst_cache(sk, skb))
+		udp_set_skb_rx_dst(sk, skb, 0);
 }
 
 int udp_rcv(struct sk_buff *skb)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8f62392c4c35..67d340679c3a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -773,13 +773,18 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 
 static void udp6_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	if (udp_sk_rx_dst_set(sk, dst)) {
+	if (unlikely(dst_update(&sk->sk_rx_dst, dst))) {
 		const struct rt6_info *rt = (const struct rt6_info *)dst;
 
 		inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
 	}
 }
 
+static bool udp6_use_rx_dst_cache(struct sock *sk)
+{
+	return sk->sk_state == TCP_ESTABLISHED;
+}
+
 int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		   int proto)
 {
@@ -830,7 +835,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 		struct dst_entry *dst = skb_dst(skb);
 		int ret;
 
-		if (unlikely(sk->sk_rx_dst != dst))
+		if (udp6_use_rx_dst_cache(sk))
 			udp6_sk_rx_dst_set(sk, dst);
 
 		ret = udpv6_queue_rcv_skb(sk, skb);
@@ -905,37 +910,13 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 	return 0;
 }
 
-
-static struct sock *__udp6_lib_demux_lookup(struct net *net,
-			__be16 loc_port, const struct in6_addr *loc_addr,
-			__be16 rmt_port, const struct in6_addr *rmt_addr,
-			int dif, int sdif)
-{
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = udp6_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-	struct sock *sk;
-
-	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		if (sk->sk_state == TCP_ESTABLISHED &&
-		    INET6_MATCH(sk, net, rmt_addr, loc_addr, ports, dif, sdif))
-			return sk;
-		/* Only check first socket in chain */
-		break;
-	}
-	return NULL;
-}
-
 static void udp_v6_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
-	const struct udphdr *uh;
-	struct sock *sk;
-	struct dst_entry *dst;
 	int dif = skb->dev->ifindex;
 	int sdif = inet6_sdif(skb);
+	const struct udphdr *uh;
+	struct sock *sk;
 
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) +
 	    sizeof(struct udphdr)))
@@ -944,10 +925,9 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 	uh = udp_hdr(skb);
 
 	if (skb->pkt_type == PACKET_HOST)
-		sk = __udp6_lib_demux_lookup(net, uh->dest,
-					     &ipv6_hdr(skb)->daddr,
-					     uh->source, &ipv6_hdr(skb)->saddr,
-					     dif, sdif);
+		sk = __udp6_lib_lookup(net, &ipv6_hdr(skb)->saddr, uh->source,
+				       &ipv6_hdr(skb)->daddr, uh->dest, dif,
+				       sdif, &udp_table, skb);
 	else
 		return;
 
@@ -955,17 +935,8 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	dst = READ_ONCE(sk->sk_rx_dst);
-
-	if (dst)
-		dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
-	if (dst) {
-		/* set noref for now.
-		 * any place which wants to hold dst has to call
-		 * dst_hold_safe()
-		 */
-		skb_dst_set_noref(skb, dst);
-	}
+	if (udp6_use_rx_dst_cache(sk))
+		udp_set_skb_rx_dst(sk, skb, inet6_sk(sk)->rx_dst_cookie);
 }
 
 static __inline__ int udpv6_rcv(struct sk_buff *skb)
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 06/11] ip/route: factor out helper for local route creation
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (4 preceding siblings ...)
  2017-09-22 21:06 ` [RFC PATCH 05/11] udp: perform full socket lookup in early demux Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 07/11] ipv6/addrconf: add an helper for inet6 address lookup Paolo Abeni
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Will be used by a later patch to build the ifaddr dst cache.
No functional changes are introduced here.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/route.h |  2 ++
 net/ipv4/route.c    | 30 ++++++++++++++++++++++--------
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 1b09a9368c68..ec09c3d73581 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -176,6 +176,8 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4
 	return ip_route_output_key(net, fl4);
 }
 
+struct rtable *ip_local_route_alloc(struct net_device *dev, unsigned int flags,
+				    u32 itag, unsigned char type, bool docache);
 int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
 			 u8 tos, struct net_device *devin);
 int ip_route_input_rcu(struct sk_buff *skb, __be32 dst, __be32 src,
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 94d4cd2d5ea4..515589f1b3d1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1859,6 +1859,27 @@ static int ip_mkroute_input(struct sk_buff *skb,
 	return __mkroute_input(skb, res, in_dev, daddr, saddr, tos);
 }
 
+struct rtable *ip_local_route_alloc(struct net_device *dev, unsigned int flags,
+				    u32 itag, unsigned char type, bool do_cache)
+{
+	struct in_device *in_dev = __in_dev_get_rcu(dev);
+	struct net *net = dev_net(dev);
+	struct rtable *rth;
+
+	rth = rt_dst_alloc(l3mdev_master_dev_rcu(dev) ? : net->loopback_dev,
+			   flags | RTCF_LOCAL, type,
+			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
+	if (!rth)
+		return NULL;
+
+	rth->dst.output= ip_rt_bug;
+#ifdef CONFIG_IP_ROUTE_CLASSID
+	rth->dst.tclassid = itag;
+#endif
+	rth->rt_is_input = 1;
+	return rth;
+}
+
 /*
  *	NOTE. We drop all the packets that has local source
  *	addresses, because every properly looped back packet
@@ -1996,17 +2017,10 @@ out:	return err;
 		}
 	}
 
-	rth = rt_dst_alloc(l3mdev_master_dev_rcu(dev) ? : net->loopback_dev,
-			   flags | RTCF_LOCAL, res->type,
-			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
+	rth = ip_local_route_alloc(dev, flags, itag, res->type, do_cache);
 	if (!rth)
 		goto e_nobufs;
 
-	rth->dst.output= ip_rt_bug;
-#ifdef CONFIG_IP_ROUTE_CLASSID
-	rth->dst.tclassid = itag;
-#endif
-	rth->rt_is_input = 1;
 	if (res->table)
 		rth->rt_table_id = res->table->tb_id;
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 07/11] ipv6/addrconf: add an helper for inet6 address lookup
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (5 preceding siblings ...)
  2017-09-22 21:06 ` [RFC PATCH 06/11] ip/route: factor out helper for local route creation Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 08/11] net: implement local route cache inside ifaddr Paolo Abeni
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

reduce code duplication and will simplify follow-up patch

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv6/addrconf.c | 65 +++++++++++++++++++++++++----------------------------
 1 file changed, 31 insertions(+), 34 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c2e2a78787ec..5940062cac8d 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1796,35 +1796,46 @@ int ipv6_chk_addr(struct net *net, const struct in6_addr *addr,
 }
 EXPORT_SYMBOL(ipv6_chk_addr);
 
+/* called under RCU lock with bh disabled */
+static struct inet6_ifaddr *ipv6_lookup_ifaddr_rcu_bh(struct net *net,
+						    const struct in6_addr *addr)
+{
+	unsigned int hash = inet6_addr_hash(addr);
+	struct inet6_ifaddr *ifp;
+
+	hlist_for_each_entry_rcu_bh(ifp, &inet6_addr_lst[hash], addr_lst)
+		if (net_eq(dev_net(ifp->idev->dev), net) &&
+		    ipv6_addr_equal(&ifp->addr, addr))
+			return ifp;
+
+	return NULL;
+}
+
 int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr,
 			    const struct net_device *dev, int strict,
 			    u32 banned_flags)
 {
 	struct inet6_ifaddr *ifp;
-	unsigned int hash = inet6_addr_hash(addr);
 	u32 ifp_flags;
+	int ret = 0;
 
 	rcu_read_lock_bh();
-	hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) {
-		if (!net_eq(dev_net(ifp->idev->dev), net))
-			continue;
+	ifp = ipv6_lookup_ifaddr_rcu_bh(net, addr);
+	if (ifp) {
 		/* Decouple optimistic from tentative for evaluation here.
 		 * Ban optimistic addresses explicitly, when required.
 		 */
 		ifp_flags = (ifp->flags&IFA_F_OPTIMISTIC)
 			    ? (ifp->flags&~IFA_F_TENTATIVE)
 			    : ifp->flags;
-		if (ipv6_addr_equal(&ifp->addr, addr) &&
-		    !(ifp_flags&banned_flags) &&
+		if (!(ifp_flags&banned_flags) &&
 		    (!dev || ifp->idev->dev == dev ||
-		     !(ifp->scope&(IFA_LINK|IFA_HOST) || strict))) {
-			rcu_read_unlock_bh();
-			return 1;
-		}
+		     !(ifp->scope&(IFA_LINK|IFA_HOST) || strict)))
+			ret = 1;
 	}
 
 	rcu_read_unlock_bh();
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL(ipv6_chk_addr_and_flags);
 
@@ -1900,20 +1911,13 @@ struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net, const struct in6_addr *add
 				     struct net_device *dev, int strict)
 {
 	struct inet6_ifaddr *ifp, *result = NULL;
-	unsigned int hash = inet6_addr_hash(addr);
 
 	rcu_read_lock_bh();
-	hlist_for_each_entry_rcu_bh(ifp, &inet6_addr_lst[hash], addr_lst) {
-		if (!net_eq(dev_net(ifp->idev->dev), net))
-			continue;
-		if (ipv6_addr_equal(&ifp->addr, addr)) {
-			if (!dev || ifp->idev->dev == dev ||
-			    !(ifp->scope&(IFA_LINK|IFA_HOST) || strict)) {
-				result = ifp;
-				in6_ifa_hold(ifp);
-				break;
-			}
-		}
+	ifp = ipv6_lookup_ifaddr_rcu_bh(net, addr);
+	if (ifp && (!dev || ifp->idev->dev == dev ||
+		    !(ifp->scope & (IFA_LINK|IFA_HOST) || strict))) {
+		result = ifp;
+		in6_ifa_hold(ifp);
 	}
 	rcu_read_unlock_bh();
 
@@ -4226,20 +4230,13 @@ void if6_proc_exit(void)
 /* Check if address is a home address configured on any interface. */
 int ipv6_chk_home_addr(struct net *net, const struct in6_addr *addr)
 {
-	int ret = 0;
 	struct inet6_ifaddr *ifp = NULL;
-	unsigned int hash = inet6_addr_hash(addr);
+	int ret = 0;
 
 	rcu_read_lock_bh();
-	hlist_for_each_entry_rcu_bh(ifp, &inet6_addr_lst[hash], addr_lst) {
-		if (!net_eq(dev_net(ifp->idev->dev), net))
-			continue;
-		if (ipv6_addr_equal(&ifp->addr, addr) &&
-		    (ifp->flags & IFA_F_HOMEADDRESS)) {
-			ret = 1;
-			break;
-		}
-	}
+	ifp = ipv6_lookup_ifaddr_rcu_bh(net, addr);
+	if (ifp && ifp->flags & IFA_F_HOMEADDRESS)
+		ret = 1;
 	rcu_read_unlock_bh();
 	return ret;
 }
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 08/11] net: implement local route cache inside ifaddr
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (6 preceding siblings ...)
  2017-09-22 21:06 ` [RFC PATCH 07/11] ipv6/addrconf: add an helper for inet6 address lookup Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:06 ` [RFC PATCH 09/11] route: add ipv4/6 helpers to do partial route lookup vs local dst Paolo Abeni
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

add storage and helpers to associate an ipv{4,6} address
with the local route to self. This will be used by a
later patch to implement early demux for unconnected UDP
sockets.

The caches are filled on address creation, with DST_OBSOLETE_NONE.
Ipv6 cache are explicitly clearered and refreshed on underlaying
device down/up events.

The above schema is simpler than refreshing the cache every
time the dst expires under the default obsolete schema.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/inetdevice.h |  4 ++++
 include/net/addrconf.h     |  3 +++
 include/net/if_inet6.h     |  4 ++++
 net/ipv4/devinet.c         | 29 ++++++++++++++++++++++++++++-
 net/ipv6/addrconf.c        | 44 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 751d051f0bc7..c29982f178bb 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -130,6 +130,8 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 #define IN_DEV_ARP_IGNORE(in_dev)	IN_DEV_MAXCONF((in_dev), ARP_IGNORE)
 #define IN_DEV_ARP_NOTIFY(in_dev)	IN_DEV_MAXCONF((in_dev), ARP_NOTIFY)
 
+struct dst_entry;
+
 struct in_ifaddr {
 	struct hlist_node	hash;
 	struct in_ifaddr	*ifa_next;
@@ -149,6 +151,7 @@ struct in_ifaddr {
 	__u32			ifa_preferred_lft;
 	unsigned long		ifa_cstamp; /* created timestamp */
 	unsigned long		ifa_tstamp; /* updated timestamp */
+	struct dst_entry	*dst; /* local route to self */
 };
 
 struct in_validator_info {
@@ -180,6 +183,7 @@ __be32 inet_confirm_addr(struct net *net, struct in_device *in_dev, __be32 dst,
 struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix,
 				    __be32 mask);
 struct in_ifaddr *inet_lookup_ifaddr_rcu(struct net *net, __be32 addr);
+struct dst_entry *inet_get_ifaddr_dst_rcu(struct net *net, __be32 addr);
 static __inline__ bool inet_ifa_match(__be32 addr, struct in_ifaddr *ifa)
 {
 	return !((addr^ifa->ifa_address)&ifa->ifa_mask);
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 87981cd63180..bdfa3306a4c5 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -87,6 +87,9 @@ struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net,
 				     const struct in6_addr *addr,
 				     struct net_device *dev, int strict);
 
+struct dst_entry *inet6_get_ifaddr_dst_rcu_bh(struct net *net,
+					      const struct in6_addr *addr);
+
 int ipv6_dev_get_saddr(struct net *net, const struct net_device *dev,
 		       const struct in6_addr *daddr, unsigned int srcprefs,
 		       struct in6_addr *saddr);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d4088d1a688d..1dd42e7c17a4 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -39,6 +39,8 @@ enum {
 	INET6_IFADDR_STATE_DEAD,
 };
 
+struct dst_entry;
+
 struct inet6_ifaddr {
 	struct in6_addr		addr;
 	__u32			prefix_len;
@@ -77,6 +79,8 @@ struct inet6_ifaddr {
 
 	struct rcu_head		rcu;
 	struct in6_addr		peer_addr;
+
+	struct dst_entry	*dst; /* local route to self */
 };
 
 struct ip6_sf_socklist {
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 7ce22a2c07ce..a7748f787866 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -179,6 +179,17 @@ struct in_ifaddr *inet_lookup_ifaddr_rcu(struct net *net, __be32 addr)
 	return NULL;
 }
 
+/* called under RCU lock */
+struct dst_entry *inet_get_ifaddr_dst_rcu(struct net *net, __be32 addr)
+{
+	struct in_ifaddr *ifa = inet_lookup_ifaddr_rcu(net, addr);
+
+	if (!ifa)
+		return NULL;
+
+	return dst_access(&ifa->dst, 0);
+}
+
 static void rtmsg_ifa(int event, struct in_ifaddr *, struct nlmsghdr *, u32);
 
 static BLOCKING_NOTIFIER_HEAD(inetaddr_chain);
@@ -337,6 +348,7 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
 	struct in_ifaddr *last_prim = in_dev->ifa_list;
 	struct in_ifaddr *prev_prom = NULL;
 	int do_promote = IN_DEV_PROMOTE_SECONDARIES(in_dev);
+	struct dst_entry *dst;
 
 	ASSERT_RTNL();
 
@@ -395,7 +407,12 @@ static void __inet_del_ifa(struct in_device *in_dev, struct in_ifaddr **ifap,
 	*ifap = ifa1->ifa_next;
 	inet_hash_remove(ifa1);
 
-	/* 3. Announce address deletion */
+	/* 3. Clear dst cache */
+
+	dst = xchg(&ifa1->dst, NULL);
+	dst_release(dst);
+
+	/* 4. Announce address deletion */
 
 	/* Send message first, then call notifier.
 	   At first sight, FIB update triggered by notifier
@@ -449,6 +466,7 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	struct in_device *in_dev = ifa->ifa_dev;
 	struct in_ifaddr *ifa1, **ifap, **last_primary;
 	struct in_validator_info ivi;
+	struct rtable *rt;
 	int ret;
 
 	ASSERT_RTNL();
@@ -516,6 +534,15 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	rtmsg_ifa(RTM_NEWADDR, ifa, nlh, portid);
 	blocking_notifier_call_chain(&inetaddr_chain, NETDEV_UP, ifa);
 
+	/* fill the dst cache and transfer the dst ownership to it */
+	rt = ip_local_route_alloc(in_dev->dev, 0, 0, RTN_LOCAL, false);
+	if (rt) {
+		/* the local route will be valid for till the address will be
+		 * up
+		 */
+		rt->dst.obsolete = DST_OBSOLETE_NONE;
+		__dst_update(&ifa->dst, &rt->dst);
+	}
 	return 0;
 }
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5940062cac8d..5fa8d1b764ca 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -904,6 +904,26 @@ static int addrconf_fixup_linkdown(struct ctl_table *table, int *p, int newf)
 
 #endif
 
+static void inet6_addr_dst_clear(struct inet6_ifaddr *ifp)
+{
+	dst_release(xchg(&ifp->dst, NULL));
+}
+
+static void inet6_addr_dst_update(struct inet6_dev *idev,
+				  struct inet6_ifaddr *ifp)
+{
+	struct rt6_info *rt = addrconf_dst_alloc(idev, &ifp->addr, false);
+
+	if (IS_ERR(rt))
+		return;
+
+	/* we are going to manully clear the cache when the related dev will
+	 * go down
+	 */
+	rt->dst.obsolete = DST_OBSOLETE_NONE;
+	__dst_update(&ifp->dst, &rt->dst);
+}
+
 /* Nobody refers to this ifaddr, destroy it */
 void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
 {
@@ -914,6 +934,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
 #endif
 
 	in6_dev_put(ifp->idev);
+	inet6_addr_dst_clear(ifp);
 
 	if (cancel_delayed_work(&ifp->dad_work))
 		pr_notice("delayed DAD work was pending while freeing ifa=%p\n",
@@ -1907,6 +1928,18 @@ int ipv6_chk_prefix(const struct in6_addr *addr, struct net_device *dev)
 }
 EXPORT_SYMBOL(ipv6_chk_prefix);
 
+/* called under RCU lock */
+struct dst_entry *inet6_get_ifaddr_dst_rcu_bh(struct net *net,
+					      const struct in6_addr *addr)
+{
+	struct inet6_ifaddr *ifp = ipv6_lookup_ifaddr_rcu_bh(net, addr);
+
+	if (!ifp)
+		return NULL;
+
+	return dst_access(&ifp->dst, 0);
+}
+
 struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net, const struct in6_addr *addr,
 				     struct net_device *dev, int strict)
 {
@@ -3300,6 +3333,15 @@ static int fixup_permanent_addr(struct inet6_dev *idev,
 		ifp->rt = rt;
 		spin_unlock(&ifp->lock);
 
+		/* if dad is not going to start, we must cache the new route
+		 * elsewhere we can just clear the old one
+		 */
+		if (ifp->state == INET6_IFADDR_STATE_PREDAD ||
+		    ifp->flags & IFA_F_TENTATIVE)
+			inet6_addr_dst_clear(ifp);
+		else
+			inet6_addr_dst_update(idev, ifp);
+
 		ip6_rt_put(prev);
 	}
 
@@ -3679,6 +3721,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 
 		spin_unlock_bh(&ifa->lock);
 
+		inet6_addr_dst_clear(ifa);
 		if (rt)
 			ip6_del_rt(rt);
 
@@ -4009,6 +4052,7 @@ static void addrconf_dad_completed(struct inet6_ifaddr *ifp, bool bump_id)
 	 */
 
 	ipv6_ifa_notify(RTM_NEWADDR, ifp);
+	inet6_addr_dst_update(ifp->idev, ifp);
 
 	/* If added prefix is link local and we are prepared to process
 	   router advertisements, start sending router solicitations.
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 09/11] route: add ipv4/6 helpers to do partial route lookup vs local dst
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (7 preceding siblings ...)
  2017-09-22 21:06 ` [RFC PATCH 08/11] net: implement local route cache inside ifaddr Paolo Abeni
@ 2017-09-22 21:06 ` Paolo Abeni
  2017-09-22 21:58 ` [RFC PATCH 00/11] udp: full early demux for unconnected sockets Eric Dumazet
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-22 21:06 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

For ipv4 also implement the proper source address validation, even
against martian addresses and return an error code accordingly.

Will be used by later patches to perform dst lookup in early
demux for unconnected sockets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/ip6_route.h |  1 +
 include/net/route.h     |  2 ++
 net/ipv4/route.c        | 43 +++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/route.c        | 13 +++++++++++++
 4 files changed, 59 insertions(+)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ee96f402cb75..edb24456a609 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -65,6 +65,7 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr)
 		(IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | IPV6_ADDR_LOOPBACK);
 }
 
+void ip6_route_try_local_rcu_bh(struct net *net, struct sk_buff *skb);
 void ip6_route_input(struct sk_buff *skb);
 struct dst_entry *ip6_route_input_lookup(struct net *net,
 					 struct net_device *dev,
diff --git a/include/net/route.h b/include/net/route.h
index ec09c3d73581..21927231cc14 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -178,6 +178,8 @@ static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4
 
 struct rtable *ip_local_route_alloc(struct net_device *dev, unsigned int flags,
 				    u32 itag, unsigned char type, bool docache);
+int ip_route_try_local_rcu(struct net *net, struct sk_buff *skb,
+			   const struct iphdr *iph);
 int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
 			 u8 tos, struct net_device *devin);
 int ip_route_input_rcu(struct sk_buff *skb, __be32 dst, __be32 src,
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 515589f1b3d1..84248dd41da6 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2079,6 +2079,49 @@ out:	return err;
 	goto out;
 }
 
+/* try to resolve and set the route for the ingress packet in the local
+ * destination, looking-up the destination address against the local ones
+ * and performing source validation
+ * return an error only if the local look up is successful and validation fails
+ * Called under RCU
+ */
+int ip_route_try_local_rcu(struct net *net, struct sk_buff *skb,
+			   const struct iphdr *iph)
+{
+	__be32 saddr = iph->saddr;
+	struct in_device *in_dev;
+	struct dst_entry *dst;
+	int err = -EINVAL;
+	u32 itag;
+
+	dst = inet_get_ifaddr_dst_rcu(net, iph->daddr);
+	if (!dst)
+		return 0;
+
+	in_dev = __in_dev_get_rcu(skb->dev);
+	if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr))
+		goto martian_source;
+
+	/* check for zeronet only after successful lookup, so that we don't trip
+	 * over limited broadcast destination, see ip_route_input_slow()
+	 */
+	if (ipv4_is_zeronet(saddr) || (ipv4_is_loopback(saddr) &&
+				       !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net)))
+		goto martian_source;
+
+	err = fib_validate_source(skb, saddr, iph->daddr, iph->tos, 0, skb->dev,
+				  in_dev, &itag);
+	if (err < 0)
+		goto martian_source;
+
+	skb_dst_set_noref(skb, dst);
+	return 0;
+
+martian_source:
+	ip_handle_martian_source(skb->dev, in_dev, skb, iph->daddr, iph->saddr);
+	return err;
+}
+
 int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			 u8 tos, struct net_device *dev)
 {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 26cc9f483b6d..d957e30b1cbe 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1283,6 +1283,19 @@ void ip6_route_input(struct sk_buff *skb)
 	skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags));
 }
 
+/* try to resolve and set the route for the ingress packet in the local
+ * destination
+ * Called under RCU
+ */
+void ip6_route_try_local_rcu_bh(struct net *net, struct sk_buff *skb)
+{
+	struct dst_entry *dst;
+
+	dst = inet6_get_ifaddr_dst_rcu_bh(net, &ipv6_hdr(skb)->daddr);
+	if (dst)
+		skb_dst_set_noref(skb, dst);
+}
+
 static struct rt6_info *ip6_pol_route_output(struct net *net, struct fib6_table *table,
 					     struct flowi6 *fl6, int flags)
 {
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (8 preceding siblings ...)
  2017-09-22 21:06 ` [RFC PATCH 09/11] route: add ipv4/6 helpers to do partial route lookup vs local dst Paolo Abeni
@ 2017-09-22 21:58 ` Eric Dumazet
  2017-09-25 20:26   ` Paolo Abeni
  2017-09-26 20:18 ` [RFC PATCH 10/11] IP: early demux can return an error code Paolo Abeni
  2017-09-26 20:18 ` [RFC PATCH 11/11] udp: dst lookup in early demux for unconnected sockets Paolo Abeni
  11 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2017-09-22 21:58 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> This series refactor the UDP early demux code so that:
> 
> * full socket lookup is performed for unicast packets
> * a sk is grabbed even for unconnected socket match
> * a dst cache is used even in such scenario
> 
> To perform this tasks a couple of facilities are added:
> 
> * noref socket references, scoped inside the current RCU section, to be
>   explicitly cleared before leaving such section
> * a dst cache inside the inet and inet6 local addresses tables, caching the
>   related local dst entry
> 
> The measured performance gain under small packet UDP flood is as follow:
> 
> ingress NIC	vanilla		patched		delta
> rx queues	(kpps)		(kpps)		(%)
> [ipv4]
> 1		2177		2414		10
> 2		2527		2892		14
> 3		3050		3733		22


This is a clear sign your program is not using latest SO_REUSEPORT +
[ec]BPF filter [1]

return socket[RX_QUEUE# | or CPU#];

If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
based on a lazy hash, meaning that you do not have proper siloing.

return socket[hash(skb)];

Multiple cpus can then :
 - compete on grabbing same socket refcount
 - compete on grabbing the receive queue lock
 - compete for releasing lock and socket refcount
 - skb freeing done on different cpus than where allocated.

You are adding complexity to the kernel because you are using a
sub-optimal user space program, favoring false sharing.

First solve the false sharing issue.

Performance with 2 rx queues should be almost twice the performance with
1 rx queue.

Then we can see if the gains you claim are still applicable.

Thanks

PS: Wei Wan is about to release the IPV6 changes so that the big
differences you showed are going to disappear soon.

Refs [1]

tools/testing/selftests/net/reuseport_bpf.c

6a5ef90c58daada158ba16ba330558efc3471491 Merge branch 'faster-soreuseport'
3ca8e4029969d40ab90e3f1ecd83ab1cadd60fbb soreuseport: BPF selection functional test
538950a1b7527a0a52ccd9337e3fcd304f027f13 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF
e32ea7e747271a0abcd37e265005e97cc81d9df5 soreuseport: fast reuseport UDP socket selection
ef456144da8ef507c8cf504284b6042e9201a05c soreuseport: define reuseport groups

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
  2017-09-22 21:58 ` [RFC PATCH 00/11] udp: full early demux for unconnected sockets Eric Dumazet
@ 2017-09-25 20:26   ` Paolo Abeni
  0 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-25 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactor the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform this tasks a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small packet UDP flood is as follow:
> > 
> > ingress NIC	vanilla		patched		delta
> > rx queues	(kpps)		(kpps)		(%)
> > [ipv4]
> > 1		2177		2414		10
> > 2		2527		2892		14
> > 3		3050		3733		22
> 
> 
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple cpus can then :
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.

Here are the performance results using a BPF filter to distribute the
ingress packet to the reuseport socket with the same id of the ingress
CPU - we have 1 to 1 mapping between the ingress receive queue and the
destination socket:

ingress NIC     vanilla         patched         delta
rx queues       (kpps)          (kpps)          (%)
[ipv4]
2               3020                3663                21
3               4352                5179                19
4               5318                6194                16
5               6258                7583                21
6               7376                8558                16

[ipv6]
2               2446                3949                61
3               3099                5092                64
4               3698                6611                78
5               4382                7852                79
6               5116                8851                73

Sone notes:

- figures obtained with: 

ethtool  -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink  --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        taskset -p $((MASK << ($I + $n) )) $!
done

- in the IPv6 routing code we currently have a relevant bottle-neck in
ip6_pol_route(), I see a lot of contention on a dst refcount, so
without early demux the performances do not scale well there.

- For maximum performances BH and user space sink need to run on
difference CPUs - yes we have some more cacheline misses and a little
contention on the receive queue spin lock, but a lot less icache misses
and more CPU cycles available, the overall tput is a lot higher than
binding on the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 10/11] IP: early demux can return an error code
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (9 preceding siblings ...)
  2017-09-22 21:58 ` [RFC PATCH 00/11] udp: full early demux for unconnected sockets Eric Dumazet
@ 2017-09-26 20:18 ` Paolo Abeni
  2017-09-26 20:18 ` [RFC PATCH 11/11] udp: dst lookup in early demux for unconnected sockets Paolo Abeni
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-26 20:18 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

it will used by later patch to cope with unconnected sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This patch and the next one did not land on the ML previously
due to PEBKAC, appending now to give the complete picture of
this RFC series.
Side note: currently the early demux lookup for mcast sockets
does not perform source address validation and we need (also)
something like this commit to fix the issue without causing
large performance regressions.
---
 include/net/protocol.h |  4 ++--
 include/net/tcp.h      |  2 +-
 include/net/udp.h      |  2 +-
 net/ipv4/ip_input.c    | 25 +++++++++++++++----------
 net/ipv4/tcp_ipv4.c    |  9 +++++----
 net/ipv4/udp.c         | 11 ++++++-----
 6 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/include/net/protocol.h b/include/net/protocol.h
index 65ba335b0e7e..4fc75f7ae23b 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -39,8 +39,8 @@
 
 /* This is used to register protocols. */
 struct net_protocol {
-	void			(*early_demux)(struct sk_buff *skb);
-	void                    (*early_demux_handler)(struct sk_buff *skb);
+	int			(*early_demux)(struct sk_buff *skb);
+	int			(*early_demux_handler)(struct sk_buff *skb);
 	int			(*handler)(struct sk_buff *skb);
 	void			(*err_handler)(struct sk_buff *skb, u32 info);
 	unsigned int		no_policy:1,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 49a8a46466f3..cf0bb918c52d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -345,7 +345,7 @@ void tcp_v4_err(struct sk_buff *skb, u32);
 
 void tcp_shutdown(struct sock *sk, int how);
 
-void tcp_v4_early_demux(struct sk_buff *skb);
+int tcp_v4_early_demux(struct sk_buff *skb);
 int tcp_v4_rcv(struct sk_buff *skb);
 
 int tcp_v4_tw_remember_stamp(struct inet_timewait_sock *tw);
diff --git a/include/net/udp.h b/include/net/udp.h
index 12dfbfe2e2d7..6c759c8594e2 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -259,7 +259,7 @@ static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int flags,
 	return __skb_recv_udp(sk, flags, noblock, &peeked, &off, err);
 }
 
-void udp_v4_early_demux(struct sk_buff *skb);
+int udp_v4_early_demux(struct sk_buff *skb);
 bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 5690ef09da28..f172be87674f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -311,9 +311,10 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
-	struct rtable *rt;
+	int (*edemux)(struct sk_buff *skb);
 	struct net_device *dev = skb->dev;
-	void (*edemux)(struct sk_buff *skb);
+	struct rtable *rt;
+	int err;
 
 	/* if ingress device is enslaved to an L3 master device pass the
 	 * skb to its handler for processing
@@ -331,7 +332,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 
 		ipprot = rcu_dereference(inet_protos[protocol]);
 		if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) {
-			edemux(skb);
+			err = edemux(skb);
+			if (unlikely(err))
+				goto drop_error;
 			/* must reload iph, skb->head might have changed */
 			iph = ip_hdr(skb);
 		}
@@ -342,13 +345,10 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	 *	how the packet travels inside Linux networking.
 	 */
 	if (!skb_valid_dst(skb)) {
-		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-					       iph->tos, dev);
-		if (unlikely(err)) {
-			if (err == -EXDEV)
-				__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
-			goto drop;
-		}
+		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+					   iph->tos, dev);
+		if (unlikely(err))
+			goto drop_error;
 	}
 
 	/* Since the sk has no reference to the socket, we must
@@ -407,6 +407,11 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 drop:
 	kfree_skb(skb);
 	return NET_RX_DROP;
+
+drop_error:
+	if (err == -EXDEV)
+		__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
+	goto drop;
 }
 
 /*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d9416b5162bc..85164d4d3e53 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1503,23 +1503,23 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_v4_do_rcv);
 
-void tcp_v4_early_demux(struct sk_buff *skb)
+int tcp_v4_early_demux(struct sk_buff *skb)
 {
 	const struct iphdr *iph;
 	const struct tcphdr *th;
 	struct sock *sk;
 
 	if (skb->pkt_type != PACKET_HOST)
-		return;
+		return 0;
 
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct tcphdr)))
-		return;
+		return 0;
 
 	iph = ip_hdr(skb);
 	th = tcp_hdr(skb);
 
 	if (th->doff < sizeof(struct tcphdr) / 4)
-		return;
+		return 0;
 
 	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
 				       iph->saddr, th->source,
@@ -1538,6 +1538,7 @@ void tcp_v4_early_demux(struct sk_buff *skb)
 				skb_dst_set_noref(skb, dst);
 		}
 	}
+	return 0;
 }
 
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5cbbd78024dc..b7202a15f360 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2215,7 +2215,7 @@ void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie)
 }
 EXPORT_SYMBOL_GPL(udp_set_skb_rx_dst);
 
-void udp_v4_early_demux(struct sk_buff *skb)
+int udp_v4_early_demux(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
 	int dif = skb->dev->ifindex;
@@ -2227,7 +2227,7 @@ void udp_v4_early_demux(struct sk_buff *skb)
 
 	/* validate the packet */
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct udphdr)))
-		return;
+		return 0;
 
 	iph = ip_hdr(skb);
 	uh = udp_hdr(skb);
@@ -2237,14 +2237,14 @@ void udp_v4_early_demux(struct sk_buff *skb)
 		struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
 
 		if (!in_dev)
-			return;
+			return 0;
 
 		/* we are supposed to accept bcast packets */
 		if (skb->pkt_type == PACKET_MULTICAST) {
 			ours = ip_check_mc_rcu(in_dev, iph->daddr, iph->saddr,
 					       iph->protocol);
 			if (!ours)
-				return;
+				return 0;
 		}
 
 		sk = __udp4_lib_mcast_demux_lookup(net, uh->dest, iph->daddr,
@@ -2256,11 +2256,12 @@ void udp_v4_early_demux(struct sk_buff *skb)
 	}
 
 	if (!sk)
-		return;
+		return 0;
 
 	skb_set_noref_sk(skb, sk);
 	if (udp_use_rx_dst_cache(sk, skb))
 		udp_set_skb_rx_dst(sk, skb, 0);
+	return 0;
 }
 
 int udp_rcv(struct sk_buff *skb)
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 11/11] udp: dst lookup in early demux for unconnected sockets
  2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
                   ` (10 preceding siblings ...)
  2017-09-26 20:18 ` [RFC PATCH 10/11] IP: early demux can return an error code Paolo Abeni
@ 2017-09-26 20:18 ` Paolo Abeni
  11 siblings, 0 replies; 14+ messages in thread
From: Paolo Abeni @ 2017-09-26 20:18 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa

Now that all the helper are in place we can leverage them to
fetch the appropriate destination for unconnected sockets,
looking up the ifaddr cache if not local bind are not allowed.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 10 ++++++++--
 net/ipv6/udp.c |  9 ++++++++-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b7202a15f360..87d48101efe2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2259,9 +2259,15 @@ int udp_v4_early_demux(struct sk_buff *skb)
 		return 0;
 
 	skb_set_noref_sk(skb, sk);
-	if (udp_use_rx_dst_cache(sk, skb))
+	if (udp_use_rx_dst_cache(sk, skb)) {
 		udp_set_skb_rx_dst(sk, skb, 0);
-	return 0;
+		return 0;
+	}
+
+	if (net->ipv4.sysctl_ip_nonlocal_bind)
+		return 0;
+
+	return ip_route_try_local_rcu(net, skb, iph);
 }
 
 int udp_rcv(struct sk_buff *skb)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 67d340679c3a..360d54bc2359 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -935,8 +935,15 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	skb_set_noref_sk(skb, sk);
-	if (udp6_use_rx_dst_cache(sk))
+	if (udp6_use_rx_dst_cache(sk)) {
 		udp_set_skb_rx_dst(sk, skb, inet6_sk(sk)->rx_dst_cookie);
+		return;
+	}
+
+	if (net->ipv6.sysctl.ip_nonlocal_bind)
+		return;
+
+	ip6_route_try_local_rcu_bh(net, skb);
 }
 
 static __inline__ int udpv6_rcv(struct sk_buff *skb)
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-09-26 20:19 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 01/11] net: add support for noref skb->sk Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 02/11] net: allow early demux to fetch noref socket Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 03/11] udp: do not touch socket refcount in early demux Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 04/11] net: add simple socket-like dst cache helpers Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 05/11] udp: perform full socket lookup in early demux Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 06/11] ip/route: factor out helper for local route creation Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 07/11] ipv6/addrconf: add an helper for inet6 address lookup Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 08/11] net: implement local route cache inside ifaddr Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 09/11] route: add ipv4/6 helpers to do partial route lookup vs local dst Paolo Abeni
2017-09-22 21:58 ` [RFC PATCH 00/11] udp: full early demux for unconnected sockets Eric Dumazet
2017-09-25 20:26   ` Paolo Abeni
2017-09-26 20:18 ` [RFC PATCH 10/11] IP: early demux can return an error code Paolo Abeni
2017-09-26 20:18 ` [RFC PATCH 11/11] udp: dst lookup in early demux for unconnected sockets Paolo Abeni

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.