* [PATCH v3 net-next 0/4] tcp: better smp listener behavior
@ 2015-10-09 2:33 Eric Dumazet
2015-10-09 2:33 ` [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support Eric Dumazet
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Eric Dumazet @ 2015-10-09 2:33 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
As promised in the last patch series, we implement a better SO_REUSEPORT
strategy, based on cpu affinities if selected by the application.

We also moved sk_refcnt out of the cache line containing the lookup
keys, as it was considerably slowing down smp operations because
of false sharing. This was simpler than converting listen sockets
to conventional RCU (to avoid sk_refcnt dirtying).

My test server could process 6.0 Mpps of SYN packets instead of 4.2 Mpps.
Eric Dumazet (4):
net: SO_INCOMING_CPU setsockopt() support
net: align sk_refcnt on 128 bytes boundary
net: shrink struct sock and request_sock by 8 bytes
tcp: shrink tcp_timewait_sock by 8 bytes
include/linux/tcp.h | 4 ++--
include/net/inet_timewait_sock.h | 2 +-
include/net/request_sock.h | 7 +++----
include/net/sock.h | 41 +++++++++++++++++++++++++++-------------
net/core/sock.c | 5 +++++
net/ipv4/inet_hashtables.c | 2 ++
net/ipv4/syncookies.c | 4 ++--
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 18 +++++++++---------
net/ipv4/tcp_output.c | 2 +-
net/ipv4/udp.c | 6 +++++-
net/ipv6/inet6_hashtables.c | 2 ++
net/ipv6/syncookies.c | 4 ++--
net/ipv6/tcp_ipv6.c | 2 +-
net/ipv6/udp.c | 11 +++++++----
16 files changed, 72 insertions(+), 42 deletions(-)
--
2.6.0.rc2.230.g3dd15c0
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support
2015-10-09 2:33 [PATCH v3 net-next 0/4] tcp: better smp listener behavior Eric Dumazet
@ 2015-10-09 2:33 ` Eric Dumazet
2015-10-09 3:40 ` Tom Herbert
2015-10-09 2:33 ` [PATCH v3 net-next 2/4] net: align sk_refcnt on 128 bytes boundary Eric Dumazet
` (4 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2015-10-09 2:33 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
SO_INCOMING_CPU, as added in commit 2c8c56e15df3, was a getsockopt() command
to fetch the cpu handling incoming packets for a particular TCP flow, after accept().

This commit adds setsockopt() support and extends the SO_REUSEPORT selection
logic: if a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if the CPU handling the packet matches the
specified one.

This makes it possible to build very efficient TCP servers, using one listener
per RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.

This provides optimal NUMA behavior and keeps cpu caches hot.

Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. The following patch puts sk_refcnt in a different cache line
so that this iteration hits only shared, read-mostly cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/sock.h | 10 ++++------
net/core/sock.c | 5 +++++
net/ipv4/inet_hashtables.c | 2 ++
net/ipv4/udp.c | 6 +++++-
net/ipv6/inet6_hashtables.c | 2 ++
net/ipv6/udp.c | 11 +++++++----
6 files changed, 25 insertions(+), 11 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index dfe2eb8e1132..08abffe32236 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -150,6 +150,7 @@ typedef __u64 __bitwise __addrpair;
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
* @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_incoming_cpu: record/match cpu processing incoming packets
* @skc_refcnt: reference count
*
* This is the minimal network layer representation of sockets, the header
@@ -212,6 +213,8 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
+ int skc_incoming_cpu;
+
atomic_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
@@ -274,7 +277,6 @@ struct cg_proto;
* @sk_rcvtimeo: %SO_RCVTIMEO setting
* @sk_sndtimeo: %SO_SNDTIMEO setting
* @sk_rxhash: flow hash received from netif layer
- * @sk_incoming_cpu: record cpu processing incoming packets
* @sk_txhash: computed flow hash for use on transmit
* @sk_filter: socket filtering instructions
* @sk_timer: sock cleanup timer
@@ -331,6 +333,7 @@ struct sock {
#define sk_v6_daddr __sk_common.skc_v6_daddr
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
#define sk_cookie __sk_common.skc_cookie
+#define sk_incoming_cpu __sk_common.skc_incoming_cpu
socket_lock_t sk_lock;
struct sk_buff_head sk_receive_queue;
@@ -353,11 +356,6 @@ struct sock {
#ifdef CONFIG_RPS
__u32 sk_rxhash;
#endif
- u16 sk_incoming_cpu;
- /* 16bit hole
- * Warned : sk_incoming_cpu can be set from softirq,
- * Do not use this hole without fully understanding possible issues.
- */
__u32 sk_txhash;
#ifdef CONFIG_NET_RX_BUSY_POLL
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dd1263e4c24..1071f9380250 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -988,6 +988,10 @@ set_rcvbuf:
sk->sk_max_pacing_rate);
break;
+ case SO_INCOMING_CPU:
+ sk->sk_incoming_cpu = val;
+ break;
+
default:
ret = -ENOPROTOOPT;
break;
@@ -2353,6 +2357,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_max_pacing_rate = ~0U;
sk->sk_pacing_rate = ~0U;
+ sk->sk_incoming_cpu = -1;
/*
* Before updating sk_refcnt, we must commit prior changes to memory
* (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index bed8886a4b6c..08643a3616af 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -185,6 +185,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score += 4;
}
+ if (sk->sk_incoming_cpu == raw_smp_processor_id())
+ score++;
}
return score;
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1fc129099ea..24ec14f9825c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -375,7 +375,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score += 4;
}
-
+ if (sk->sk_incoming_cpu == raw_smp_processor_id())
+ score++;
return score;
}
@@ -419,6 +420,9 @@ static inline int compute_score2(struct sock *sk, struct net *net,
score += 4;
}
+ if (sk->sk_incoming_cpu == raw_smp_processor_id())
+ score++;
+
return score;
}
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 6ac8dad0138a..21ace5a2bf7c 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -114,6 +114,8 @@ static inline int compute_score(struct sock *sk, struct net *net,
return -1;
score++;
}
+ if (sk->sk_incoming_cpu == raw_smp_processor_id())
+ score++;
}
return score;
}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 0aba654f5b91..01bcb49619ee 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -182,10 +182,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
score++;
}
+ if (sk->sk_incoming_cpu == raw_smp_processor_id())
+ score++;
+
return score;
}
-#define SCORE2_MAX (1 + 1 + 1)
static inline int compute_score2(struct sock *sk, struct net *net,
const struct in6_addr *saddr, __be16 sport,
const struct in6_addr *daddr,
@@ -223,6 +225,9 @@ static inline int compute_score2(struct sock *sk, struct net *net,
score++;
}
+ if (sk->sk_incoming_cpu == raw_smp_processor_id())
+ score++;
+
return score;
}
@@ -251,8 +256,7 @@ begin:
hash = udp6_ehashfn(net, daddr, hnum,
saddr, sport);
matches = 1;
- } else if (score == SCORE2_MAX)
- goto exact_match;
+ }
} else if (score == badness && reuseport) {
matches++;
if (reciprocal_scale(hash, matches) == 0)
@@ -269,7 +273,6 @@ begin:
goto begin;
if (result) {
-exact_match:
if (unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
result = NULL;
else if (unlikely(compute_score2(result, net, saddr, sport,
--
2.6.0.rc2.230.g3dd15c0
* [PATCH v3 net-next 2/4] net: align sk_refcnt on 128 bytes boundary
2015-10-09 2:33 [PATCH v3 net-next 0/4] tcp: better smp listener behavior Eric Dumazet
2015-10-09 2:33 ` [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support Eric Dumazet
@ 2015-10-09 2:33 ` Eric Dumazet
2015-10-09 2:33 ` [PATCH v3 net-next 3/4] net: shrink struct sock and request_sock by 8 bytes Eric Dumazet
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2015-10-09 2:33 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
sk->sk_refcnt is dirtied for every incoming TCP/UDP packet.
This is a performance issue if multiple cpus hit a common socket,
or if multiple sockets are chained due to SO_REUSEPORT.

By moving sk_refcnt 8 bytes further, the first 128 bytes of a socket
become mostly read. As they contain the lookup keys, this has
a considerable performance impact, as cpus can cache them.

These 8 bytes are not wasted; we use them as a placeholder
for various fields, depending on the socket type.
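The layout trick can be sketched in plain C (an illustrative mock-up with
made-up field names, not the kernel's actual struct sock_common; assumes a
64-bit build):

```c
#include <stddef.h>

/* Keep the read-mostly lookup keys in the first two 64-byte cache
 * lines, reuse the 8 bytes just before the refcount as a
 * type-dependent union, and let the write-hot refcount start at
 * offset 128 so it no longer shares a cache line with the keys. */
struct demo_sock_common {
	unsigned long lookup_keys[15];	/* 120 bytes, read-mostly */
	union {				/* the 8 "padding" bytes, reused */
		unsigned long flags;	/* full sockets */
		void *listener;		/* request_sock */
		void *tw_death_row;	/* inet_timewait_sock */
	};
	int refcnt;			/* dirtied per packet, offset 128 */
};
```

On LP64, offsetof(struct demo_sock_common, refcnt) is 128, so lookups that
only read the keys never pull in the cache line being dirtied by refcounting.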
Tested:
SYN flood hitting a NIC with 16 RX queues.
TCP listener using 16 sockets, with SO_REUSEPORT
and SO_INCOMING_CPU for proper siloing.

Could process 6.0 Mpps of SYN packets instead of 4.2 Mpps.

The kernel profile looked like:
11.68% [kernel] [k] sha_transform
6.51% [kernel] [k] __inet_lookup_listener
5.07% [kernel] [k] __inet_lookup_established
4.15% [kernel] [k] memcpy_erms
3.46% [kernel] [k] ipt_do_table
2.74% [kernel] [k] fib_table_lookup
2.54% [kernel] [k] tcp_make_synack
2.34% [kernel] [k] tcp_conn_request
2.05% [kernel] [k] __netif_receive_skb_core
2.03% [kernel] [k] kmem_cache_alloc
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/inet_timewait_sock.h | 2 +-
include/net/request_sock.h | 2 +-
include/net/sock.h | 17 ++++++++++++++---
3 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 186f3a1e1b1f..e581fc69129d 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -70,6 +70,7 @@ struct inet_timewait_sock {
#define tw_dport __tw_common.skc_dport
#define tw_num __tw_common.skc_num
#define tw_cookie __tw_common.skc_cookie
+#define tw_dr __tw_common.skc_tw_dr
int tw_timeout;
volatile unsigned char tw_substate;
@@ -88,7 +89,6 @@ struct inet_timewait_sock {
kmemcheck_bitfield_end(flags);
struct timer_list tw_timer;
struct inet_bind_bucket *tw_tb;
- struct inet_timewait_death_row *tw_dr;
};
#define tw_tclass tw_tos
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 95ab5d7aab96..6b818b77d5e5 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -50,9 +50,9 @@ struct request_sock {
struct sock_common __req_common;
#define rsk_refcnt __req_common.skc_refcnt
#define rsk_hash __req_common.skc_hash
+#define rsk_listener __req_common.skc_listener
struct request_sock *dl_next;
- struct sock *rsk_listener;
u16 mss;
u8 num_retrans; /* number of retransmits */
u8 cookie_ts:1; /* syncookie: encode tcpopts in timestamp */
diff --git a/include/net/sock.h b/include/net/sock.h
index 08abffe32236..a7818104a73f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -150,6 +150,9 @@ typedef __u64 __bitwise __addrpair;
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
* @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_flags: place holder for sk_flags
+ * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
+ * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @skc_incoming_cpu: record/match cpu processing incoming packets
* @skc_refcnt: reference count
*
@@ -201,6 +204,16 @@ struct sock_common {
atomic64_t skc_cookie;
+ /* following fields are padding to force
+ * offset(struct sock, sk_refcnt) == 128 on 64bit arches
+ * assuming IPV6 is enabled. We use this padding differently
+ * for different kind of 'sockets'
+ */
+ union {
+ unsigned long skc_flags;
+ struct sock *skc_listener; /* request_sock */
+ struct inet_timewait_death_row *skc_tw_dr; /* inet_timewait_sock */
+ };
/*
* fields between dontcopy_begin/dontcopy_end
* are not copied in sock_copy()
@@ -246,8 +259,6 @@ struct cg_proto;
* @sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
* @sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
* @sk_sndbuf: size of send buffer in bytes
- * @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
- * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
* @sk_no_check_rx: allow zero checksum in RX packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
@@ -334,6 +345,7 @@ struct sock {
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
#define sk_cookie __sk_common.skc_cookie
#define sk_incoming_cpu __sk_common.skc_incoming_cpu
+#define sk_flags __sk_common.skc_flags
socket_lock_t sk_lock;
struct sk_buff_head sk_receive_queue;
@@ -371,7 +383,6 @@ struct sock {
#ifdef CONFIG_XFRM
struct xfrm_policy *sk_policy[2];
#endif
- unsigned long sk_flags;
struct dst_entry *sk_rx_dst;
struct dst_entry __rcu *sk_dst_cache;
spinlock_t sk_dst_lock;
--
2.6.0.rc2.230.g3dd15c0
* [PATCH v3 net-next 3/4] net: shrink struct sock and request_sock by 8 bytes
2015-10-09 2:33 [PATCH v3 net-next 0/4] tcp: better smp listener behavior Eric Dumazet
2015-10-09 2:33 ` [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support Eric Dumazet
2015-10-09 2:33 ` [PATCH v3 net-next 2/4] net: align sk_refcnt on 128 bytes boundary Eric Dumazet
@ 2015-10-09 2:33 ` Eric Dumazet
2015-10-09 2:33 ` [PATCH v3 net-next 4/4] tcp: shrink tcp_timewait_sock " Eric Dumazet
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2015-10-09 2:33 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
One 32bit hole follows skc_refcnt; use it.

skc_incoming_cpu can also be placed in a union with the request_sock rcv_wnd.
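A minimal C sketch (hypothetical struct names, not the kernel structs) of why
overlapping two mutually exclusive fields in a union saves space:

```c
/* incoming_cpu only matters for full sockets and rcv_wnd only for
 * request socks, so the two values never coexist and can share
 * storage. Hypothetical illustration only. */
struct fields_separate {
	int incoming_cpu;
	unsigned int rcv_wnd;
};

struct fields_overlapped {
	union {
		int incoming_cpu;	/* full sockets */
		unsigned int rcv_wnd;	/* request socks */
	};
};
```

Two such unions (this one, plus the rxhash/window_clamp one in the diff below)
recover the 8 bytes that the previous patch spent on alignment.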
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/request_sock.h | 5 ++---
include/net/sock.h | 14 +++++++++-----
net/ipv4/syncookies.c | 4 ++--
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/tcp_minisocks.c | 18 +++++++++---------
net/ipv4/tcp_output.c | 2 +-
net/ipv6/syncookies.c | 4 ++--
net/ipv6/tcp_ipv6.c | 2 +-
9 files changed, 28 insertions(+), 25 deletions(-)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 6b818b77d5e5..2e73748956d5 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -51,15 +51,14 @@ struct request_sock {
#define rsk_refcnt __req_common.skc_refcnt
#define rsk_hash __req_common.skc_hash
#define rsk_listener __req_common.skc_listener
+#define rsk_window_clamp __req_common.skc_window_clamp
+#define rsk_rcv_wnd __req_common.skc_rcv_wnd
struct request_sock *dl_next;
u16 mss;
u8 num_retrans; /* number of retransmits */
u8 cookie_ts:1; /* syncookie: encode tcpopts in timestamp */
u8 num_timeout:7; /* number of timeouts */
- /* The following two fields can be easily recomputed I think -AK */
- u32 window_clamp; /* window clamp at creation time */
- u32 rcv_wnd; /* rcv_wnd offered first time */
u32 ts_recent;
struct timer_list rsk_timer;
const struct request_sock_ops *rsk_ops;
diff --git a/include/net/sock.h b/include/net/sock.h
index a7818104a73f..fce12399fad4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -226,11 +226,18 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
int skc_tx_queue_mapping;
- int skc_incoming_cpu;
+ union {
+ int skc_incoming_cpu;
+ u32 skc_rcv_wnd;
+ };
atomic_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
+ union {
+ u32 skc_rxhash;
+ u32 skc_window_clamp;
+ };
/* public: */
};
@@ -287,7 +294,6 @@ struct cg_proto;
* @sk_rcvlowat: %SO_RCVLOWAT setting
* @sk_rcvtimeo: %SO_RCVTIMEO setting
* @sk_sndtimeo: %SO_SNDTIMEO setting
- * @sk_rxhash: flow hash received from netif layer
* @sk_txhash: computed flow hash for use on transmit
* @sk_filter: socket filtering instructions
* @sk_timer: sock cleanup timer
@@ -346,6 +352,7 @@ struct sock {
#define sk_cookie __sk_common.skc_cookie
#define sk_incoming_cpu __sk_common.skc_incoming_cpu
#define sk_flags __sk_common.skc_flags
+#define sk_rxhash __sk_common.skc_rxhash
socket_lock_t sk_lock;
struct sk_buff_head sk_receive_queue;
@@ -365,9 +372,6 @@ struct sock {
} sk_backlog;
#define sk_rmem_alloc sk_backlog.rmem_alloc
int sk_forward_alloc;
-#ifdef CONFIG_RPS
- __u32 sk_rxhash;
-#endif
__u32 sk_txhash;
#ifdef CONFIG_NET_RX_BUSY_POLL
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8113c30ccf96..0769248bc0db 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -381,10 +381,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
}
/* Try to redo what tcp_v4_send_synack did. */
- req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+ req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
tcp_select_initial_window(tcp_full_space(sk), req->mss,
- &req->rcv_wnd, &req->window_clamp,
+ &req->rsk_rcv_wnd, &req->rsk_window_clamp,
ireq->wscale_ok, &rcv_wscale,
dst_metric(&rt->dst, RTAX_INITRWND));
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ddadb318e850..3b35c3f4d268 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6022,7 +6022,7 @@ static void tcp_openreq_init(struct request_sock *req,
{
struct inet_request_sock *ireq = inet_rsk(req);
- req->rcv_wnd = 0; /* So that tcp_send_synack() knows! */
+ req->rsk_rcv_wnd = 0; /* So that tcp_send_synack() knows! */
req->cookie_ts = 0;
tcp_rsk(req)->rcv_isn = TCP_SKB_CB(skb)->seq;
tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 34310748a365..ddb198392c7f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -803,7 +803,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
*/
tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
+ tcp_rsk(req)->rcv_nxt, req->rsk_rcv_wnd,
tcp_time_stamp,
req->ts_recent,
0,
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 9adf1e2c3170..85272bf50f6e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -381,18 +381,18 @@ void tcp_openreq_init_rwin(struct request_sock *req,
window_clamp = READ_ONCE(tp->window_clamp);
/* Set this up on the first call only */
- req->window_clamp = window_clamp ? : dst_metric(dst, RTAX_WINDOW);
+ req->rsk_window_clamp = window_clamp ? : dst_metric(dst, RTAX_WINDOW);
/* limit the window selection if the user enforce a smaller rx buffer */
if (sk_listener->sk_userlocks & SOCK_RCVBUF_LOCK &&
- (req->window_clamp > full_space || req->window_clamp == 0))
- req->window_clamp = full_space;
+ (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
+ req->rsk_window_clamp = full_space;
/* tcp_full_space because it is guaranteed to be the first packet */
tcp_select_initial_window(full_space,
mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
- &req->rcv_wnd,
- &req->window_clamp,
+ &req->rsk_rcv_wnd,
+ &req->rsk_window_clamp,
ireq->wscale_ok,
&rcv_wscale,
dst_metric(dst, RTAX_INITRWND));
@@ -512,9 +512,9 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
if (sysctl_tcp_fack)
tcp_enable_fack(newtp);
}
- newtp->window_clamp = req->window_clamp;
- newtp->rcv_ssthresh = req->rcv_wnd;
- newtp->rcv_wnd = req->rcv_wnd;
+ newtp->window_clamp = req->rsk_window_clamp;
+ newtp->rcv_ssthresh = req->rsk_rcv_wnd;
+ newtp->rcv_wnd = req->rsk_rcv_wnd;
newtp->rx_opt.wscale_ok = ireq->wscale_ok;
if (newtp->rx_opt.wscale_ok) {
newtp->rx_opt.snd_wscale = ireq->snd_wscale;
@@ -707,7 +707,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
/* RFC793: "first check sequence number". */
if (paws_reject || !tcp_in_window(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
- tcp_rsk(req)->rcv_nxt, tcp_rsk(req)->rcv_nxt + req->rcv_wnd)) {
+ tcp_rsk(req)->rcv_nxt, tcp_rsk(req)->rcv_nxt + req->rsk_rcv_wnd)) {
/* Out of window: send ACK and drop. */
if (!(flg & TCP_FLAG_RST))
req->rsk_ops->send_ack(sk, skb, req);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 55ed3266b05f..6e79fcb0addb 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3023,7 +3023,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
/* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
- th->window = htons(min(req->rcv_wnd, 65535U));
+ th->window = htons(min(req->rsk_rcv_wnd, 65535U));
tcp_options_write((__be32 *)(th + 1), NULL, &opts);
th->doff = (tcp_header_size >> 2);
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index f610b5310b17..bb8f2fa1c7fb 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -235,9 +235,9 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
goto out_free;
}
- req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
+ req->rsk_window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
tcp_select_initial_window(tcp_full_space(sk), req->mss,
- &req->rcv_wnd, &req->window_clamp,
+ &req->rsk_rcv_wnd, &req->rsk_window_clamp,
ireq->wscale_ok, &rcv_wscale,
dst_metric(dst, RTAX_INITRWND));
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 33334f0c217d..2887c8474b65 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -931,7 +931,7 @@ static void tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
*/
tcp_v6_send_ack(sk, skb, (sk->sk_state == TCP_LISTEN) ?
tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
+ tcp_rsk(req)->rcv_nxt, req->rsk_rcv_wnd,
tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
0, 0);
--
2.6.0.rc2.230.g3dd15c0
* [PATCH v3 net-next 4/4] tcp: shrink tcp_timewait_sock by 8 bytes
2015-10-09 2:33 [PATCH v3 net-next 0/4] tcp: better smp listener behavior Eric Dumazet
` (2 preceding siblings ...)
2015-10-09 2:33 ` [PATCH v3 net-next 3/4] net: shrink struct sock and request_sock by 8 bytes Eric Dumazet
@ 2015-10-09 2:33 ` Eric Dumazet
2015-10-09 4:16 ` [PATCH v3 net-next 0/4] tcp: better smp listener behavior Grant Zhang
2015-10-13 2:29 ` David Miller
5 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2015-10-09 2:33 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet
Reducing tcp_timewait_sock from 280 bytes to 272 bytes
allows SLAB to pack 15 objects per page instead of 14 (on x86).
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/tcp.h | 4 ++--
include/net/sock.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e442e6e9a365..86a7edaa6797 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -356,8 +356,8 @@ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
struct tcp_timewait_sock {
struct inet_timewait_sock tw_sk;
- u32 tw_rcv_nxt;
- u32 tw_snd_nxt;
+#define tw_rcv_nxt tw_sk.__tw_common.skc_tw_rcv_nxt
+#define tw_snd_nxt tw_sk.__tw_common.skc_tw_snd_nxt
u32 tw_rcv_wnd;
u32 tw_ts_offset;
u32 tw_ts_recent;
diff --git a/include/net/sock.h b/include/net/sock.h
index fce12399fad4..288934da0ae3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -229,6 +229,7 @@ struct sock_common {
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
+ u32 skc_tw_rcv_nxt; /* struct tcp_timewait_sock */
};
atomic_t skc_refcnt;
@@ -237,6 +238,7 @@ struct sock_common {
union {
u32 skc_rxhash;
u32 skc_window_clamp;
+ u32 skc_tw_snd_nxt; /* struct tcp_timewait_sock */
};
/* public: */
};
--
2.6.0.rc2.230.g3dd15c0
* Re: [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support
2015-10-09 2:33 ` [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support Eric Dumazet
@ 2015-10-09 3:40 ` Tom Herbert
2015-10-09 9:45 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Tom Herbert @ 2015-10-09 3:40 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet
On Thu, Oct 8, 2015 at 7:33 PM, Eric Dumazet <edumazet@google.com> wrote:
> SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command
> to fetch incoming cpu handling a particular TCP flow after accept()
>
> This commits adds setsockopt() support and extends SO_REUSEPORT selection
> logic : If a TCP listener or UDP socket has this option set, a packet is
> delivered to this socket only if CPU handling the packet matches the specified
> one.
>
> This allows to build very efficient TCP servers, using one listener per
> RX queue, as the associated TCP listener should only accept flows handled
> in softirq by the same cpu.
> This provides optimal NUMA behavior and keep cpu caches hot.
>
> Note that __inet_lookup_listener() still has to iterate over the list of
> all listeners. Following patch puts sk_refcnt in a different cache line
> to let this iteration hit only shared and read mostly cache lines.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> [...]
> @@ -251,8 +256,7 @@ begin:
> hash = udp6_ehashfn(net, daddr, hnum,
> saddr, sport);
> matches = 1;
> - } else if (score == SCORE2_MAX)
> - goto exact_match;
> + }
Do we care about losing this optimization? It's not done in IPv4, but I
can imagine there are arguments that address comparisons in IPv6 are more
expensive, hence this might make sense...
> [...]
* Re: [PATCH v3 net-next 0/4] tcp: better smp listener behavior
2015-10-09 2:33 [PATCH v3 net-next 0/4] tcp: better smp listener behavior Eric Dumazet
` (3 preceding siblings ...)
2015-10-09 2:33 ` [PATCH v3 net-next 4/4] tcp: shrink tcp_timewait_sock " Eric Dumazet
@ 2015-10-09 4:16 ` Grant Zhang
2015-10-09 10:53 ` Eric Dumazet
2015-10-13 2:29 ` David Miller
5 siblings, 1 reply; 10+ messages in thread
From: Grant Zhang @ 2015-10-09 4:16 UTC (permalink / raw)
To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
On 08/10/2015 19:33, Eric Dumazet wrote:
> As promised in last patch series, we implement a better SO_REUSEPORT
> strategy, based on cpu affinities if selected by the application.
>
> We also moved sk_refcnt out of the cache line containing the lookup
> keys, as it was considerably slowing down smp operations because
> of false sharing. This was simpler than converting listen sockets
> to conventional RCU (to avoid sk_refcnt dirtying)
>
> Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
>
> Eric Dumazet (4):
> net: SO_INCOMING_CPU setsockopt() support
> net: align sk_refcnt on 128 bytes boundary
> net: shrink struct sock and request_sock by 8 bytes
> tcp: shrink tcp_timewait_sock by 8 bytes
>
> include/linux/tcp.h | 4 ++--
> include/net/inet_timewait_sock.h | 2 +-
> include/net/request_sock.h | 7 +++----
> include/net/sock.h | 41 +++++++++++++++++++++++++++-------------
> net/core/sock.c | 5 +++++
> net/ipv4/inet_hashtables.c | 2 ++
> net/ipv4/syncookies.c | 4 ++--
> net/ipv4/tcp_input.c | 2 +-
> net/ipv4/tcp_ipv4.c | 2 +-
> net/ipv4/tcp_minisocks.c | 18 +++++++++---------
> net/ipv4/tcp_output.c | 2 +-
> net/ipv4/udp.c | 6 +++++-
> net/ipv6/inet6_hashtables.c | 2 ++
> net/ipv6/syncookies.c | 4 ++--
> net/ipv6/tcp_ipv6.c | 2 +-
> net/ipv6/udp.c | 11 +++++++----
> 16 files changed, 72 insertions(+), 42 deletions(-)
>
Eric,
Does it make sense to make the listener hash table percpu? A socket with
SO_INCOMING_CPU set could just be added to the hashtable for that specific
cpu.
Thanks,
Grant
* Re: [PATCH v3 net-next 1/4] net: SO_INCOMING_CPU setsockopt() support
2015-10-09 3:40 ` Tom Herbert
@ 2015-10-09 9:45 ` Eric Dumazet
0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2015-10-09 9:45 UTC (permalink / raw)
To: Tom Herbert; +Cc: Eric Dumazet, David S . Miller, netdev
On Thu, 2015-10-08 at 20:40 -0700, Tom Herbert wrote:
> Do we care about losing this optimization? It's not done in IPv4, but I
> can imagine there are arguments that address comparisons in
> IPv6 are more expensive, hence this might make sense...
I do not think we care. You removed the 'optimization' in IPv4 in commit
ba418fa357a7b ("soreuseport: UDP/IPv4 implementation") back in 2013 and
really no one noticed.
The important factor here is the number of cache lines taken to traverse
the list...
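The cache-line concern is also what motivates patch 2/4 (aligning sk_refcnt on a 128-byte boundary): a write-hot atomic counter sharing a line with the read-mostly lookup keys means every chain walk pulls a line dirtied by other CPUs. A toy userspace model of the layout idea, with invented field names rather than the real struct sock:

```c
/* Toy model, not kernel code: isolate the write-hot refcount on its own
 * 128-byte-aligned slot so readers scanning the lookup keys never
 * false-share with the atomic increments done on lookup hits. */
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_sock {
    /* read-mostly lookup keys, scanned during the chain walk */
    unsigned int saddr, daddr;
    unsigned short sport, dport;
    int incoming_cpu;
    /* written on every successful lookup: keep it off the keys' line */
    alignas(128) atomic_int refcnt;
};
```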
* Re: [PATCH v3 net-next 0/4] tcp: better smp listener behavior
2015-10-09 4:16 ` [PATCH v3 net-next 0/4] tcp: better smp listener behavior Grant Zhang
@ 2015-10-09 10:53 ` Eric Dumazet
0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2015-10-09 10:53 UTC (permalink / raw)
To: Grant Zhang; +Cc: Eric Dumazet, David S . Miller, netdev
On Thu, 2015-10-08 at 21:16 -0700, Grant Zhang wrote:
>
> Does it make sense to make the listener hash table percpu? A socket with
> SO_INCOMING_CPU set could just be added to the hashtable for that specific
> cpu.
Not sure: we plan to upstream a patch adding a soreuseport-specific
table to make the lookup time independent of the number of sockets bound
to one particular port. This simply adds an RCU-protected array, with the
ability to immediately fetch slot number X from this array.
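For context, the array idea can be sketched in plain C as below. reciprocal_scale() is the real helper from include/linux/kernel.h; the group struct and selection function are hypothetical stand-ins for whatever the future patch adds, with none of the RCU machinery:

```c
#include <stddef.h>
#include <stdint.h>

/* reciprocal_scale() as in include/linux/kernel.h: maps a 32-bit hash
 * uniformly onto [0, ep_ro) with a multiply instead of a modulus. */
static inline uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
    return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

struct toy_sock2 { int id; };

/* Hypothetical flat per-port group: slot selection is O(1) no matter
 * how many sockets are bound, unlike a scored hash-chain walk. */
struct toy_reuseport_group {
    uint32_t num_socks;
    struct toy_sock2 *socks[16];
};

static struct toy_sock2 *toy_select(const struct toy_reuseport_group *g,
                                    uint32_t hash)
{
    if (!g->num_socks)
        return NULL;
    return g->socks[reciprocal_scale(hash, g->num_socks)];
}
```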
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 net-next 0/4] tcp: better smp listener behavior
2015-10-09 2:33 [PATCH v3 net-next 0/4] tcp: better smp listener behavior Eric Dumazet
` (4 preceding siblings ...)
2015-10-09 4:16 ` [PATCH v3 net-next 0/4] tcp: better smp listener behavior Grant Zhang
@ 2015-10-13 2:29 ` David Miller
5 siblings, 0 replies; 10+ messages in thread
From: David Miller @ 2015-10-13 2:29 UTC (permalink / raw)
To: edumazet; +Cc: netdev, eric.dumazet
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 8 Oct 2015 19:33:20 -0700
> As promised in last patch series, we implement a better SO_REUSEPORT
> strategy, based on cpu affinities if selected by the application.
>
> We also moved sk_refcnt out of the cache line containing the lookup
> keys, as it was considerably slowing down smp operations because
> of false sharing. This was simpler than converting listen sockets
> to conventional RCU (to avoid sk_refcnt dirtying)
>
> Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
Just clarifying that I applied this v3 not v2 which I just replied
to by accident :-)