* [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
@ 2023-10-13 22:04 Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 01/11] tcp: Clean up reverse xmas tree in cookie_v[46]_check() Kuniyuki Iwashima
                   ` (13 more replies)
  0 siblings, 14 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Under a SYN flood, the TCP stack generates SYN Cookies to remain stateless
for each connection request until a valid ACK arrives for the SYN+ACK.

The cookie contains two kinds of host-specific bits, a timestamp and
secrets, so it can be validated only by the host that generated it.  This
means SYN Cookie consumes network resources between the client and the
server; intermediate nodes must remember which node to route the ACK for
the cookie to.

SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
backend server and completes another 3WHS.  However, since the server's
ISN differs from the cookie, the proxy must manage the ISN mappings and
fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
node goes down, all the connections through it go down too.  Keeping
state at the proxy is painful from that perspective.
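
The SEQ/ACK fix-up described above can be sketched in plain C as follows.
This is only an illustration of the arithmetic, not code from any real
proxy; struct isn_mapping and both helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-connection state a stateful SYN Proxy would keep. */
struct isn_mapping {
	uint32_t proxy_isn;	/* cookie the proxy sent to the client */
	uint32_t server_isn;	/* ISN the backend chose in its own SYN+ACK */
};

/* Server -> client: move SEQ into the sequence space the client saw. */
static uint32_t fixup_seq_to_client(const struct isn_mapping *m, uint32_t seq)
{
	/* Unsigned arithmetic wraps modulo 2^32, like TCP sequence numbers. */
	return (uint32_t)(seq - m->server_isn + m->proxy_isn);
}

/* Client -> server: move ACK into the server's sequence space. */
static uint32_t fixup_ack_to_server(const struct isn_mapping *m, uint32_t ack)
{
	return (uint32_t)(ack - m->proxy_isn + m->server_isn);
}
```

Every packet of every connection needs one of these translations, and the
mapping is lost if the node holding it dies, which is the cost the rest of
this letter tries to avoid.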

At AWS, we use a dirty hack to build a truly stateless SYN Proxy at scale.
Our SYN Proxy consists of a front proxy layer and a backend kernel
module.  (See the netconf slides [0], p6 - p15.)

The cookie that SYN Proxy generates differs from the kernel's cookie in
that it contains a secret (called rolling salt) (i) shared by all the proxy
nodes so that any node can validate ACK and (ii) updated periodically so
that old cookies cannot be validated.  Also, WScale, SACK, and ECN are
encoded in ISN rather than in TS val.  This avoids sacrificing connection
quality for customers who turn off the timestamp option because of an
old CVE.
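
A rough sketch of validation against a rolling salt, in plain C.  Several
things here are assumptions for illustration only: toy_hash() is a
non-cryptographic stand-in for a real keyed hash, all names are
hypothetical, and accepting the previous salt for one rotation period is
a guess at how in-flight handshakes could survive a rotation, not a
description of the AWS implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy mixer standing in for a keyed hash such as SipHash.  NOT secure. */
static uint32_t toy_hash(uint32_t salt, uint32_t saddr, uint32_t daddr,
			 uint16_t sport, uint16_t dport)
{
	uint32_t h = salt ^ 0x9e3779b9u;

	h = (h ^ saddr) * 0x85ebca6bu;
	h = (h ^ daddr) * 0xc2b2ae35u;
	h = (h ^ (((uint32_t)sport << 16) | dport)) * 0x27d4eb2fu;
	return h ^ (h >> 15);
}

/* Salt pair that every proxy node is assumed to share. */
struct rolling_salt {
	uint32_t cur;
	uint32_t prev;
};

/* Periodic update: the oldest salt is forgotten. */
static void salt_rotate(struct rolling_salt *s, uint32_t fresh)
{
	s->prev = s->cur;
	s->cur = fresh;
}

static uint32_t cookie_make(const struct rolling_salt *s, uint32_t saddr,
			    uint32_t daddr, uint16_t sport, uint16_t dport)
{
	return toy_hash(s->cur, saddr, daddr, sport, dport);
}

/* Any node holding the shared salts can validate the ACK. */
static bool cookie_valid(const struct rolling_salt *s, uint32_t cookie,
			 uint32_t saddr, uint32_t daddr,
			 uint16_t sport, uint16_t dport)
{
	return cookie == toy_hash(s->cur, saddr, daddr, sport, dport) ||
	       cookie == toy_hash(s->prev, saddr, daddr, sport, dport);
}
```

A cookie made under one salt stays valid across one rotation and is
rejected after the second, which matches property (ii) above.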

After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
server.  Our kernel module works at Netfilter input/output hooks and first
feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
for SYN+ACK, it looks up the corresponding request socket and overwrites
tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
complete 3WHS with the original ACK as is.

This way, our SYN Proxy does not manage the ISN mappings and can stay
stateless.  It works very well for high-bandwidth services handling
multiple Tbps, but we are looking for a way to drop the dirty hack and
further optimise the sequences.

If we could validate an arbitrary SYN Cookie on the backend server with
BPF, the proxy would not need to restore SYN or pass it.  After validating
ACK, the proxy node just needs to forward it, and then the server can do
the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
and create a connection from the ACK.

This series adds two SOCK_OPS hooks to generate and validate arbitrary
SYN Cookie.  Each hook is invoked if BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG has
been set on the listening socket in advance by bpf_sock_ops_cb_flags_set().

The user interface looks like this:

  BPF_SOCK_OPS_GEN_SYNCOOKIE_CB

    input
    |- bpf_sock_ops.sk           : 4-tuple
    |- bpf_sock_ops.skb          : TCP header
    |- bpf_sock_ops.args[0]      : MSS
    `- bpf_sock_ops.args[1]      : BPF_SYNCOOKIE_XXX flags

    output
    |- bpf_sock_ops.replylong[0] : ISN (SYN Cookie) ------.
    `- bpf_sock_ops.replylong[1] : TS value -----------.  |
                                                       |  |
  BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB                      |  |
                                                       |  |
    input                                              |  |
    |- bpf_sock_ops.sk           : 4-tuple             |  |
    |- bpf_sock_ops.skb          : TCP header          |  |
    |- bpf_sock_ops.args[0]      : ISN (SYN Cookie) <-----'
    `- bpf_sock_ops.args[1]      : TS value <----------'

    output
    |- bpf_sock_ops.replylong[0] : MSS
    `- bpf_sock_ops.replylong[1] : BPF_SYNCOOKIE_XXX flags

To establish a connection from SYN Cookie, the BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB
hook must set a valid MSS to bpf_sock_ops.replylong[0].  This means the
BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook must encode MSS into ISN or TS val so
that the validation hook can restore it.

If WScale, SACK, and ECN are detected to be available in the SYN packet,
the corresponding flags are passed to args[1] of
BPF_SOCK_OPS_GEN_SYNCOOKIE_CB so that the bpf prog need not parse the TCP
header.  The same flags can be set to replylong[1] of
BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB to enable each feature on the connection.
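
To make the data flow between the two hooks concrete, here is a userspace
C sketch of one possible cookie layout.  It is not a BPF program and not
what the selftest does; the bit layout, the SKETCH_* flag names (the real
BPF_SYNCOOKIE_XXX names are defined in a later patch), the MSS table, and
the "tag" input (assumed to be a keyed hash of the 4-tuple) are all
assumptions for illustration.

```c
#include <stdint.h>

/* Illustrative MSS table; the cookie stores an index, not the MSS itself. */
static const uint16_t mss_table[4] = { 536, 1300, 1440, 1460 };

/* Hypothetical stand-ins for the BPF_SYNCOOKIE_XXX flags. */
enum {
	SKETCH_WSCALE	= 1 << 0,
	SKETCH_SACK	= 1 << 1,
	SKETCH_ECN	= 1 << 2,
};

#define MSS_BITS	2
#define FLAG_BITS	3
#define META_BITS	(MSS_BITS + FLAG_BITS)

/* GEN side: args[0] = MSS, args[1] = flags  ->  replylong[0] = ISN. */
static uint32_t sketch_gen_cookie(uint32_t tag, uint16_t mss, uint32_t flags)
{
	uint32_t idx = 0;

	/* Pick the largest table entry not exceeding the peer's MSS. */
	for (uint32_t i = 0; i < 4; i++)
		if (mss_table[i] <= mss)
			idx = i;

	return (tag << META_BITS) | ((flags & 7u) << MSS_BITS) | idx;
}

/* CHECK side: args[0] = ISN  ->  replylong[0] = MSS, replylong[1] = flags.
 * Returns 0 when the tag does not match, i.e. the cookie is rejected.
 */
static uint16_t sketch_check_cookie(uint32_t tag, uint32_t isn, uint32_t *flags)
{
	if ((isn >> META_BITS) != (tag & (0xffffffffu >> META_BITS)))
		return 0;

	*flags = (isn >> MSS_BITS) & 7u;
	return mss_table[isn & 3u];
}
```

In the kernel, the ISN handed to the validation side is recovered from the
ACK as ntohl(th->ack_seq) - 1.  Spending 5 of the 32 bits on metadata
leaves a 27-bit tag, so a real design would have to weigh that against
using TS val as a second storage.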

For details, please see each patch.  Here's an overview:

  patch 1 - 4 : Misc cleanup
  patch 5, 6  : Add SOCK_OPS hook (only ISN is available here)
  patch 7, 8  : Make TS val available as the second cookie storage
  patch 9, 10 : Make WScale, SACK, and ECN configurable from ACK
  patch 11    : selftest, need some help from BPF experts...

[0]: https://netdev.bots.linux.dev/netconf/2023/kuniyuki.pdf


Kuniyuki Iwashima (11):
  tcp: Clean up reverse xmas tree in cookie_v[46]_check().
  tcp: Cache sock_net(sk) in cookie_v[46]_check().
  tcp: Clean up goto labels in cookie_v[46]_check().
  tcp: Don't initialise tp->tsoffset in tcp_get_cookie_sock().
  bpf: tcp: Add SYN Cookie generation SOCK_OPS hook.
  bpf: tcp: Add SYN Cookie validation SOCK_OPS hook.
  bpf: Make bpf_sock_ops.replylong[1] writable.
  bpf: tcp: Make TS available for SYN Cookie storage.
  tcp: Split cookie_ecn_ok().
  bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie.
  selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB.

 include/net/inet_sock.h                       |   4 +-
 include/net/tcp.h                             |  46 +++-
 include/uapi/linux/bpf.h                      |  52 ++++-
 net/core/filter.c                             |   2 +-
 net/ipv4/syncookies.c                         | 219 +++++++++++-------
 net/ipv4/tcp_input.c                          |  53 ++++-
 net/ipv6/syncookies.c                         |  94 +++++---
 tools/include/uapi/linux/bpf.h                |  52 ++++-
 .../selftests/bpf/prog_tests/tcp_syncookie.c  |  84 +++++++
 .../selftests/bpf/progs/test_siphash.h        |  65 ++++++
 .../selftests/bpf/progs/test_tcp_syncookie.c  | 170 ++++++++++++++
 .../selftests/bpf/test_tcp_hdr_options.h      |   8 +-
 12 files changed, 715 insertions(+), 134 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_syncookie.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_siphash.h
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_syncookie.c

-- 
2.30.2



* [PATCH v1 bpf-next 01/11] tcp: Clean up reverse xmas tree in cookie_v[46]_check().
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 02/11] tcp: Cache sock_net(sk) " Kuniyuki Iwashima
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC

We will grow and cut the xmas tree a bit, so let's clean it up
to make later patches easy to review.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/syncookies.c | 10 +++++-----
 net/ipv6/syncookies.c | 12 ++++++------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index dc478a0574cb..174eaddc28b5 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -325,18 +325,18 @@ EXPORT_SYMBOL_GPL(cookie_tcp_reqsk_alloc);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 {
 	struct ip_options *opt = &TCP_SKB_CB(skb)->header.h4.opt;
+	const struct tcphdr *th = tcp_hdr(skb);
+	__u32 cookie = ntohl(th->ack_seq) - 1;
 	struct tcp_options_received tcp_opt;
+	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_request_sock *ireq;
 	struct tcp_request_sock *treq;
-	struct tcp_sock *tp = tcp_sk(sk);
-	const struct tcphdr *th = tcp_hdr(skb);
-	__u32 cookie = ntohl(th->ack_seq) - 1;
-	struct sock *ret = sk;
 	struct request_sock *req;
+	struct sock *ret = sk;
 	int full_space, mss;
+	struct flowi4 fl4;
 	struct rtable *rt;
 	__u8 rcv_wscale;
-	struct flowi4 fl4;
 	u32 tsoff = 0;
 
 	if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_syncookies) ||
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 5014aa663452..894d5ae312d1 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -127,17 +127,17 @@ EXPORT_SYMBOL_GPL(__cookie_v6_check);
 
 struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 {
+	const struct tcphdr *th = tcp_hdr(skb);
+	__u32 cookie = ntohl(th->ack_seq) - 1;
+	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct tcp_options_received tcp_opt;
+	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_request_sock *ireq;
 	struct tcp_request_sock *treq;
-	struct ipv6_pinfo *np = inet6_sk(sk);
-	struct tcp_sock *tp = tcp_sk(sk);
-	const struct tcphdr *th = tcp_hdr(skb);
-	__u32 cookie = ntohl(th->ack_seq) - 1;
-	struct sock *ret = sk;
 	struct request_sock *req;
-	int full_space, mss;
 	struct dst_entry *dst;
+	struct sock *ret = sk;
+	int full_space, mss;
 	__u8 rcv_wscale;
 	u32 tsoff = 0;
 
-- 
2.30.2



* [PATCH v1 bpf-next 02/11] tcp: Cache sock_net(sk) in cookie_v[46]_check().
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 01/11] tcp: Clean up reverse xmas tree in cookie_v[46]_check() Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 03/11] tcp: Clean up goto labels " Kuniyuki Iwashima
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC

sock_net(sk) is used repeatedly in cookie_v[46]_check().
Let's cache it in a variable.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/syncookies.c | 19 ++++++++++---------
 net/ipv6/syncookies.c | 17 +++++++++--------
 2 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 174eaddc28b5..64280cf42667 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -330,6 +330,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	struct tcp_options_received tcp_opt;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_request_sock *ireq;
+	struct net *net = sock_net(sk);
 	struct tcp_request_sock *treq;
 	struct request_sock *req;
 	struct sock *ret = sk;
@@ -339,7 +340,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	__u8 rcv_wscale;
 	u32 tsoff = 0;
 
-	if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_syncookies) ||
+	if (!READ_ONCE(net->ipv4.sysctl_tcp_syncookies) ||
 	    !th->ack || th->rst)
 		goto out;
 
@@ -348,24 +349,24 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 
 	mss = __cookie_v4_check(ip_hdr(skb), th, cookie);
 	if (mss == 0) {
-		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
+		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
 		goto out;
 	}
 
-	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
+	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(sock_net(sk), skb, &tcp_opt, 0, NULL);
+	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
 
 	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
-		tsoff = secure_tcp_ts_off(sock_net(sk),
+		tsoff = secure_tcp_ts_off(net,
 					  ip_hdr(skb)->daddr,
 					  ip_hdr(skb)->saddr);
 		tcp_opt.rcv_tsecr -= tsoff;
 	}
 
-	if (!cookie_timestamp_decode(sock_net(sk), &tcp_opt))
+	if (!cookie_timestamp_decode(net, &tcp_opt))
 		goto out;
 
 	ret = NULL;
@@ -402,7 +403,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	/* We throwed the options of the initial SYN away, so we hope
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
 	 */
-	RCU_INIT_POINTER(ireq->ireq_opt, tcp_v4_save_options(sock_net(sk), skb));
+	RCU_INIT_POINTER(ireq->ireq_opt, tcp_v4_save_options(net, skb));
 
 	if (security_inet_conn_request(sk, skb, req)) {
 		reqsk_free(req);
@@ -423,7 +424,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 			   opt->srr ? opt->faddr : ireq->ir_rmt_addr,
 			   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
 	security_req_classify_flow(req, flowi4_to_flowi_common(&fl4));
-	rt = ip_route_output_key(sock_net(sk), &fl4);
+	rt = ip_route_output_key(net, &fl4);
 	if (IS_ERR(rt)) {
 		reqsk_free(req);
 		goto out;
@@ -443,7 +444,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(&rt->dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale  = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, net, &rt->dst);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst, tsoff);
 	/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 894d5ae312d1..cbee2df8a006 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -133,6 +133,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	struct tcp_options_received tcp_opt;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_request_sock *ireq;
+	struct net *net = sock_net(sk);
 	struct tcp_request_sock *treq;
 	struct request_sock *req;
 	struct dst_entry *dst;
@@ -141,7 +142,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	__u8 rcv_wscale;
 	u32 tsoff = 0;
 
-	if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_syncookies) ||
+	if (!READ_ONCE(net->ipv4.sysctl_tcp_syncookies) ||
 	    !th->ack || th->rst)
 		goto out;
 
@@ -150,24 +151,24 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	mss = __cookie_v6_check(ipv6_hdr(skb), th, cookie);
 	if (mss == 0) {
-		__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
+		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
 		goto out;
 	}
 
-	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
+	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(sock_net(sk), skb, &tcp_opt, 0, NULL);
+	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
 
 	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
-		tsoff = secure_tcpv6_ts_off(sock_net(sk),
+		tsoff = secure_tcpv6_ts_off(net,
 					    ipv6_hdr(skb)->daddr.s6_addr32,
 					    ipv6_hdr(skb)->saddr.s6_addr32);
 		tcp_opt.rcv_tsecr -= tsoff;
 	}
 
-	if (!cookie_timestamp_decode(sock_net(sk), &tcp_opt))
+	if (!cookie_timestamp_decode(net, &tcp_opt))
 		goto out;
 
 	ret = NULL;
@@ -237,7 +238,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 		fl6.flowi6_uid = sk->sk_uid;
 		security_req_classify_flow(req, flowi6_to_flowi_common(&fl6));
 
-		dst = ip6_dst_lookup_flow(sock_net(sk), sk, &fl6, final_p);
+		dst = ip6_dst_lookup_flow(net, sk, &fl6, final_p);
 		if (IS_ERR(dst))
 			goto out_free;
 	}
@@ -255,7 +256,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst);
+	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, net, dst);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, dst, tsoff);
 out:
-- 
2.30.2



* [PATCH v1 bpf-next 03/11] tcp: Clean up goto labels in cookie_v[46]_check().
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 01/11] tcp: Clean up reverse xmas tree in cookie_v[46]_check() Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 02/11] tcp: Cache sock_net(sk) " Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-17  0:00   ` Kui-Feng Lee
  2023-10-13 22:04 ` [PATCH v1 bpf-next 04/11] tcp: Don't initialise tp->tsoffset in tcp_get_cookie_sock() Kuniyuki Iwashima
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC

We will add a SOCK_OPS hook to validate SYN Cookie.

We invoke the hook after allocating reqsk.  In case it fails,
we will respond with RST instead of just dropping the ACK.

Then, there would be more duplicated error handling patterns.
To avoid that, let's clean up goto labels.
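
The stacked-label shape this patch moves towards can be sketched in
isolation as below.  It is a simplified userspace model, not the kernel
function: struct sketch_req, sketch_check(), and its two boolean inputs
are hypothetical stand-ins for the reqsk, the LSM check, and the route
lookup.

```c
#include <stdlib.h>

struct sketch_req {
	int mss;
};

/* Returns a request on success, NULL when the ACK must be dropped. */
static struct sketch_req *sketch_check(int lsm_ok, int route_ok)
{
	struct sketch_req *req;

	req = malloc(sizeof(*req));
	if (!req)
		goto out_drop;		/* nothing allocated yet */

	if (!lsm_ok)
		goto out_free;		/* stand-in for the LSM hook failing */

	if (!route_ok)
		goto out_free;		/* stand-in for the route lookup failing */

	req->mss = 1460;
	return req;

out_free:
	free(req);			/* falls through to the drop path */
out_drop:
	return NULL;
}
```

Each later failure point only needs a single goto, so adding another one
(such as a failing BPF hook) does not duplicate the cleanup.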

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/syncookies.c | 22 +++++++++++-----------
 net/ipv6/syncookies.c |  4 ++--
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 64280cf42667..b0cf6f4d66d8 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -369,11 +369,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	if (!cookie_timestamp_decode(net, &tcp_opt))
 		goto out;
 
-	ret = NULL;
 	req = cookie_tcp_reqsk_alloc(&tcp_request_sock_ops,
 				     &tcp_request_sock_ipv4_ops, sk, skb);
 	if (!req)
-		goto out;
+		goto out_drop;
 
 	ireq = inet_rsk(req);
 	treq = tcp_rsk(req);
@@ -405,10 +404,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	 */
 	RCU_INIT_POINTER(ireq->ireq_opt, tcp_v4_save_options(net, skb));
 
-	if (security_inet_conn_request(sk, skb, req)) {
-		reqsk_free(req);
-		goto out;
-	}
+	if (security_inet_conn_request(sk, skb, req))
+		goto out_free;
 
 	req->num_retrans = 0;
 
@@ -425,10 +422,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 			   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
 	security_req_classify_flow(req, flowi4_to_flowi_common(&fl4));
 	rt = ip_route_output_key(net, &fl4);
-	if (IS_ERR(rt)) {
-		reqsk_free(req);
-		goto out;
-	}
+	if (IS_ERR(rt))
+		goto out_free;
 
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
@@ -452,5 +447,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	 */
 	if (ret)
 		inet_sk(ret)->cork.fl.u.ip4 = fl4;
-out:	return ret;
+out:
+	return ret;
+out_free:
+	reqsk_free(req);
+out_drop:
+	return NULL;
 }
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index cbee2df8a006..b8ef6efbb60e 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -171,11 +171,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	if (!cookie_timestamp_decode(net, &tcp_opt))
 		goto out;
 
-	ret = NULL;
 	req = cookie_tcp_reqsk_alloc(&tcp6_request_sock_ops,
 				     &tcp_request_sock_ipv6_ops, sk, skb);
 	if (!req)
-		goto out;
+		goto out_drop;
 
 	ireq = inet_rsk(req);
 	treq = tcp_rsk(req);
@@ -263,5 +262,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	return ret;
 out_free:
 	reqsk_free(req);
+out_drop:
 	return NULL;
 }
-- 
2.30.2



* [PATCH v1 bpf-next 04/11] tcp: Don't initialise tp->tsoffset in tcp_get_cookie_sock().
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 03/11] tcp: Clean up goto labels " Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 05/11] bpf: tcp: Add SYN Cookie generation SOCK_OPS hook Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC

When we create a full socket from SYN Cookie, we initialise
tcp_sk(sk)->tsoffset redundantly in tcp_get_cookie_sock() as
the field is inherited from tcp_rsk(req)->ts_off.

  cookie_v[46]_check
  |- treq->ts_off = 0 <-------------------------- (0)
  `- tcp_get_cookie_sock
     |- tcp_v[46]_syn_recv_sock
     |  `- tcp_create_openreq_child
     |     `- newtp->tsoffset = treq->ts_off <--- (1)
     `- tcp_sk(child)->tsoffset = tsoff <-------- (2)

Let's initialise tcp_rsk(req)->ts_off with the correct offset
and remove the second initialisation of tcp_sk(sk)->tsoffset.
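
As a toy model of the resulting flow (hypothetical struct and function
names, userspace C only): once the request carries the right offset,
step (1) alone gives the child the correct value and step (2) has
nothing left to fix.

```c
#include <stdint.h>

struct sketch_treq {
	uint32_t ts_off;
};

struct sketch_tp {
	uint32_t tsoffset;
};

/* Stand-in for step (1): the child inherits the offset from the request. */
static void sketch_create_child(struct sketch_tp *newtp,
				const struct sketch_treq *treq)
{
	newtp->tsoffset = treq->ts_off;
}

/* After the patch: set the offset once on the request, no later fix-up. */
static struct sketch_tp sketch_cookie_check(uint32_t tsoff)
{
	struct sketch_treq treq = { .ts_off = tsoff };
	struct sketch_tp child;

	sketch_create_child(&child, &treq);
	return child;
}
```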

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/tcp.h     | 2 +-
 net/ipv4/syncookies.c | 7 +++----
 net/ipv6/syncookies.c | 4 ++--
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 91688d0dadcd..676618c89bb7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -477,7 +477,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
 /* From syncookies.c */
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 				 struct request_sock *req,
-				 struct dst_entry *dst, u32 tsoff);
+				 struct dst_entry *dst);
 int __cookie_v4_check(const struct iphdr *iph, const struct tcphdr *th,
 		      u32 cookie);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index b0cf6f4d66d8..514f1a4abdee 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -198,7 +198,7 @@ EXPORT_SYMBOL_GPL(__cookie_v4_check);
 
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 				 struct request_sock *req,
-				 struct dst_entry *dst, u32 tsoff)
+				 struct dst_entry *dst)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sock *child;
@@ -208,7 +208,6 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 						 NULL, &own_req);
 	if (child) {
 		refcount_set(&req->rsk_refcnt, 1);
-		tcp_sk(child)->tsoffset = tsoff;
 		sock_rps_save_rxhash(child, skb);
 
 		if (rsk_drop_req(req)) {
@@ -378,7 +377,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	treq = tcp_rsk(req);
 	treq->rcv_isn		= ntohl(th->seq) - 1;
 	treq->snt_isn		= cookie;
-	treq->ts_off		= 0;
+	treq->ts_off		= tsoff;
 	treq->txhash		= net_tx_rndhash();
 	req->mss		= mss;
 	ireq->ir_num		= ntohs(th->dest);
@@ -441,7 +440,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	ireq->rcv_wscale  = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, net, &rt->dst);
 
-	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst, tsoff);
+	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
 	/* ip_queue_xmit() depends on our flow being setup
 	 * Normal sockets get it right from inet_csk_route_child_sock()
 	 */
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index b8ef6efbb60e..60bdc4d9150b 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -212,7 +212,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	treq->snt_synack	= 0;
 	treq->rcv_isn = ntohl(th->seq) - 1;
 	treq->snt_isn = cookie;
-	treq->ts_off = 0;
+	treq->ts_off = tsoff;
 	treq->txhash = net_tx_rndhash();
 	if (IS_ENABLED(CONFIG_SMC))
 		ireq->smc_ok = 0;
@@ -257,7 +257,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	ireq->rcv_wscale = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, net, dst);
 
-	ret = tcp_get_cookie_sock(sk, skb, req, dst, tsoff);
+	ret = tcp_get_cookie_sock(sk, skb, req, dst);
 out:
 	return ret;
 out_free:
-- 
2.30.2



* [PATCH v1 bpf-next 05/11] bpf: tcp: Add SYN Cookie generation SOCK_OPS hook.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 04/11] tcp: Don't initialise tp->tsoffset in tcp_get_cookie_sock() Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-18  0:54   ` Martin KaFai Lau
  2023-10-13 22:04 ` [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation " Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC

This patch adds a new SOCK_OPS hook to generate arbitrary SYN Cookie.

When the kernel sends SYN Cookie to a client, the hook is invoked with
bpf_sock_ops.op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB if the listener has
BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG set by bpf_sock_ops_cb_flags_set().

The BPF program can access the following information to encode into
ISN:

  bpf_sock_ops.sk      : 4-tuple
  bpf_sock_ops.skb     : TCP header
  bpf_sock_ops.args[0] : MSS

The program must encode MSS into ISN and set the ISN to
bpf_sock_ops.replylong[0]; the ISN will be looped back to the paired
hook added in the following patch.

Note that we do not call tcp_synq_overflow() so that the BPF program
can set its own expiration period.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/uapi/linux/bpf.h       | 18 +++++++++++++++-
 net/ipv4/tcp_input.c           | 38 +++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h | 18 +++++++++++++++-
 3 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7ba61b75bc0e..d3cc530613c0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6738,8 +6738,17 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
+	 *
+	 * The bpf prog will be called to encode MSS into SYN Cookie with
+	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
+	 *
+	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
+	 * input and output.
+	 */
+	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
 };
 
 /* List of known BPF sock_ops operators.
@@ -6852,6 +6861,13 @@ enum {
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_GEN_SYNCOOKIE_CB,	/* Generate SYN Cookie (ISN of
+					 * SYN+ACK).
+					 *
+					 * args[0]: MSS
+					 *
+					 * replylong[0]: ISN
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 584825ddd0a0..c86a737e4fe6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6966,6 +6966,37 @@ u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops,
 }
 EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
 
+#if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
+static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *req,
+					  struct sk_buff *skb, __u32 *isn)
+{
+	struct bpf_sock_ops_kern sock_ops;
+	int ret;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_GEN_SYNCOOKIE_CB;
+	sock_ops.sk = req_to_sk(req);
+	sock_ops.args[0] = req->mss;
+
+	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+
+	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+	if (ret)
+		return ret;
+
+	*isn = sock_ops.replylong[0];
+
+	return 0;
+}
+#else
+static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *req,
+					  struct sk_buff *skb, __u32 *isn)
+{
+	return 0;
+}
+#endif
+
 int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		     const struct tcp_request_sock_ops *af_ops,
 		     struct sock *sk, struct sk_buff *skb)
@@ -7062,7 +7093,12 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_ecn_create_request(req, skb, sk, dst);
 
 	if (want_cookie) {
-		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
+		if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG)) {
+			if (bpf_skops_cookie_init_sequence(sk, req, skb, &isn))
+				goto drop_and_release;
+		} else {
+			isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
+		}
 		if (!tmp_opt.tstamp_ok)
 			inet_rsk(req)->ecn_ok = 0;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7ba61b75bc0e..d3cc530613c0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6738,8 +6738,17 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
+	 *
+	 * The bpf prog will be called to encode MSS into SYN Cookie with
+	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
+	 *
+	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
+	 * input and output.
+	 */
+	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
 };
 
 /* List of known BPF sock_ops operators.
@@ -6852,6 +6861,13 @@ enum {
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_GEN_SYNCOOKIE_CB,	/* Generate SYN Cookie (ISN of
+					 * SYN+ACK).
+					 *
+					 * args[0]: MSS
+					 *
+					 * replylong[0]: ISN
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.30.2



* [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation SOCK_OPS hook.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 05/11] bpf: tcp: Add SYN Cookie generation SOCK_OPS hook Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-16 20:38   ` Stanislav Fomichev
  2023-10-17 16:52   ` Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 07/11] bpf: Make bpf_sock_ops.replylong[1] writable Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

This patch adds a new SOCK_OPS hook to validate an arbitrary SYN Cookie.

When the kernel receives an ACK for a SYN Cookie, the hook is invoked with
bpf_sock_ops.op == BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB if the listener has
BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG set by bpf_sock_ops_cb_flags_set().

The BPF program can access the following information to validate ISN:

  bpf_sock_ops.sk      : 4-tuple
  bpf_sock_ops.skb     : TCP header
  bpf_sock_ops.args[0] : ISN

The program must decode the MSS and store it in bpf_sock_ops.replylong[0].
If replylong[0] is left as 0, the cookie is treated as invalid and the
request is dropped.

By default, the kernel validates SYN Cookie before allocating reqsk, but
the hook is invoked after allocating reqsk to keep the user interface
consistent with BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
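
For illustration only, the GEN/CHECK contract described above can be
modelled in plain user-space C.  This is a sketch, not the kernel or BPF
code: toy_msstab, toy_tuple, toy_hash(), toy_gen_isn() and
toy_check_isn() are all hypothetical names, and the mixing function is
NOT cryptographically secure.  The only parts taken from this patch are
the inputs and outputs: GEN maps an MSS to an ISN, CHECK maps the ISN
back to an MSS, and 0 means the cookie is rejected.

```c
#include <stdint.h>

/* Hypothetical MSS table; the toy cookie stores an index into it. */
static const uint16_t toy_msstab[4] = { 536, 1300, 1440, 1460 };

/* Stand-in for the 4-tuple that the hook can read via bpf_sock_ops.sk. */
struct toy_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Toy keyed mix, for illustration only.  A real program would need a
 * proper keyed hash.
 */
static uint32_t toy_hash(const struct toy_tuple *t, uint32_t secret)
{
	uint32_t h = secret;

	h = (h ^ t->saddr) * 0x9e3779b1u;
	h = (h ^ t->daddr) * 0x9e3779b1u;
	h = (h ^ (((uint32_t)t->sport << 16) | t->dport)) * 0x9e3779b1u;

	return h ^ (h >> 15);
}

/* GEN side: input is the MSS (args[0]), output is the ISN (replylong[0]).
 * The low 2 bits carry the index of the largest table entry <= mss.
 */
static uint32_t toy_gen_isn(const struct toy_tuple *t, uint32_t secret,
			    uint16_t mss)
{
	uint32_t idx = 0;
	uint32_t i;

	for (i = 3; i > 0; i--) {
		if (mss >= toy_msstab[i]) {
			idx = i;
			break;
		}
	}

	return (toy_hash(t, secret) & ~3u) | idx;
}

/* CHECK side: input is the ISN (args[0]), output is the MSS
 * (replylong[0]).  Returning 0 models a failed validation.
 */
static uint32_t toy_check_isn(const struct toy_tuple *t, uint32_t secret,
			      uint32_t isn)
{
	if ((isn ^ toy_hash(t, secret)) & ~3u)
		return 0;

	return toy_msstab[isn & 3u];
}
```

Because the toy cookie depends only on the 4-tuple and a shared secret,
any node holding the same secret could validate it, which is the
property the cover letter is after.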

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/tcp.h              | 12 ++++++
 include/uapi/linux/bpf.h       | 20 +++++++---
 net/ipv4/syncookies.c          | 73 +++++++++++++++++++++++++++-------
 net/ipv6/syncookies.c          | 44 +++++++++++++-------
 tools/include/uapi/linux/bpf.h | 20 +++++++---
 5 files changed, 130 insertions(+), 39 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 676618c89bb7..90d95acdc34a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2158,6 +2158,18 @@ static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
 	return ops->cookie_init_seq(skb, mss);
 }
+
+#ifdef CONFIG_CGROUP_BPF
+int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
+			   struct sk_buff *skb);
+#else
+static inline int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
+					 struct sk_buff *skb)
+{
+	return 0;
+}
+#endif
+
 #else
 static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
 					 const struct sock *sk, struct sk_buff *skb,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d3cc530613c0..e6f1507d7895 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6738,13 +6738,16 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
-	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
+	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK
+	 * and validates ACK for SYN Cookie.
 	 *
-	 * The bpf prog will be called to encode MSS into SYN Cookie with
-	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
+	 * The bpf prog will be first called to encode MSS into SYN Cookie
+	 * with sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.  Then, the
+	 * bpf prog will be called to decode MSS from SYN Cookie with
+	 * sock_ops->op == BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
 	 *
-	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
-	 * input and output.
+	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
+	 * BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB for input and output.
 	 */
 	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
@@ -6868,6 +6871,13 @@ enum {
 					 *
 					 * replylong[0]: ISN
 					 */
+	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and set
+					 * MSS.
+					 *
+					 * args[0]: ISN
+					 *
+					 * replylong[0]: MSS
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 514f1a4abdee..b1dd415863ff 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -317,6 +317,37 @@ struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
 }
 EXPORT_SYMBOL_GPL(cookie_tcp_reqsk_alloc);
 
+#if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
+int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB;
+	sock_ops.sk = req_to_sk(req);
+	sock_ops.args[0] = tcp_rsk(req)->snt_isn;
+
+	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+
+	if (BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk))
+		goto err;
+
+	if (!sock_ops.replylong[0])
+		goto err;
+
+	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
+
+	return sock_ops.replylong[0];
+
+err:
+	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(bpf_skops_cookie_check);
+#endif
+
 /* On input, sk is a listener.
  * Output is listener if incoming packet would not create a child
  *           NULL if memory could not be allocated.
@@ -336,6 +367,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	int full_space, mss;
 	struct flowi4 fl4;
 	struct rtable *rt;
+	bool bpf_cookie;
 	__u8 rcv_wscale;
 	u32 tsoff = 0;
 
@@ -343,16 +375,19 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	    !th->ack || th->rst)
 		goto out;
 
-	if (tcp_synq_no_recent_overflow(sk))
-		goto out;
+	bpf_cookie = BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG);
+	if (!bpf_cookie) {
+		if (tcp_synq_no_recent_overflow(sk))
+			goto out;
 
-	mss = __cookie_v4_check(ip_hdr(skb), th, cookie);
-	if (mss == 0) {
-		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
-		goto out;
-	}
+		mss = __cookie_v4_check(ip_hdr(skb), th, cookie);
+		if (mss == 0) {
+			__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
+			goto out;
+		}
 
-	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
+		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
+	}
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
@@ -365,7 +400,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 		tcp_opt.rcv_tsecr -= tsoff;
 	}
 
-	if (!cookie_timestamp_decode(net, &tcp_opt))
+	if (!bpf_cookie && !cookie_timestamp_decode(net, &tcp_opt))
 		goto out;
 
 	req = cookie_tcp_reqsk_alloc(&tcp_request_sock_ops,
@@ -375,21 +410,31 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 
 	ireq = inet_rsk(req);
 	treq = tcp_rsk(req);
-	treq->rcv_isn		= ntohl(th->seq) - 1;
-	treq->snt_isn		= cookie;
-	treq->ts_off		= tsoff;
-	treq->txhash		= net_tx_rndhash();
-	req->mss		= mss;
 	ireq->ir_num		= ntohs(th->dest);
 	ireq->ir_rmt_port	= th->source;
+	treq->snt_isn		= cookie;
+
 	sk_rcv_saddr_set(req_to_sk(req), ip_hdr(skb)->daddr);
 	sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr);
+
+	if (bpf_cookie) {
+		mss = bpf_skops_cookie_check(sk, req, skb);
+		if (!mss) {
+			reqsk_free(req);
+			goto out;
+		}
+	}
+
+	req->mss		= mss;
 	ireq->ir_mark		= inet_request_mark(sk, skb);
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
 	ireq->wscale_ok		= tcp_opt.wscale_ok;
 	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
 	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
+	treq->rcv_isn		= ntohl(th->seq) - 1;
+	treq->ts_off		= tsoff;
+	treq->txhash		= net_tx_rndhash();
 	treq->snt_synack	= 0;
 	treq->tfo_listener	= false;
 
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 60bdc4d9150b..3e920e7eb5d3 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -139,6 +139,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	struct dst_entry *dst;
 	struct sock *ret = sk;
 	int full_space, mss;
+	bool bpf_cookie;
 	__u8 rcv_wscale;
 	u32 tsoff = 0;
 
@@ -146,16 +147,19 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	    !th->ack || th->rst)
 		goto out;
 
-	if (tcp_synq_no_recent_overflow(sk))
-		goto out;
+	bpf_cookie = BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG);
+	if (!bpf_cookie) {
+		if (tcp_synq_no_recent_overflow(sk))
+			goto out;
 
-	mss = __cookie_v6_check(ipv6_hdr(skb), th, cookie);
-	if (mss == 0) {
-		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
-		goto out;
-	}
+		mss = __cookie_v6_check(ipv6_hdr(skb), th, cookie);
+		if (mss == 0) {
+			__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESFAILED);
+			goto out;
+		}
 
-	__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
+		__NET_INC_STATS(net, LINUX_MIB_SYNCOOKIESRECV);
+	}
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
@@ -168,7 +172,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 		tcp_opt.rcv_tsecr -= tsoff;
 	}
 
-	if (!cookie_timestamp_decode(net, &tcp_opt))
+	if (!bpf_cookie && !cookie_timestamp_decode(net, &tcp_opt))
 		goto out;
 
 	req = cookie_tcp_reqsk_alloc(&tcp6_request_sock_ops,
@@ -177,17 +181,25 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 		goto out_drop;
 
 	ireq = inet_rsk(req);
+	ireq->ir_rmt_port = th->source;
+	ireq->ir_num = ntohs(th->dest);
+	ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
+	ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+
 	treq = tcp_rsk(req);
-	treq->tfo_listener = false;
+	treq->snt_isn = cookie;
+
+	if (bpf_cookie) {
+		mss = bpf_skops_cookie_check(sk, req, skb);
+		if (!mss) {
+			reqsk_free(req);
+			goto out;
+		}
+	}
 
 	if (security_inet_conn_request(sk, skb, req))
 		goto out_free;
 
-	req->mss = mss;
-	ireq->ir_rmt_port = th->source;
-	ireq->ir_num = ntohs(th->dest);
-	ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-	ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
 	if (ipv6_opt_accepted(sk, skb, &TCP_SKB_CB(skb)->header.h6) ||
 	    np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
 	    np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim) {
@@ -203,6 +215,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	ireq->ir_mark = inet_request_mark(sk, skb);
 
+	req->mss = mss;
 	req->num_retrans = 0;
 	ireq->snd_wscale	= tcp_opt.snd_wscale;
 	ireq->sack_ok		= tcp_opt.sack_ok;
@@ -210,6 +223,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
 	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 	treq->snt_synack	= 0;
+	treq->tfo_listener = false;
 	treq->rcv_isn = ntohl(th->seq) - 1;
 	treq->snt_isn = cookie;
 	treq->ts_off = tsoff;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d3cc530613c0..e6f1507d7895 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6738,13 +6738,16 @@ enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
-	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
+	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK
+	 * and validates ACK for SYN Cookie.
 	 *
-	 * The bpf prog will be called to encode MSS into SYN Cookie with
-	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
+	 * The bpf prog will be first called to encode MSS into SYN Cookie
+	 * with sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.  Then, the
+	 * bpf prog will be called to decode MSS from SYN Cookie with
+	 * sock_ops->op == BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
 	 *
-	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
-	 * input and output.
+	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
+	 * BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB for input and output.
 	 */
 	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
@@ -6868,6 +6871,13 @@ enum {
 					 *
 					 * replylong[0]: ISN
 					 */
+	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and set
+					 * MSS.
+					 *
+					 * args[0]: ISN
+					 *
+					 * replylong[0]: MSS
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
-- 
2.30.2



* [PATCH v1 bpf-next 07/11] bpf: Make bpf_sock_ops.replylong[1] writable.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation " Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 08/11] bpf: tcp: Make TS available for SYN Cookie storage Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

With the following patch, the BPF_SOCK_OPS_GEN_SYNCOOKIE_CB program
can utilise the TCP Timestamps option as additional storage to encode
client information.

To prepare for that, this patch makes bpf_sock_ops.replylong[1] writable
so that the program can pass the timestamp value through it.
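
As a rough user-space model of what the widened write range permits
(not the verifier code itself), the sketch below uses a hypothetical
struct mock_sock_ops that only mimics the start of struct bpf_sock_ops,
where reply and replylong[] share a union.  The size and alignment
checks are assumptions modelled on the surrounding access checks;
mock_write_ok() is an illustrative name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock: only the leading fields, reply aliases replylong[0]. */
struct mock_sock_ops {
	uint32_t op;
	union {
		uint32_t args[4];
		uint32_t reply;
		uint32_t replylong[4];
	};
};

/* Model of the write check after this patch: a 4-byte, aligned write is
 * accepted anywhere from reply up to the end of replylong[1].
 */
static bool mock_write_ok(size_t off, size_t size)
{
	size_t first = offsetof(struct mock_sock_ops, reply);
	size_t last = offsetof(struct mock_sock_ops, replylong[1]) +
		      sizeof(uint32_t) - 1;

	if (size != sizeof(uint32_t) || off % size)
		return false;

	return off >= first && off <= last;
}
```

In this model, replylong[0] and replylong[1] are writable while
replylong[2] and later stay read-only.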

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/core/filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index cc2e4babc85f..276abecf5d90 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9079,7 +9079,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 
 	if (type == BPF_WRITE) {
 		switch (off) {
-		case offsetof(struct bpf_sock_ops, reply):
+		case bpf_ctx_range_till(struct bpf_sock_ops, reply, replylong[1]):
 		case offsetof(struct bpf_sock_ops, sk_txhash):
 			if (size != size_default)
 				return false;
-- 
2.30.2



* [PATCH v1 bpf-next 08/11] bpf: tcp: Make TS available for SYN Cookie storage.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 07/11] bpf: Make bpf_sock_ops.replylong[1] writable Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 09/11] tcp: Split cookie_ecn_ok() Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

BPF_SOCK_OPS_GEN_SYNCOOKIE_CB can now encode more information into the
TS value via bpf_sock_ops.replylong[1], which is echoed back by the peer
and passed to BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB as bpf_sock_ops.args[1] for
validation.

After invoking the BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook, we set
inet_rsk(req)->bpf_cookie to 1 and save bpf_sock_ops.replylong[1] in
inet_rsk(req)->bpf_cookie_tsval.  Later in cookie_init_timestamp(),
we use bpf_cookie_tsval as the TS value if bpf_cookie is 1.

Also, we set tcp_rsk(req)->ts_off to 0 so that the generated TS value is
sent as is.  This removes host-specific bits from SYN Cookie for
scalability.  However, ts_off exists to randomise the TS value for
each peer for security reasons, so the TS value supplied by the BPF
program must itself look like a random number.  For example, initialise
TS with a random number first and use a few bits to encode client
information.
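
A minimal sketch of that last suggestion, in user-space C.  The names
and the choice of 6 bits are assumptions made for illustration; the
patch does not prescribe a layout.  The caller is expected to pass a
fresh random number as rnd.

```c
#include <stdint.h>

/* Hypothetical layout: low 6 bits carry client information, the
 * remaining 26 bits stay random.
 */
#define TOY_INFO_BITS	6u
#define TOY_INFO_MASK	((1u << TOY_INFO_BITS) - 1)

/* GEN side: value to place in replylong[1]. */
static uint32_t toy_encode_tsval(uint32_t rnd, uint32_t info)
{
	return (rnd & ~TOY_INFO_MASK) | (info & TOY_INFO_MASK);
}

/* CHECK side: recover the information from the echoed TS (args[1]). */
static uint32_t toy_decode_info(uint32_t tsecr)
{
	return tsecr & TOY_INFO_MASK;
}
```

The more bits are reserved for information, the fewer random bits
remain, so the split is a trade-off left to the program author.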

Before invoking BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB hook, we need not adjust
tcp_opt.rcv_tsecr as ts_off was 0 when sending the timestamp.  Also, we
need to initialise tcp_rsk(req)->ts_off with tcp_opt->rcv_tsecr -
tcp_ns_to_ts(tcp_clock_ns()) so that the timestamp after 3WHS will be the
initial TS + delta.

  SYN+ACK    : Initial TS

  After 3WHS : tcp_ns_to_ts(tcp_clock_ns()) + tp->tsoffset
               = tcp_ns_to_ts(tcp_clock_ns())   <-- In tcp_established_options()
                 + tcp_opt->rcv_tsecr
                 - tcp_ns_to_ts(tcp_clock_ns()) <-- When validating ACK
               = Initial TS + delta
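
The arithmetic above can be checked with a small user-space sketch.
toy_ts_off() and toy_tsval() are hypothetical helpers that stand in for
the kernel expressions; plain u32 clock values replace
tcp_ns_to_ts(tcp_clock_ns()).  Unsigned wraparound is what makes the
identity hold even when the initial TS is close to 2^32.

```c
#include <stdint.h>

/* When validating the ACK: rcv_tsecr echoes the initial TS that was
 * sent in SYN+ACK, so the offset is "initial TS - clock now".
 */
static uint32_t toy_ts_off(uint32_t rcv_tsecr, uint32_t now_at_ack)
{
	return rcv_tsecr - now_at_ack;	/* wraps modulo 2^32 */
}

/* After 3WHS: the TS value sent is "clock now + offset". */
static uint32_t toy_tsval(uint32_t now, uint32_t ts_off)
{
	return now + ts_off;		/* wraps modulo 2^32 */
}
```

With an initial TS of 0xfffffff0 and the ACK validated at clock 1000, a
segment sent 50 ticks later carries initial TS + 50, truncated to 32
bits.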

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/inet_sock.h        |  4 ++-
 include/net/tcp.h              |  5 ++--
 include/uapi/linux/bpf.h       | 12 ++++++---
 net/ipv4/syncookies.c          | 45 ++++++++++++++++++++++------------
 net/ipv4/tcp_input.c           |  4 +++
 net/ipv6/syncookies.c          | 23 +++++++++--------
 tools/include/uapi/linux/bpf.h | 12 ++++++---
 7 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 98e11958cdff..19b3ddcda0f8 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -87,8 +87,10 @@ struct inet_request_sock {
 				ecn_ok	   : 1,
 				acked	   : 1,
 				no_srccheck: 1,
-				smc_ok	   : 1;
+				smc_ok	   : 1,
+				bpf_cookie : 1;
 	u32                     ir_mark;
+	u32			bpf_cookie_tsval;
 	union {
 		struct ip_options_rcu __rcu	*ireq_opt;
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 90d95acdc34a..4fe19917db6c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2161,10 +2161,11 @@ static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
 
 #ifdef CONFIG_CGROUP_BPF
 int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
-			   struct sk_buff *skb);
+			   struct sk_buff *skb, struct tcp_options_received *tcp_opt);
 #else
 static inline int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
-					 struct sk_buff *skb)
+					 struct sk_buff *skb,
+					 struct tcp_options_received *tcp_opt)
 {
 	return 0;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e6f1507d7895..24f673d88c0d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6865,16 +6865,22 @@ enum {
 					 * earlier bpf-progs.
 					 */
 	BPF_SOCK_OPS_GEN_SYNCOOKIE_CB,	/* Generate SYN Cookie (ISN of
-					 * SYN+ACK).
+					 * SYN+ACK) and value of Timestamps
+					 * option.
 					 *
 					 * args[0]: MSS
 					 *
 					 * replylong[0]: ISN
+					 * replylong[1]: TS
+					 *
+					 * TS value must look like random
+					 * for security reasons.
 					 */
-	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and set
-					 * MSS.
+	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and TS and
+					 * set MSS.
 					 *
 					 * args[0]: ISN
+					 * args[1]: TS
 					 *
 					 * replylong[0]: MSS
 					 */
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index b1dd415863ff..f78566991e08 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -62,11 +62,12 @@ static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
  */
 u64 cookie_init_timestamp(struct request_sock *req, u64 now)
 {
-	struct inet_request_sock *ireq;
-	u32 ts, ts_now = tcp_ns_to_ts(now);
-	u32 options = 0;
+	struct inet_request_sock *ireq = inet_rsk(req);
+	u32 ts, ts_now;
+	u32 options;
 
-	ireq = inet_rsk(req);
+	if (ireq->bpf_cookie)
+		return ireq->bpf_cookie_tsval * (NSEC_PER_SEC / TCP_TS_HZ);
 
 	options = ireq->wscale_ok ? ireq->snd_wscale : TS_OPT_WSCALE_MASK;
 	if (ireq->sack_ok)
@@ -74,6 +75,7 @@ u64 cookie_init_timestamp(struct request_sock *req, u64 now)
 	if (ireq->ecn_ok)
 		options |= TS_OPT_ECN;
 
+	ts_now = tcp_ns_to_ts(now);
 	ts = ts_now & ~TSMASK;
 	ts |= options;
 	if (ts > ts_now) {
@@ -318,15 +320,25 @@ struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
 EXPORT_SYMBOL_GPL(cookie_tcp_reqsk_alloc);
 
 #if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
-int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
+int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_buff *skb,
+			   struct tcp_options_received *tcp_opt)
 {
 	struct bpf_sock_ops_kern sock_ops;
+	struct net *net = sock_net(sk);
+
+	if (tcp_opt->saw_tstamp) {
+		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
+			goto err;
+
+		tcp_rsk(req)->ts_off = tcp_opt->rcv_tsecr - tcp_ns_to_ts(tcp_clock_ns());
+	}
 
 	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 
 	sock_ops.op = BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB;
 	sock_ops.sk = req_to_sk(req);
 	sock_ops.args[0] = tcp_rsk(req)->snt_isn;
+	sock_ops.args[1] = tcp_opt->rcv_tsecr;
 
 	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
 
@@ -393,15 +405,17 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
 
-	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
-		tsoff = secure_tcp_ts_off(net,
-					  ip_hdr(skb)->daddr,
-					  ip_hdr(skb)->saddr);
-		tcp_opt.rcv_tsecr -= tsoff;
-	}
+	if (!bpf_cookie) {
+		if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+			tsoff = secure_tcp_ts_off(net,
+						  ip_hdr(skb)->daddr,
+						  ip_hdr(skb)->saddr);
+			tcp_opt.rcv_tsecr -= tsoff;
+		}
 
-	if (!bpf_cookie && !cookie_timestamp_decode(net, &tcp_opt))
-		goto out;
+		if (!cookie_timestamp_decode(net, &tcp_opt))
+			goto out;
+	}
 
 	req = cookie_tcp_reqsk_alloc(&tcp_request_sock_ops,
 				     &tcp_request_sock_ipv4_ops, sk, skb);
@@ -418,11 +432,13 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr);
 
 	if (bpf_cookie) {
-		mss = bpf_skops_cookie_check(sk, req, skb);
+		mss = bpf_skops_cookie_check(sk, req, skb, &tcp_opt);
 		if (!mss) {
 			reqsk_free(req);
 			goto out;
 		}
+	} else {
+		treq->ts_off = tsoff;
 	}
 
 	req->mss		= mss;
@@ -433,7 +449,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	ireq->tstamp_ok		= tcp_opt.saw_tstamp;
 	req->ts_recent		= tcp_opt.saw_tstamp ? tcp_opt.rcv_tsval : 0;
 	treq->rcv_isn		= ntohl(th->seq) - 1;
-	treq->ts_off		= tsoff;
 	treq->txhash		= net_tx_rndhash();
 	treq->snt_synack	= 0;
 	treq->tfo_listener	= false;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c86a737e4fe6..feb44bff29ef 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6987,6 +6987,10 @@ static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *
 
 	*isn = sock_ops.replylong[0];
 
+	inet_rsk(req)->bpf_cookie = 1;
+	inet_rsk(req)->bpf_cookie_tsval = sock_ops.replylong[1];
+	tcp_rsk(req)->ts_off = 0;
+
 	return 0;
 }
 #else
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 3e920e7eb5d3..b0a7ea75a504 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -165,15 +165,17 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(net, skb, &tcp_opt, 0, NULL);
 
-	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
-		tsoff = secure_tcpv6_ts_off(net,
-					    ipv6_hdr(skb)->daddr.s6_addr32,
-					    ipv6_hdr(skb)->saddr.s6_addr32);
-		tcp_opt.rcv_tsecr -= tsoff;
-	}
+	if (!bpf_cookie) {
+		if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+			tsoff = secure_tcpv6_ts_off(net,
+						    ipv6_hdr(skb)->daddr.s6_addr32,
+						    ipv6_hdr(skb)->saddr.s6_addr32);
+			tcp_opt.rcv_tsecr -= tsoff;
+		}
 
-	if (!bpf_cookie && !cookie_timestamp_decode(net, &tcp_opt))
-		goto out;
+		if (!cookie_timestamp_decode(net, &tcp_opt))
+			goto out;
+	}
 
 	req = cookie_tcp_reqsk_alloc(&tcp6_request_sock_ops,
 				     &tcp_request_sock_ipv6_ops, sk, skb);
@@ -190,11 +192,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	treq->snt_isn = cookie;
 
 	if (bpf_cookie) {
-		mss = bpf_skops_cookie_check(sk, req, skb);
+		mss = bpf_skops_cookie_check(sk, req, skb, &tcp_opt);
 		if (!mss) {
 			reqsk_free(req);
 			goto out;
 		}
+	} else {
+		treq->ts_off = tsoff;
 	}
 
 	if (security_inet_conn_request(sk, skb, req))
@@ -226,7 +230,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	treq->tfo_listener = false;
 	treq->rcv_isn = ntohl(th->seq) - 1;
 	treq->snt_isn = cookie;
-	treq->ts_off = tsoff;
 	treq->txhash = net_tx_rndhash();
 	if (IS_ENABLED(CONFIG_SMC))
 		ireq->smc_ok = 0;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e6f1507d7895..24f673d88c0d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6865,16 +6865,22 @@ enum {
 					 * earlier bpf-progs.
 					 */
 	BPF_SOCK_OPS_GEN_SYNCOOKIE_CB,	/* Generate SYN Cookie (ISN of
-					 * SYN+ACK).
+					 * SYN+ACK) and value of Timestamps
+					 * option.
 					 *
 					 * args[0]: MSS
 					 *
 					 * replylong[0]: ISN
+					 * replylong[1]: TS
+					 *
+					 * TS value must look like random
+					 * for security reasons.
 					 */
-	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and set
-					 * MSS.
+	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and TS and
+					 * set MSS.
 					 *
 					 * args[0]: ISN
+					 * args[1]: TS
 					 *
 					 * replylong[0]: MSS
 					 */
-- 
2.30.2



* [PATCH v1 bpf-next 09/11] tcp: Split cookie_ecn_ok().
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 08/11] bpf: tcp: Make TS available for SYN Cookie storage Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-13 22:04 ` [PATCH v1 bpf-next 10/11] bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

This patch is a prerequisite for the following change.

For non-BPF SYN Cookie, inet_rsk(req)->ecn_ok is initialised with
tcp_opt->rcv_tsecr & TS_OPT_ECN.  OTOH, we will initialise it
differently for BPF SYN Cookie.

Then, inet_rsk(req)->ecn_ok is updated with net->ipv4.sysctl_tcp_ecn
or dst_feature(dst, RTAX_FEATURE_ECN).

Now cookie_ecn_ok() is just a one-liner, but TS_OPT_ECN is only available
in net/ipv4/syncookies.c.  Instead of exporting the function, we move
it and TS_OPT_XXX to tcp.h.
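
To see that the split keeps the non-BPF behaviour, here is a user-space
sketch comparing the old single function with the new two-step form.
The toy_* functions are illustrative; sysctl_tcp_ecn and dst_ecn are
plain booleans standing in for READ_ONCE(net->ipv4.sysctl_tcp_ecn) and
dst_feature(dst, RTAX_FEATURE_ECN).

```c
#include <stdbool.h>
#include <stdint.h>

#define TS_OPT_ECN	(1u << 5)

/* Old form: one function doing all three checks. */
static bool toy_old_ecn_ok(uint32_t rcv_tsecr, bool sysctl_tcp_ecn,
			   bool dst_ecn)
{
	if (!(rcv_tsecr & TS_OPT_ECN))
		return false;

	if (sysctl_tcp_ecn)
		return true;

	return dst_ecn;
}

/* New form: initialise from the echoed TS, then mask later. */
static bool toy_new_ecn_ok(uint32_t rcv_tsecr, bool sysctl_tcp_ecn,
			   bool dst_ecn)
{
	bool ecn_ok = rcv_tsecr & TS_OPT_ECN;	/* cookie_ecn_ok() */

	ecn_ok &= sysctl_tcp_ecn || dst_ecn;	/* applied once dst is known */

	return ecn_ok;
}

/* Exhaustively compare both forms over all 8 input combinations. */
static bool toy_ecn_split_equivalent(void)
{
	int i;

	for (i = 0; i < 8; i++) {
		uint32_t tsecr = (i & 1) ? TS_OPT_ECN : 0;
		bool sysctl = i & 2;
		bool dst = i & 4;

		if (toy_old_ecn_ok(tsecr, sysctl, dst) !=
		    toy_new_ecn_ok(tsecr, sysctl, dst))
			return false;
	}

	return true;
}
```

The point of the split is that the first step can later be replaced for
BPF SYN Cookie while the second step stays shared.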

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/tcp.h     | 31 +++++++++++++++++++++++++++++--
 net/ipv4/syncookies.c | 43 +++----------------------------------------
 net/ipv6/syncookies.c |  4 +++-
 3 files changed, 35 insertions(+), 43 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4fe19917db6c..143f47c28312 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -569,8 +569,35 @@ __u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mss);
 u64 cookie_init_timestamp(struct request_sock *req, u64 now);
 bool cookie_timestamp_decode(const struct net *net,
 			     struct tcp_options_received *opt);
-bool cookie_ecn_ok(const struct tcp_options_received *opt,
-		   const struct net *net, const struct dst_entry *dst);
+
+/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
+ * stores TCP options:
+ *
+ * MSB                               LSB
+ * | 31 ...   6 |  5  |  4   | 3 2 1 0 |
+ * |  Timestamp | ECN | SACK | WScale  |
+ *
+ * When we receive a valid cookie-ACK, we look at the echoed tsval (if
+ * any) to figure out which TCP options we should use for the rebuilt
+ * connection.
+ *
+ * A WScale setting of '0xf' (which is an invalid scaling value)
+ * means that original syn did not include the TCP window scaling option.
+ */
+#define TS_OPT_WSCALE_MASK	0xf
+#define TS_OPT_SACK		BIT(4)
+#define TS_OPT_ECN		BIT(5)
+/* There is no TS_OPT_TIMESTAMP:
+ * if ACK contains timestamp option, we already know it was
+ * requested/supported by the syn/synack exchange.
+ */
+#define TSBITS	6
+#define TSMASK	(((__u32)1 << TSBITS) - 1)
+
+static inline bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt)
+{
+	return tcp_opt->rcv_tsecr & TS_OPT_ECN;
+}
 
 /* From net/ipv6/syncookies.c */
 int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index f78566991e08..ff979cc314da 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -19,30 +19,6 @@ static siphash_aligned_key_t syncookie_secret[2];
 #define COOKIEBITS 24	/* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
 
-/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
- * stores TCP options:
- *
- * MSB                               LSB
- * | 31 ...   6 |  5  |  4   | 3 2 1 0 |
- * |  Timestamp | ECN | SACK | WScale  |
- *
- * When we receive a valid cookie-ACK, we look at the echoed tsval (if
- * any) to figure out which TCP options we should use for the rebuilt
- * connection.
- *
- * A WScale setting of '0xf' (which is an invalid scaling value)
- * means that original syn did not include the TCP window scaling option.
- */
-#define TS_OPT_WSCALE_MASK	0xf
-#define TS_OPT_SACK		BIT(4)
-#define TS_OPT_ECN		BIT(5)
-/* There is no TS_OPT_TIMESTAMP:
- * if ACK contains timestamp option, we already know it was
- * requested/supported by the syn/synack exchange.
- */
-#define TSBITS	6
-#define TSMASK	(((__u32)1 << TSBITS) - 1)
-
 static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
 		       u32 count, int c)
 {
@@ -266,21 +242,6 @@ bool cookie_timestamp_decode(const struct net *net,
 }
 EXPORT_SYMBOL(cookie_timestamp_decode);
 
-bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
-		   const struct net *net, const struct dst_entry *dst)
-{
-	bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
-
-	if (!ecn_ok)
-		return false;
-
-	if (READ_ONCE(net->ipv4.sysctl_tcp_ecn))
-		return true;
-
-	return dst_feature(dst, RTAX_FEATURE_ECN);
-}
-EXPORT_SYMBOL(cookie_ecn_ok);
-
 struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
 					    const struct tcp_request_sock_ops *af_ops,
 					    struct sock *sk,
@@ -438,6 +399,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 			goto out;
 		}
 	} else {
+		ireq->ecn_ok = cookie_ecn_ok(&tcp_opt);
 		treq->ts_off = tsoff;
 	}
 
@@ -498,7 +460,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(&rt->dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale  = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, net, &rt->dst);
+	ireq->ecn_ok &= READ_ONCE(net->ipv4.sysctl_tcp_ecn) ||
+		dst_feature(&rt->dst, RTAX_FEATURE_ECN);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
 	/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index b0a7ea75a504..f4c0cb463e02 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -198,6 +198,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 			goto out;
 		}
 	} else {
+		ireq->ecn_ok = cookie_ecn_ok(&tcp_opt);
 		treq->ts_off = tsoff;
 	}
 
@@ -272,7 +273,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 				  dst_metric(dst, RTAX_INITRWND));
 
 	ireq->rcv_wscale = rcv_wscale;
-	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, net, dst);
+	ireq->ecn_ok &= READ_ONCE(net->ipv4.sysctl_tcp_ecn) ||
+		dst_feature(dst, RTAX_FEATURE_ECN);
 
 	ret = tcp_get_cookie_sock(sk, skb, req, dst);
 out:
-- 
2.30.2



* [PATCH v1 bpf-next 10/11] bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 09/11] tcp: Split cookie_ecn_ok() Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-18  1:08   ` Martin KaFai Lau
  2023-10-13 22:04 ` [PATCH v1 bpf-next 11/11] selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

This patch allows the BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB hook to enable WScale,
SACK, and ECN by setting the corresponding flags in bpf_sock_ops.replylong[1].

The same flags are passed to the BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook as
bpf_sock_ops.args[1] so that the BPF prog need not parse the TCP header to
check if WScale, SACK, ECN, and TS are available in SYN.
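
For illustration only (not part of this patch), the flag word described here
could be decoded and re-encoded roughly as follows.  This is a minimal
user-space sketch: the enum values are copied from the uapi hunk in this
patch, while struct syn_opts, decode_syn_opts() and encode_syn_opts() are
hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the uapi hunk in this patch. */
enum {
	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
	BPF_SYNCOOKIE_SACK		= (1 << 4),
	BPF_SYNCOOKIE_ECN		= (1 << 5),
	BPF_SYNCOOKIE_TS		= (1 << 6),
};

/* Hypothetical container for the decoded options (illustration only). */
struct syn_opts {
	bool wscale_ok;
	uint8_t snd_wscale;
	bool sack_ok;
	bool ecn_ok;
	bool tstamp_ok;
};

/* Split the flag word into its fields. */
static struct syn_opts decode_syn_opts(uint32_t options)
{
	struct syn_opts o = { 0 };
	uint32_t ws = options & BPF_SYNCOOKIE_WSCALE_MASK;

	/* 0xf in the low nibble means that SYN carried no WScale option. */
	if (ws != BPF_SYNCOOKIE_WSCALE_MASK) {
		o.wscale_ok = true;
		o.snd_wscale = (uint8_t)ws;
	}

	o.sack_ok = !!(options & BPF_SYNCOOKIE_SACK);
	o.ecn_ok = !!(options & BPF_SYNCOOKIE_ECN);
	o.tstamp_ok = !!(options & BPF_SYNCOOKIE_TS);

	return o;
}

/* Build the flag word back from the fields. */
static uint32_t encode_syn_opts(const struct syn_opts *o)
{
	uint32_t options;

	/* A real shift count never collides with the 0xf sentinel. */
	assert(!o->wscale_ok || o->snd_wscale < BPF_SYNCOOKIE_WSCALE_MASK);

	options = o->wscale_ok ? o->snd_wscale : BPF_SYNCOOKIE_WSCALE_MASK;
	if (o->sack_ok)
		options |= BPF_SYNCOOKIE_SACK;
	if (o->ecn_ok)
		options |= BPF_SYNCOOKIE_ECN;
	if (o->tstamp_ok)
		options |= BPF_SYNCOOKIE_TS;

	return options;
}
```

For example, 0x37 would decode to WScale 7 with SACK and ECN set and TS
clear, and 0x4f to "no WScale, TS only".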

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/uapi/linux/bpf.h       | 18 ++++++++++++++++++
 net/ipv4/syncookies.c          | 20 ++++++++++++++++++++
 net/ipv4/tcp_input.c           | 11 +++++++++++
 tools/include/uapi/linux/bpf.h | 18 ++++++++++++++++++
 4 files changed, 67 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 24f673d88c0d..cdae4dd5d797 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6869,6 +6869,7 @@ enum {
 					 * option.
 					 *
 					 * args[0]: MSS
+					 * args[1]: BPF_SYNCOOKIE_XXX
 					 *
 					 * replylong[0]: ISN
 					 * replylong[1]: TS
@@ -6883,6 +6884,7 @@ enum {
 					 * args[1]: TS
 					 *
 					 * replylong[0]: MSS
+					 * replylong[1]: BPF_SYNCOOKIE_XXX
 					 */
 };
 
@@ -6970,6 +6972,22 @@ enum {
 						 */
 };
 
+/* args[1] value for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
+ * replylong[1] for BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
+ *
+ * MSB                                LSB
+ * | 31 ... | 6  | 5   | 4    | 3 2 1 0 |
+ * |    ... | TS | ECN | SACK | WScale  |
+ */
+enum {
+	/* 0xf is invalid and thus means that SYN did not have WScale. */
+	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
+	BPF_SYNCOOKIE_SACK		= (1 << 4),
+	BPF_SYNCOOKIE_ECN		= (1 << 5),
+	/* Only available for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB to check if SYN has TS */
+	BPF_SYNCOOKIE_TS		= (1 << 6),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index ff979cc314da..22353a9af52d 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -286,6 +286,7 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
 {
 	struct bpf_sock_ops_kern sock_ops;
 	struct net *net = sock_net(sk);
+	u32 options;
 
 	if (tcp_opt->saw_tstamp) {
 		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
@@ -309,6 +310,25 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
 	if (!sock_ops.replylong[0])
 		goto err;
 
+	options = sock_ops.replylong[1];
+
+	if ((options & BPF_SYNCOOKIE_WSCALE_MASK) != BPF_SYNCOOKIE_WSCALE_MASK) {
+		if (!READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
+			goto err;
+
+		tcp_opt->wscale_ok = 1;
+		tcp_opt->snd_wscale = options & BPF_SYNCOOKIE_WSCALE_MASK;
+	}
+
+	if (options & BPF_SYNCOOKIE_SACK) {
+		if (!READ_ONCE(net->ipv4.sysctl_tcp_sack))
+			goto err;
+
+		tcp_opt->sack_ok = 1;
+	}
+
+	inet_rsk(req)->ecn_ok = options & BPF_SYNCOOKIE_ECN;
+
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
 
 	return sock_ops.replylong[0];
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index feb44bff29ef..483e2f36afe5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6970,14 +6970,25 @@ EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
 static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *req,
 					  struct sk_buff *skb, __u32 *isn)
 {
+	struct inet_request_sock *ireq = inet_rsk(req);
 	struct bpf_sock_ops_kern sock_ops;
+	u32 options;
 	int ret;
 
+	options = ireq->wscale_ok ? ireq->snd_wscale : BPF_SYNCOOKIE_WSCALE_MASK;
+	if (ireq->sack_ok)
+		options |= BPF_SYNCOOKIE_SACK;
+	if (ireq->ecn_ok)
+		options |= BPF_SYNCOOKIE_ECN;
+	if (ireq->tstamp_ok)
+		options |= BPF_SYNCOOKIE_TS;
+
 	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 
 	sock_ops.op = BPF_SOCK_OPS_GEN_SYNCOOKIE_CB;
 	sock_ops.sk = req_to_sk(req);
 	sock_ops.args[0] = req->mss;
+	sock_ops.args[1] = options;
 
 	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 24f673d88c0d..cdae4dd5d797 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6869,6 +6869,7 @@ enum {
 					 * option.
 					 *
 					 * args[0]: MSS
+					 * args[1]: BPF_SYNCOOKIE_XXX
 					 *
 					 * replylong[0]: ISN
 					 * replylong[1]: TS
@@ -6883,6 +6884,7 @@ enum {
 					 * args[1]: TS
 					 *
 					 * replylong[0]: MSS
+					 * replylong[1]: BPF_SYNCOOKIE_XXX
 					 */
 };
 
@@ -6970,6 +6972,22 @@ enum {
 						 */
 };
 
+/* args[1] value for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
+ * replylong[1] for BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
+ *
+ * MSB                                LSB
+ * | 31 ... | 6  | 5   | 4    | 3 2 1 0 |
+ * |    ... | TS | ECN | SACK | WScale  |
+ */
+enum {
+	/* 0xf is invalid and thus means that SYN did not have WScale. */
+	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
+	BPF_SYNCOOKIE_SACK		= (1 << 4),
+	BPF_SYNCOOKIE_ECN		= (1 << 5),
+	/* Only available for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB to check if SYN has TS */
+	BPF_SYNCOOKIE_TS		= (1 << 6),
+};
+
 struct bpf_perf_event_value {
 	__u64 counter;
 	__u64 enabled;
-- 
2.30.2



* [PATCH v1 bpf-next 11/11] selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 10/11] bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie Kuniyuki Iwashima
@ 2023-10-13 22:04 ` Kuniyuki Iwashima
  2023-10-17  5:50   ` Martin KaFai Lau
  2023-10-16 13:05 ` [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Daniel Borkmann
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-13 22:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

This patch adds a test for BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB hooks.

The BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook generates a SipHash of the 4-tuple.
The hash is split between ISN and TS.  The MSS index, ECN, SACK, and WScale
are encoded into the lower 8 bits of ISN.

  ISN:
    MSB                                   LSB
    | 31 ... 8 | 7 6 | 5   | 4    | 3 2 1 0 |
    | Hash_1   | MSS | ECN | SACK | WScale  |

  TS:
    MSB                LSB
    | 31 ... 8 | 7 ... 0 |
    | Random   | Hash_2  |

BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB hook re-calculates the hash and validates
the cookie.
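
As a reading aid (not part of the selftest), the two layouts above can be
sketched in plain user-space C.  The bit positions follow the diagrams;
the helper names pack_isn(), isn_mssind(), isn_opts(), isn_hash_ok(),
pack_ts() and ts_hash_ok() are hypothetical, and the hash is treated as an
opaque 32-bit input rather than being computed with SipHash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COOKIE_BITS	8
#define COOKIE_MASK	(((uint32_t)1 << COOKIE_BITS) - 1)
#define OPT_MASK	0x3fu	/* ECN (bit 5) | SACK (bit 4) | WScale (bits 3..0) */

/* ISN: upper 24 bits of the hash, 2-bit MSS index, 6 option bits. */
static uint32_t pack_isn(uint32_t hash, unsigned int mssind, uint32_t opts)
{
	assert(mssind < 4);	/* only 2 bits are reserved for the index */

	return (hash & ~COOKIE_MASK) | (mssind << 6) | (opts & OPT_MASK);
}

static unsigned int isn_mssind(uint32_t isn)
{
	return (isn >> 6) & 3;
}

static uint32_t isn_opts(uint32_t isn)
{
	return isn & OPT_MASK;
}

/* The upper 24 bits of ISN must match the recomputed hash. */
static bool isn_hash_ok(uint32_t isn, uint32_t hash)
{
	return ((isn ^ hash) & ~COOKIE_MASK) == 0;
}

/* TS: upper 24 random bits, lower 8 bits of the hash. */
static uint32_t pack_ts(uint32_t random, uint32_t hash)
{
	return (random & ~COOKIE_MASK) | (hash & COOKIE_MASK);
}

/* The lower 8 bits of TS must match the recomputed hash. */
static bool ts_hash_ok(uint32_t ts, uint32_t hash)
{
	return ((ts ^ hash) & COOKIE_MASK) == 0;
}
```

With this split, a client that negotiates timestamps is checked against all
32 bits of the hash (24 in ISN, 8 in TS), while a client without timestamps
is checked against the upper 24 bits only.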

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
Currently, the validator is incomplete...

If this line is changed

    skops->replylong[0] = msstab[3];

to

    skops->replylong[0] = msstab[mssind];

we get the error below during make:

    GEN-SKEL [test_progs] test_tcp_syncookie.skel.h
  ...
  Error: failed to open BPF object file: No such file or directory
    GEN-SKEL [test_progs-no_alu32] test_tcp_syncookie.skel.h
  make: *** [Makefile:603: /home/ec2-user/kernel/bpf_syncookie/tools/testing/selftests/bpf/test_tcp_syncookie.skel.h] Error 254
  make: *** Deleting file '/home/ec2-user/kernel/bpf_syncookie/tools/testing/selftests/bpf/test_tcp_syncookie.skel.h'
  make: *** Waiting for unfinished jobs....
---
 .../selftests/bpf/prog_tests/tcp_syncookie.c  |  84 +++++++++
 .../selftests/bpf/progs/test_siphash.h        |  65 +++++++
 .../selftests/bpf/progs/test_tcp_syncookie.c  | 170 ++++++++++++++++++
 .../selftests/bpf/test_tcp_hdr_options.h      |   8 +-
 4 files changed, 326 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_syncookie.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_siphash.h
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_syncookie.c

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_syncookie.c b/tools/testing/selftests/bpf/prog_tests/tcp_syncookie.c
new file mode 100644
index 000000000000..53af1434fc2c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_syncookie.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_syncookie.skel.h"
+
+static int setup_netns(void)
+{
+	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev lo up"), "system"))
+		return -1;
+
+	if (!ASSERT_OK(write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "2"),
+		       "write_sysctl(tcp_syncookies)"))
+		return -1;
+
+	if (!ASSERT_OK(write_sysctl("/proc/sys/net/ipv4/tcp_ecn", "1"),
+		       "write_sysctl(tcp_ecn)"))
+		return -1;
+
+	return 0;
+}
+
+static void create_connection(void)
+{
+	int server, client, child;
+
+	server = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0, 0);
+	if (!ASSERT_NEQ(server, -1, "start_server"))
+		return;
+
+	client = connect_to_fd(server, 0);
+	if (!ASSERT_NEQ(client, -1, "connect_to_fd"))
+		goto close_server;
+
+	child = accept(server, NULL, 0);
+	if (!ASSERT_NEQ(child, -1, "accept"))
+		goto close_client;
+
+	close(child);
+close_client:
+	close(client);
+close_server:
+	close(server);
+}
+
+void test_tcp_syncookie(void)
+{
+	struct test_tcp_syncookie *skel;
+	struct bpf_link *link;
+	int cgroup;
+
+	if (setup_netns())
+		return;
+
+	skel = test_tcp_syncookie__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	cgroup = test__join_cgroup("/tcp_syncookie");
+	if (!ASSERT_GE(cgroup, 0, "join_cgroup"))
+		goto destroy_skel;
+
+	link = bpf_program__attach_cgroup(skel->progs.syncookie, cgroup);
+	if (!ASSERT_OK_PTR(link, "attach_cgroup"))
+		goto close_cgroup;
+
+	create_connection();
+
+	bpf_link__destroy(link);
+
+close_cgroup:
+	close(cgroup);
+destroy_skel:
+	test_tcp_syncookie__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_siphash.h b/tools/testing/selftests/bpf/progs/test_siphash.h
new file mode 100644
index 000000000000..e36de63fdbaa
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_siphash.h
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+/* include/linux/bitops.h */
+static __always_inline __u64 rol64(__u64 word, unsigned int shift)
+{
+	return (word << (shift & 63)) | (word >> ((-shift) & 63));
+}
+
+/* include/linux/siphash.h */
+typedef struct {
+	__u64 key[2];
+} siphash_key_t;
+
+#define SIPHASH_PERMUTATION(a, b, c, d) ( \
+	(a) += (b), (b) = rol64((b), 13), (b) ^= (a), (a) = rol64((a), 32), \
+	(c) += (d), (d) = rol64((d), 16), (d) ^= (c), \
+	(a) += (d), (d) = rol64((d), 21), (d) ^= (a), \
+	(c) += (b), (b) = rol64((b), 17), (b) ^= (c), (c) = rol64((c), 32))
+
+#define SIPHASH_CONST_0 0x736f6d6570736575ULL
+#define SIPHASH_CONST_1 0x646f72616e646f6dULL
+#define SIPHASH_CONST_2 0x6c7967656e657261ULL
+#define SIPHASH_CONST_3 0x7465646279746573ULL
+
+/* lib/siphash.c */
+#define SIPROUND SIPHASH_PERMUTATION(v0, v1, v2, v3)
+
+#define PREAMBLE(len) \
+	__u64 v0 = SIPHASH_CONST_0; \
+	__u64 v1 = SIPHASH_CONST_1; \
+	__u64 v2 = SIPHASH_CONST_2; \
+	__u64 v3 = SIPHASH_CONST_3; \
+	__u64 b = ((__u64)(len)) << 56; \
+	v3 ^= key->key[1]; \
+	v2 ^= key->key[0]; \
+	v1 ^= key->key[1]; \
+	v0 ^= key->key[0];
+
+#define POSTAMBLE \
+	v3 ^= b; \
+	SIPROUND; \
+	SIPROUND; \
+	v0 ^= b; \
+	v2 ^= 0xff; \
+	SIPROUND; \
+	SIPROUND; \
+	SIPROUND; \
+	SIPROUND; \
+	return (v0 ^ v1) ^ (v2 ^ v3);
+
+static __always_inline __u64 siphash_2u64(const __u64 first, const __u64 second,
+					  const siphash_key_t *key)
+{
+	PREAMBLE(16)
+	v3 ^= first;
+	SIPROUND;
+	SIPROUND;
+	v0 ^= first;
+	v3 ^= second;
+	SIPROUND;
+	SIPROUND;
+	v0 ^= second;
+	POSTAMBLE
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_syncookie.c
new file mode 100644
index 000000000000..5d1fc928602b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_syncookie.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/tcp.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#define BPF_PROG_TEST_TCP_HDR_OPTIONS
+#include "test_tcp_hdr_options.h"
+#include "test_siphash.h"
+
+#define ARRAY_SIZE(arr)	(sizeof(arr) / sizeof((arr)[0]))
+
+static int assert_gen_syncookie_cb(struct bpf_sock_ops *skops)
+{
+	struct tcp_opt tcp_opt;
+	int ret;
+
+	tcp_opt.kind = TCPOPT_WINDOW;
+	tcp_opt.len = 0;
+
+	ret = bpf_load_hdr_opt(skops, &tcp_opt, TCPOLEN_WINDOW, 0);
+	if (ret != TCPOLEN_WINDOW ||
+	    tcp_opt.data[0] != (skops->args[1] & BPF_SYNCOOKIE_WSCALE_MASK))
+		goto err;
+
+	tcp_opt.kind = TCPOPT_SACK_PERM;
+	tcp_opt.len = 0;
+
+	ret = bpf_load_hdr_opt(skops, &tcp_opt, TCPOLEN_SACK_PERM, 0);
+	if (ret != TCPOLEN_SACK_PERM ||
+	    !(skops->args[1] & BPF_SYNCOOKIE_SACK))
+		goto err;
+
+	tcp_opt.kind = TCPOPT_TIMESTAMP;
+	tcp_opt.len = 0;
+
+	ret = bpf_load_hdr_opt(skops, &tcp_opt, TCPOLEN_TIMESTAMP, 0);
+	if (ret != TCPOLEN_TIMESTAMP ||
+	    !(skops->args[1] & BPF_SYNCOOKIE_TS))
+		goto err;
+
+	if (((skops->skb_tcp_flags & (TCPHDR_ECE | TCPHDR_CWR)) !=
+	     (TCPHDR_ECE | TCPHDR_CWR)) ||
+	    !(skops->args[1] & BPF_SYNCOOKIE_ECN))
+		goto err;
+
+	return CG_OK;
+
+err:
+	return CG_ERR;
+}
+
+static siphash_key_t test_key_siphash = {
+	{ 0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL }
+};
+
+static __u32 cookie_hash(struct bpf_sock_ops *skops)
+{
+	return siphash_2u64((__u64)skops->remote_ip4 << 32 | skops->local_ip4,
+			    (__u64)skops->remote_port << 32 | skops->local_port,
+			    &test_key_siphash);
+}
+
+static const __u16 msstab[] = {
+	536,
+	1300,
+	1440,
+	1460,
+};
+
+#define COOKIE_BITS	8
+#define COOKIE_MASK	(((__u32)1 << COOKIE_BITS) - 1)
+
+/* Hash is calculated for each client and split into
+ * ISN and TS.
+ *
+ * ISN:
+ *
+ * MSB                                   LSB
+ * | 31 ... 8 | 7 6 | 5   | 4    | 3 2 1 0 |
+ * | Hash_1   | MSS | ECN | SACK | WScale  |
+ *
+ * TS:
+ *
+ * MSB                LSB
+ * | 31 ... 8 | 7 ... 0 |
+ * | Random   | Hash_2  |
+ */
+static void gen_syncookie(struct bpf_sock_ops *skops)
+{
+	__u16 mss = skops->args[0];
+	__u32 tstamp = 0;
+	__u32 cookie;
+	int mssind;
+
+	for (mssind = ARRAY_SIZE(msstab) - 1; mssind; mssind--)
+		if (mss >= msstab[mssind])
+			break;
+
+	cookie = cookie_hash(skops);
+
+	if (skops->args[1] & BPF_SYNCOOKIE_TS) {
+		tstamp = bpf_get_prandom_u32();
+		tstamp &= ~COOKIE_MASK;
+		tstamp |= cookie & COOKIE_MASK;
+	}
+
+	cookie &= ~COOKIE_MASK;
+	cookie |= mssind << 6;
+	cookie |= skops->args[1] & (BPF_SYNCOOKIE_ECN |
+				    BPF_SYNCOOKIE_SACK |
+				    BPF_SYNCOOKIE_WSCALE_MASK);
+
+	skops->replylong[0] = cookie;
+	skops->replylong[1] = tstamp;
+}
+
+static int check_syncookie(struct bpf_sock_ops *skops)
+{
+	__u32 cookie = cookie_hash(skops);
+	__u32 tstamp = skops->args[1];
+	__u8 mssind;
+
+	if (tstamp)
+		cookie -= tstamp & COOKIE_MASK;
+	else
+		cookie &= ~COOKIE_MASK;
+
+	cookie -= skops->args[0] & ~COOKIE_MASK;
+	if (cookie)
+		return CG_ERR;
+
+	mssind = (skops->args[0] & (3 << 6)) >> 6;
+	if (mssind >= ARRAY_SIZE(msstab))
+		return CG_ERR;
+
+	/* msstab[mssind]; does not compile ... */
+	skops->replylong[0] = msstab[3];
+	skops->replylong[1] = skops->args[0] & (BPF_SYNCOOKIE_ECN |
+						BPF_SYNCOOKIE_SACK |
+						BPF_SYNCOOKIE_WSCALE_MASK);
+
+	return CG_OK;
+}
+
+SEC("sockops")
+int syncookie(struct bpf_sock_ops *skops)
+{
+	int ret = CG_OK;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+		bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG);
+		break;
+	case BPF_SOCK_OPS_GEN_SYNCOOKIE_CB:
+		ret = assert_gen_syncookie_cb(skops);
+		if (ret)
+			gen_syncookie(skops);
+		break;
+	case BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB:
+		ret = check_syncookie(skops);
+		break;
+	}
+
+	return ret;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
index 56c9f8a3ad3d..3efca29a1394 100644
--- a/tools/testing/selftests/bpf/test_tcp_hdr_options.h
+++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
@@ -52,8 +52,14 @@ struct linum_err {
 #define TCPOPT_NOP		1
 #define TCPOPT_MSS		2
 #define TCPOPT_WINDOW		3
+#define TCPOPT_SACK_PERM	4
+#define TCPOPT_TIMESTAMP	8
 #define TCPOPT_EXP		254
 
+#define TCPOLEN_WINDOW		3
+#define TCPOLEN_SACK_PERM	2
+#define TCPOLEN_TIMESTAMP	10
+
 #define TCP_BPF_EXPOPT_BASE_LEN 4
 #define MAX_TCP_HDR_LEN		60
 #define MAX_TCP_OPTION_SPACE	40
@@ -81,7 +87,7 @@ struct tcp_opt {
 	__u8 kind;
 	__u8 len;
 	union {
-		__u8 data[4];
+		__u8 data[8];
 		__u32 data32;
 	};
 } __attribute__((packed));
-- 
2.30.2



* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2023-10-13 22:04 ` [PATCH v1 bpf-next 11/11] selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB Kuniyuki Iwashima
@ 2023-10-16 13:05 ` Daniel Borkmann
  2023-10-16 16:11   ` Kuniyuki Iwashima
  2023-10-16 14:19 ` Willem de Bruijn
  2023-10-17  5:53 ` Martin KaFai Lau
  13 siblings, 1 reply; 44+ messages in thread
From: Daniel Borkmann @ 2023-10-16 13:05 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Ahern, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko
  Cc: Kuniyuki Iwashima, bpf, netdev

On 10/14/23 12:04 AM, Kuniyuki Iwashima wrote:
> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> for the connection request until a valid ACK is responded to the SYN+ACK.
> 
> The cookie contains two kinds of host-specific bits, a timestamp and
> secrets, so it can be validated only by the generator.  It means SYN
> Cookie consumes network resources between the client and the server;
> intermediate nodes must remember which nodes to route ACK for the cookie.
> 
> SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
> the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
> backend server and completes another 3WHS.  However, since the server's
> ISN differs from the cookie, the proxy must manage the ISN mappings and
> fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
> node is down, all the connections through it are also down.  Keeping a
> state at proxy is painful from that perspective.
> 
> At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
> Our SYN Proxy consists of the front proxy layer and the backend kernel
> module.  (See slides of netconf [0], p6 - p15)
> 
> The cookie that SYN Proxy generates differs from the kernel's cookie in
> that it contains a secret (called rolling salt) (i) shared by all the proxy
> nodes so that any node can validate ACK and (ii) updated periodically so
> that old cookies cannot be validated.  Also, ISN contains WScale, SACK, and
> ECN, not in TS val.  This is not to sacrifice any connection quality, where
> some customers turn off the timestamp option due to retro CVE.
> 
> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> server.  Our kernel module works at Netfilter input/output hooks and first
> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> for SYN+ACK, it looks up the corresponding request socket and overwrites
> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> complete 3WHS with the original ACK as is.
> 
> This way, our SYN Proxy does not manage the ISN mappings and can stay
> stateless.  It's working very well for high-bandwidth services like
> multiple Tbps, but we are looking for a way to drop the dirty hack and
> further optimise the sequences.
> 
> If we could validate an arbitrary SYN Cookie on the backend server with
> BPF, the proxy would not need to restore SYN or pass it.  After validating
> ACK, the proxy node just needs to forward it, and then the server can do
> the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
> and create a connection from the ACK.
> 
> This series adds two SOCK_OPS hooks to generate and validate arbitrary
> SYN Cookie.  Each hook is invoked if BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG is
> set to the listening socket in advance by bpf_sock_ops_cb_flags_set().
> 
> The user interface looks like this:
> 
>    BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> 
>      input
>      |- bpf_sock_ops.sk           : 4-tuple
>      |- bpf_sock_ops.skb          : TCP header
>      |- bpf_sock_ops.args[0]      : MSS
>      `- bpf_sock_ops.args[1]      : BPF_SYNCOOKIE_XXX flags
> 
>      output
>      |- bpf_sock_ops.replylong[0] : ISN (SYN Cookie) ------.
>      `- bpf_sock_ops.replylong[1] : TS value -----------.  |
>                                                         |  |
>    BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB                      |  |
>                                                         |  |
>      input                                              |  |
>      |- bpf_sock_ops.sk           : 4-tuple             |  |
>      |- bpf_sock_ops.skb          : TCP header          |  |
>      |- bpf_sock_ops.args[0]      : ISN (SYN Cookie) <-----'
>      `- bpf_sock_ops.args[1]      : TS value <----------'
> 
>      output
>      |- bpf_sock_ops.replylong[0] : MSS
>      `- bpf_sock_ops.replylong[1] : BPF_SYNCOOKIE_XXX flags
> 
> To establish a connection from SYN Cookie, BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB
> hook must set a valid MSS to bpf_sock_ops.replylong[0], meaning that
> BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook must encode MSS to ISN or TS val to be
> restored in the validation hook.
> 
> If WScale, SACK, and ECN are detected to be available in the SYN packet, the
> corresponding flags are passed to args[1] of BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> so that bpf prog need not parse the TCP header.  The same flags can be set
> to replylong[1] of BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB to enable each feature
> on the connection.
> 
> For details, please see each patch.  Here's an overview:
> 
>    patch 1 - 4 : Misc cleanup
>    patch 5, 6  : Add SOCK_OPS hook (only ISN is available here)
>    patch 7, 8  : Make TS val available as the second cookie storage
>    patch 9, 10 : Make WScale, SACK, and ECN configurable from ACK
>    patch 11    : selftest, need some help from BPF experts...
> 
> [0]: https://netdev.bots.linux.dev/netconf/2023/kuniyuki.pdf

FYI, just as quick feedback: this fails the BPF CI selftests:

https://github.com/kernel-patches/bpf/actions/runs/6513838231/job/17694669376

Notice: Success: 427/3396, Skipped: 24, Failed: 1
Error: #274 tcpbpf_user
   Error: #274 tcpbpf_user
   test_tcpbpf_user:PASS:open and load skel 0 nsec
   test_tcpbpf_user:PASS:test__join_cgroup(/tcpbpf-user-test) 0 nsec
   test_tcpbpf_user:PASS:attach_cgroup(bpf_testcb) 0 nsec
   run_test:PASS:start_server 0 nsec
   run_test:PASS:connect_to_fd(listen_fd) 0 nsec
   run_test:PASS:accept(listen_fd) 0 nsec
   run_test:PASS:send(cli_fd) 0 nsec
   run_test:PASS:recv(accept_fd) 0 nsec
   run_test:PASS:send(accept_fd) 0 nsec
   run_test:PASS:recv(cli_fd) 0 nsec
   run_test:PASS:recv(cli_fd) for fin 0 nsec
   run_test:PASS:recv(accept_fd) for fin 0 nsec
   verify_result:PASS:event_map 0 nsec
   verify_result:PASS:bytes_received 0 nsec
   verify_result:PASS:bytes_acked 0 nsec
   verify_result:PASS:data_segs_in 0 nsec
   verify_result:PASS:data_segs_out 0 nsec
   verify_result:FAIL:bad_cb_test_rv unexpected bad_cb_test_rv: actual 0 != expected 128
   verify_result:PASS:good_cb_test_rv 0 nsec
   verify_result:PASS:num_listen 0 nsec
   verify_result:PASS:num_close_events 0 nsec
   verify_result:PASS:tcp_save_syn 0 nsec
   verify_result:PASS:tcp_saved_syn 0 nsec
   verify_result:PASS:window_clamp_client 0 nsec
   verify_result:PASS:window_clamp_server 0 nsec


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (11 preceding siblings ...)
  2023-10-16 13:05 ` [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Daniel Borkmann
@ 2023-10-16 14:19 ` Willem de Bruijn
  2023-10-16 16:46   ` Kuniyuki Iwashima
  2023-10-17  5:53 ` Martin KaFai Lau
  13 siblings, 1 reply; 44+ messages in thread
From: Willem de Bruijn @ 2023-10-16 14:19 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, bpf, netdev

Kuniyuki Iwashima wrote:
> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> for the connection request until a valid ACK is responded to the SYN+ACK.
> 
> The cookie contains two kinds of host-specific bits, a timestamp and
> secrets, so it can be validated only by the generator.  It means SYN
> Cookie consumes network resources between the client and the server;
> intermediate nodes must remember which nodes to route ACK for the cookie.
> 
> SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
> the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
> backend server and completes another 3WHS.  However, since the server's
> ISN differs from the cookie, the proxy must manage the ISN mappings and
> fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
> node is down, all the connections through it are also down.  Keeping a
> state at proxy is painful from that perspective.
> 
> At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
> Our SYN Proxy consists of the front proxy layer and the backend kernel
> module.  (See slides of netconf [0], p6 - p15)
> 
> The cookie that SYN Proxy generates differs from the kernel's cookie in
> that it contains a secret (called rolling salt) (i) shared by all the proxy
> nodes so that any node can validate ACK and (ii) updated periodically so
> that old cookies cannot be validated.  Also, ISN contains WScale, SACK, and
> ECN, not in TS val.  This is not to sacrifice any connection quality, where
> some customers turn off the timestamp option due to retro CVE.

If easier: I think it should be possible to make the host secret
readable and writable with CAP_NET_ADMIN, to allow synchronizing
between hosts.

For similar reasons as suggested here, a rolling salt might be
useful more broadly too.
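
To make the rolling-salt idea concrete, here is a minimal user-space sketch
under stated assumptions; it is not the kernel's syncookie code and not the
AWS proxy's implementation.  mix64() is a toy mixer standing in for a real
keyed hash such as SipHash, and struct rolling_salt, salt_rotate(),
cookie_gen() and cookie_check() are hypothetical names.  The validator
accepts cookies made with the current or the previous salt, so a cookie
stays valid across one rotation and expires after two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 64-bit mixer; a stand-in for a real keyed hash, not a secure one. */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* Salt state that would be shared by every node validating cookies. */
struct rolling_salt {
	uint64_t cur;
	uint64_t prev;
};

/* Install a fresh salt; the old current salt stays valid for one period. */
static void salt_rotate(struct rolling_salt *s, uint64_t fresh)
{
	assert(s != NULL);

	s->prev = s->cur;
	s->cur = fresh;
}

/* Derive a 32-bit cookie from a flow identifier and a salt. */
static uint32_t cookie_gen(uint64_t flow, uint64_t salt)
{
	return (uint32_t)mix64(flow ^ mix64(salt));
}

/* Accept a cookie generated with the current or the previous salt. */
static bool cookie_check(const struct rolling_salt *s, uint64_t flow,
			 uint32_t cookie)
{
	return cookie == cookie_gen(flow, s->cur) ||
	       cookie == cookie_gen(flow, s->prev);
}
```

Any node that holds the same two salts can validate the ACK, which is the
property the cover letter relies on; how the salts would be distributed
between hosts is left open here.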


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-16 13:05 ` [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Daniel Borkmann
@ 2023-10-16 16:11   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-16 16:11 UTC (permalink / raw)
  To: daniel
  Cc: andrii, ast, bpf, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu,
	martin.lau, mykolal, netdev, pabeni, sdf, song, yonghong.song

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 16 Oct 2023 15:05:25 +0200
> On 10/14/23 12:04 AM, Kuniyuki Iwashima wrote:
> > Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> > for the connection request until a valid ACK is responded to the SYN+ACK.
> > 
> > The cookie contains two kinds of host-specific bits, a timestamp and
> > secrets, so it can be validated only by the generator.  It means SYN
> > Cookie consumes network resources between the client and the server;
> > intermediate nodes must remember which nodes to route ACK for the cookie.
> > 
> > SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
> > the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
> > backend server and completes another 3WHS.  However, since the server's
> > ISN differs from the cookie, the proxy must manage the ISN mappings and
> > fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
> > node is down, all the connections through it are also down.  Keeping a
> > state at proxy is painful from that perspective.
> > 
> > At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
> > Our SYN Proxy consists of the front proxy layer and the backend kernel
> > module.  (See slides of netconf [0], p6 - p15)
> > 
> > The cookie that SYN Proxy generates differs from the kernel's cookie in
> > that it contains a secret (called rolling salt) (i) shared by all the proxy
> > nodes so that any node can validate ACK and (ii) updated periodically so
> > that old cookies cannot be validated.  Also, ISN contains WScale, SACK, and
> > ECN, not in TS val.  This is not to sacrifice any connection quality, where
> > some customers turn off the timestamp option due to retro CVE.
> > 
> > After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> > server.  Our kernel module works at Netfilter input/output hooks and first
> > feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> > for SYN+ACK, it looks up the corresponding request socket and overwrites
> > tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> > complete 3WHS with the original ACK as is.
> > 
> > This way, our SYN Proxy does not manage the ISN mappings and can stay
> > stateless.  It's working very well for high-bandwidth services like
> > multiple Tbps, but we are looking for a way to drop the dirty hack and
> > further optimise the sequences.
> > 
> > If we could validate an arbitrary SYN Cookie on the backend server with
> > BPF, the proxy would not need to restore SYN or pass it.  After validating
> > ACK, the proxy node just needs to forward it, and then the server can do
> > the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
> > and create a connection from the ACK.
> > 
> > This series adds two SOCK_OPS hooks to generate and validate arbitrary
> > SYN Cookie.  Each hook is invoked if BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG is
> > set to the listening socket in advance by bpf_sock_ops_cb_flags_set().
> > 
> > The user interface looks like this:
> > 
> >    BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> > 
> >      input
> >      |- bpf_sock_ops.sk           : 4-tuple
> >      |- bpf_sock_ops.skb          : TCP header
> >      |- bpf_sock_ops.args[0]      : MSS
> >      `- bpf_sock_ops.args[1]      : BPF_SYNCOOKIE_XXX flags
> > 
> >      output
> >      |- bpf_sock_ops.replylong[0] : ISN (SYN Cookie) ------.
> >      `- bpf_sock_ops.replylong[1] : TS value -----------.  |
> >                                                         |  |
> >    BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB                      |  |
> >                                                         |  |
> >      input                                              |  |
> >      |- bpf_sock_ops.sk           : 4-tuple             |  |
> >      |- bpf_sock_ops.skb          : TCP header          |  |
> >      |- bpf_sock_ops.args[0]      : ISN (SYN Cookie) <-----'
> >      `- bpf_sock_ops.args[1]      : TS value <----------'
> > 
> >      output
> >      |- bpf_sock_ops.replylong[0] : MSS
> >      `- bpf_sock_ops.replylong[1] : BPF_SYNCOOKIE_XXX flags
> > 
> > To establish a connection from SYN Cookie, BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB
> > hook must set a valid MSS to bpf_sock_ops.replylong[0], meaning that
> > BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook must encode MSS to ISN or TS val to be
> > restored in the validation hook.
> > 
> > If WScale, SACK, and ECN are detected to be available in SYN packet, the
> > corresponding flags are passed to args[1] of BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> > so that bpf prog need not parse the TCP header.  The same flags can be set
> > to replylong[1] of BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB to enable each feature
> > on the connection.
> > 
> > For details, please see each patch.  Here's an overview:
> > 
> >    patch 1 - 4 : Misc cleanup
> >    patch 5, 6  : Add SOCK_OPS hook (only ISN is available here)
> >    patch 7, 8  : Make TS val available as the second cookie storage
> >    patch 9, 10 : Make WScale, SACK, and ECN configurable from ACK
> >    patch 11    : selftest, need some help from BPF experts...
> > 
> > [0]: https://netdev.bots.linux.dev/netconf/2023/kuniyuki.pdf
> 
> Fyi, just as quick feedback, this fails BPF CI selftests :
> 
> https://github.com/kernel-patches/bpf/actions/runs/6513838231/job/17694669376
> 
> Notice: Success: 427/3396, Skipped: 24, Failed: 1
> Error: #274 tcpbpf_user
>    Error: #274 tcpbpf_user
>    test_tcpbpf_user:PASS:open and load skel 0 nsec
>    test_tcpbpf_user:PASS:test__join_cgroup(/tcpbpf-user-test) 0 nsec
>    test_tcpbpf_user:PASS:attach_cgroup(bpf_testcb) 0 nsec
>    run_test:PASS:start_server 0 nsec
>    run_test:PASS:connect_to_fd(listen_fd) 0 nsec
>    run_test:PASS:accept(listen_fd) 0 nsec
>    run_test:PASS:send(cli_fd) 0 nsec
>    run_test:PASS:recv(accept_fd) 0 nsec
>    run_test:PASS:send(accept_fd) 0 nsec
>    run_test:PASS:recv(cli_fd) 0 nsec
>    run_test:PASS:recv(cli_fd) for fin 0 nsec
>    run_test:PASS:recv(accept_fd) for fin 0 nsec
>    verify_result:PASS:event_map 0 nsec
>    verify_result:PASS:bytes_received 0 nsec
>    verify_result:PASS:bytes_acked 0 nsec
>    verify_result:PASS:data_segs_in 0 nsec
>    verify_result:PASS:data_segs_out 0 nsec
>    verify_result:FAIL:bad_cb_test_rv unexpected bad_cb_test_rv: actual 0 != expected 128

The test hard-codes 128 (0x80) as an unsupported flag, but this series
defines that bit.  It should use BPF_SOCK_OPS_ALL_CB_FLAGS + 1 instead
so that we need not update the test for each SOCK_OPS addition.
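
A minimal sketch of why the hard-coded value breaks, assuming (as the
failing log suggests) that bpf_sock_ops_cb_flags_set() returns the
requested bits that fall outside BPF_SOCK_OPS_ALL_CB_FLAGS.  The two
mask values and the name rejected_cb_flags() are illustrative
stand-ins, not the kernel's definitions:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the mask before and after this series. */
#define OLD_ALL_CB_FLAGS 0x7Fu	/* bits 0..6 defined */
#define NEW_ALL_CB_FLAGS 0xFFu	/* bit 7 (SYNCOOKIE) also defined */

/* Models the helper's return value: the bits it refused to set. */
static uint32_t rejected_cb_flags(uint32_t all_flags, uint32_t requested)
{
	return requested & ~all_flags;
}
```

With the old mask, requesting 0x80 yields 0x80; with the new mask it
yields 0, which matches the failure above.  Requesting mask + 1 asks
for the lowest undefined bit as long as the mask is a contiguous run
of low bits.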

I'll include this diff in the next revision.

Thank you!

---8<---
diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
index 7e8fe1bad03f..e4849d2a2956 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
@@ -26,7 +26,8 @@ static void verify_result(struct tcpbpf_globals *result)
 	ASSERT_EQ(result->bytes_acked, 1002, "bytes_acked");
 	ASSERT_EQ(result->data_segs_in, 1, "data_segs_in");
 	ASSERT_EQ(result->data_segs_out, 1, "data_segs_out");
-	ASSERT_EQ(result->bad_cb_test_rv, 0x80, "bad_cb_test_rv");
+	ASSERT_EQ(result->bad_cb_test_rv, BPF_SOCK_OPS_ALL_CB_FLAGS + 1,
+		  "bad_cb_test_rv");
 	ASSERT_EQ(result->good_cb_test_rv, 0, "good_cb_test_rv");
 	ASSERT_EQ(result->num_listen, 1, "num_listen");
 
diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
index cf7ed8cbb1fe..52da66d77fd6 100644
--- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
@@ -103,7 +103,8 @@ int bpf_testcb(struct bpf_sock_ops *skops)
 		break;
 	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
 		/* Test failure to set largest cb flag (assumes not defined) */
-		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
+		global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
+								  BPF_SOCK_OPS_ALL_CB_FLAGS + 1);
 		/* Set callback */
 		global.good_cb_test_rv = bpf_sock_ops_cb_flags_set(skops,
 						 BPF_SOCK_OPS_STATE_CB_FLAG);
---8<---


>    verify_result:PASS:good_cb_test_rv 0 nsec
>    verify_result:PASS:num_listen 0 nsec
>    verify_result:PASS:num_close_events 0 nsec
>    verify_result:PASS:tcp_save_syn 0 nsec
>    verify_result:PASS:tcp_saved_syn 0 nsec
>    verify_result:PASS:window_clamp_client 0 nsec
>    verify_result:PASS:window_clamp_server 0 nsec

* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-16 14:19 ` Willem de Bruijn
@ 2023-10-16 16:46   ` Kuniyuki Iwashima
  2023-10-16 18:41     ` Willem de Bruijn
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-16 16:46 UTC (permalink / raw)
  To: willemdebruijn.kernel
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu,
	martin.lau, mykolal, netdev, pabeni, sdf, song, yonghong.song

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 16 Oct 2023 10:19:18 -0400
> Kuniyuki Iwashima wrote:
> > Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> > for the connection request until a valid ACK is responded to the SYN+ACK.
> > 
> > The cookie contains two kinds of host-specific bits, a timestamp and
> > secrets, so only can it be validated by the generator.  It means SYN
> > Cookie consumes network resources between the client and the server;
> > intermediate nodes must remember which nodes to route ACK for the cookie.
> > 
> > SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
> > the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
> > backend server and completes another 3WHS.  However, since the server's
> > ISN differs from the cookie, the proxy must manage the ISN mappings and
> > fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
> > node is down, all the connections through it are also down.  Keeping a
> > state at proxy is painful from that perspective.
> > 
> > At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
> > Our SYN Proxy consists of the front proxy layer and the backend kernel
> > module.  (See slides of netconf [0], p6 - p15)
> > 
> > The cookie that SYN Proxy generates differs from the kernel's cookie in
> > that it contains a secret (called rolling salt) (i) shared by all the proxy
> > nodes so that any node can validate ACK and (ii) updated periodically so
> > that old cookies cannot be validated.  Also, ISN contains WScale, SACK, and
> > ECN, not in TS val.  This is not to sacrifice any connection quality, where
> > some customers turn off the timestamp option due to retro CVE.
> 
> If easier: I think it should be possible to make the host secret
> readable and writable with CAP_NET_ADMIN, to allow synchronizing
> between hosts.

I think the idea is doable for syncookie_secret and syncookie6_secret.
However, the cookie timestamp is derived from jiffies, which cannot be
written, so it would still differ between hosts.

[ At netconf, I answered that sharing secrets would resolve our issue,
  but I was wrong. ]


> For similar reasons as suggested here, a rolling salt might be
> useful more broadly too.

Maybe we need not use jiffies and can create a worker to update the
secret periodically if it's not configured manually.

The problem here would be that we need to update/read u64[4] atomically
if we want to use SipHash or HSipHash.  Maybe this also can be changed.

But we still want to use BPF, as we need to encode (at least) the WS
and SACK bits in ISN rather than in TS, and to use MSS candidates that
differ from msstab.
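
As a rough illustration of carrying these bits in ISN, the sketch below
follows the layout used by the selftest in patch 11 of this series (low
8 bits of ISN: 2-bit MSS index, ECN, SACK, 4-bit WScale).  The struct
and function names are hypothetical, and the upper 24 bits are simply
taken from a caller-supplied hash:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the options carried in the cookie. */
struct cookie_opts {
	uint8_t mss_index;	/* 0..3, index into a private MSS table */
	bool ecn;
	bool sack;
	uint8_t wscale;		/* 4 bits */
};

/* Keep the upper 24 bits of the hash and pack options into bits 7..0. */
static uint32_t cookie_pack(uint32_t hash, const struct cookie_opts *o)
{
	uint32_t low = ((uint32_t)(o->mss_index & 0x3) << 6) |
		       ((uint32_t)o->ecn << 5) |
		       ((uint32_t)o->sack << 4) |
		       (uint32_t)(o->wscale & 0xF);

	return (hash & 0xFFFFFF00u) | low;
}

/* Recover the options from the low 8 bits of a received ISN. */
static struct cookie_opts cookie_unpack(uint32_t isn)
{
	struct cookie_opts o = {
		.mss_index = (uint8_t)((isn >> 6) & 0x3),
		.ecn = ((isn >> 5) & 1) != 0,
		.sack = ((isn >> 4) & 1) != 0,
		.wscale = (uint8_t)(isn & 0xF),
	};

	return o;
}
```

The validator would still have to recompute the hash and compare the
upper bits before trusting the unpacked options.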

Also, in our use case, the validation of the cookie itself is done in
the front proxy layer, and the kernel does more lightweight validation,
like checking if the cookie is forwarded from trusted nodes.  Then, we
can prevent invalid ACK from flowing through the backend and consuming
some networking entries, and the backend need not do full validation.
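
One way such a lightweight check could look is a prefix match on the
forwarding node's address.  This is only a sketch: it assumes the
proxy's IPv4 address is visible to the validator (the thread does not
say how), and the struct, table, and function names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trusted prefix, address in host byte order. */
struct prefix {
	uint32_t addr;
	uint8_t len;	/* 0..32 */
};

static bool prefix_match(const struct prefix *p, uint32_t addr)
{
	/* A zero-length prefix matches everything; avoid shifting by 32. */
	uint32_t mask = p->len ? ~0u << (32 - p->len) : 0;

	return (addr & mask) == (p->addr & mask);
}

/* True if saddr falls in any of the n trusted proxy prefixes. */
static bool from_trusted_proxy(const struct prefix *tbl, size_t n,
			       uint32_t saddr)
{
	for (size_t i = 0; i < n; i++) {
		if (prefix_match(&tbl[i], saddr))
			return true;
	}

	return false;
}
```

An ACK that fails this check would be rejected without any hash
computation on the backend.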

With BPF, we get such flexibility in encoding and validation, and
keeping the cookie generation algorithm private could be good for
security.

* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-16 16:46   ` Kuniyuki Iwashima
@ 2023-10-16 18:41     ` Willem de Bruijn
  0 siblings, 0 replies; 44+ messages in thread
From: Willem de Bruijn @ 2023-10-16 18:41 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, martin.lau,
	mykolal, netdev, pabeni, sdf, song, yonghong.song

On Mon, Oct 16, 2023 at 12:46 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Mon, 16 Oct 2023 10:19:18 -0400
> > Kuniyuki Iwashima wrote:
> > > Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> > > for the connection request until a valid ACK is responded to the SYN+ACK.
> > >
> > > The cookie contains two kinds of host-specific bits, a timestamp and
> > > secrets, so only can it be validated by the generator.  It means SYN
> > > Cookie consumes network resources between the client and the server;
> > > intermediate nodes must remember which nodes to route ACK for the cookie.
> > >
> > > SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
> > > the edge network.  After SYN Proxy completes 3WHS, it forwards SYN to the
> > > backend server and completes another 3WHS.  However, since the server's
> > > ISN differs from the cookie, the proxy must manage the ISN mappings and
> > > fix up SEQ/ACK numbers in every packet for each connection.  If a proxy
> > > node is down, all the connections through it are also down.  Keeping a
> > > state at proxy is painful from that perspective.
> > >
> > > At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
> > > Our SYN Proxy consists of the front proxy layer and the backend kernel
> > > module.  (See slides of netconf [0], p6 - p15)
> > >
> > > The cookie that SYN Proxy generates differs from the kernel's cookie in
> > > that it contains a secret (called rolling salt) (i) shared by all the proxy
> > > nodes so that any node can validate ACK and (ii) updated periodically so
> > > that old cookies cannot be validated.  Also, ISN contains WScale, SACK, and
> > > ECN, not in TS val.  This is not to sacrifice any connection quality, where
> > > some customers turn off the timestamp option due to retro CVE.
> >
> > If easier: I think it should be possible to make the host secret
> > readable and writable with CAP_NET_ADMIN, to allow synchronizing
> > between hosts.
>
> I think the idea is doable for syncookie_secret and syncookie6_secret.
> However, the cookie timestamp is generated based on jiffies that cannot
> be written.
>
> [ I answered sharing secrets would resolve our issue at netconf, but
>   I was wrong. ]
>
>
> > For similar reasons as suggested here, a rolling salt might be
> > useful more broadly too.
>
> Maybe we need not use jiffies and can create a worker to update the
> secret periodically if it's not configured manually.
>
> The problem here would be that we need to update/read u64[4] atomically
> if we want to use SipHash or HSipHash.  Maybe this also can be changed.
>
> But, we still want to use BPF as we need to encode (at least) WS and
> SACK bits in ISN, not TS and use different MSS candidates rather than
> msstab.
>
> Also, in our use case, the validation for cookie itself is done in
> the front proxy layer, and the kernel will do more light-weight
> validation like checking if the cookie is forwarded from trusted
> nodes.  Then, we can prevent invalid ACK from flowing through the
> backend and consuming some networking entries, and the backend need
> not do full validation.
>
> With BPF, we can get such flexibility at encoding and validation, and
> making cookie generation algorithm private could be good for security.

Thanks for that context. Sounds like it indeed would not be a small
change to support your use case in the existing syncookie C code,
then.

* Re: [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation SOCK_OPS hook.
  2023-10-13 22:04 ` [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation " Kuniyuki Iwashima
@ 2023-10-16 20:38   ` Stanislav Fomichev
  2023-10-16 22:02     ` Kuniyuki Iwashima
  2023-10-17 16:52   ` Kuniyuki Iwashima
  1 sibling, 1 reply; 44+ messages in thread
From: Stanislav Fomichev @ 2023-10-16 20:38 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Kuniyuki Iwashima, bpf, netdev

On 10/13, Kuniyuki Iwashima wrote:
> This patch adds a new SOCK_OPS hook to validate arbitrary SYN Cookie.
> 
> When the kernel receives ACK for SYN Cookie, the hook is invoked with
> bpf_sock_ops.op == BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB if the listener has
> BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG set by bpf_sock_ops_cb_flags_set().
> 
> The BPF program can access the following information to validate ISN:
> 
>   bpf_sock_ops.sk      : 4-tuple
>   bpf_sock_ops.skb     : TCP header
>   bpf_sock_ops.args[0] : ISN
> 
> The program must decode MSS and set it to bpf_sock_ops.replylong[0].
> 
> By default, the kernel validates SYN Cookie before allocating reqsk, but
> the hook is invoked after allocating reqsk to keep the user interface
> consistent with BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/tcp.h              | 12 ++++++
>  include/uapi/linux/bpf.h       | 20 +++++++---
>  net/ipv4/syncookies.c          | 73 +++++++++++++++++++++++++++-------
>  net/ipv6/syncookies.c          | 44 +++++++++++++-------
>  tools/include/uapi/linux/bpf.h | 20 +++++++---
>  5 files changed, 130 insertions(+), 39 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 676618c89bb7..90d95acdc34a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2158,6 +2158,18 @@ static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
>  	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
>  	return ops->cookie_init_seq(skb, mss);
>  }
> +
> +#ifdef CONFIG_CGROUP_BPF
> +int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
> +			   struct sk_buff *skb);
> +#else
> +static inline int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
> +					 struct sk_buff *skb)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #else
>  static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
>  					 const struct sock *sk, struct sk_buff *skb,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d3cc530613c0..e6f1507d7895 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6738,13 +6738,16 @@ enum {
>  	 * options first before the BPF program does.
>  	 */
>  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> -	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
> +	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK
> +	 * and validates ACK for SYN Cookie.
>  	 *
> -	 * The bpf prog will be called to encode MSS into SYN Cookie with
> -	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
> +	 * The bpf prog will be first called to encode MSS into SYN Cookie
> +	 * with sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.  Then, the
> +	 * bpf prog will be called to decode MSS from SYN Cookie with
> +	 * sock_ops->op == BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
>  	 *
> -	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
> -	 * input and output.
> +	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
> +	 * BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB for input and output.
>  	 */
>  	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
>  /* Mask of all currently supported cb flags */
> @@ -6868,6 +6871,13 @@ enum {
>  					 *
>  					 * replylong[0]: ISN
>  					 */
> +	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and set
> +					 * MSS.
> +					 *
> +					 * args[0]: ISN
> +					 *
> +					 * replylong[0]: MSS
> +					 */
>  };
>  
>  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 514f1a4abdee..b1dd415863ff 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -317,6 +317,37 @@ struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
>  }
>  EXPORT_SYMBOL_GPL(cookie_tcp_reqsk_alloc);
>  
> +#if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
> +int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
> +{
> +	struct bpf_sock_ops_kern sock_ops;
> +
> +	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> +
> +	sock_ops.op = BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB;
> +	sock_ops.sk = req_to_sk(req);
> +	sock_ops.args[0] = tcp_rsk(req)->snt_isn;
> +
> +	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
> +
> +	if (BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk))
> +		goto err;
> +
> +	if (!sock_ops.replylong[0])
> +		goto err;
> +
> +	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);

I don't see LINUX_MIB_SYNCOOKIESSENT being incremented in the
previous patch, so maybe also don't touch the mib here? The bpf
program can do the counting if needed?

Or, alternatively, add LINUX_MIB_SYNCOOKIESSENT to
the BPF_SOCK_OPS_GEN_SYNCOOKIE_CB path?

* Re: [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation SOCK_OPS hook.
  2023-10-16 20:38   ` Stanislav Fomichev
@ 2023-10-16 22:02     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-16 22:02 UTC (permalink / raw)
  To: sdf
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu,
	martin.lau, mykolal, netdev, pabeni, song, yonghong.song

From: Stanislav Fomichev <sdf@google.com>
Date: Mon, 16 Oct 2023 13:38:25 -0700
> On 10/13, Kuniyuki Iwashima wrote:
> > This patch adds a new SOCK_OPS hook to validate arbitrary SYN Cookie.
> > 
> > When the kernel receives ACK for SYN Cookie, the hook is invoked with
> > bpf_sock_ops.op == BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB if the listener has
> > BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG set by bpf_sock_ops_cb_flags_set().
> > 
> > The BPF program can access the following information to validate ISN:
> > 
> >   bpf_sock_ops.sk      : 4-tuple
> >   bpf_sock_ops.skb     : TCP header
> >   bpf_sock_ops.args[0] : ISN
> > 
> > The program must decode MSS and set it to bpf_sock_ops.replylong[0].
> > 
> > By default, the kernel validates SYN Cookie before allocating reqsk, but
> > the hook is invoked after allocating reqsk to keep the user interface
> > consistent with BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/tcp.h              | 12 ++++++
> >  include/uapi/linux/bpf.h       | 20 +++++++---
> >  net/ipv4/syncookies.c          | 73 +++++++++++++++++++++++++++-------
> >  net/ipv6/syncookies.c          | 44 +++++++++++++-------
> >  tools/include/uapi/linux/bpf.h | 20 +++++++---
> >  5 files changed, 130 insertions(+), 39 deletions(-)
> > 
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 676618c89bb7..90d95acdc34a 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -2158,6 +2158,18 @@ static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
> >  	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
> >  	return ops->cookie_init_seq(skb, mss);
> >  }
> > +
> > +#ifdef CONFIG_CGROUP_BPF
> > +int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
> > +			   struct sk_buff *skb);
> > +#else
> > +static inline int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req,
> > +					 struct sk_buff *skb)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  #else
> >  static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
> >  					 const struct sock *sk, struct sk_buff *skb,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index d3cc530613c0..e6f1507d7895 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -6738,13 +6738,16 @@ enum {
> >  	 * options first before the BPF program does.
> >  	 */
> >  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> > -	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
> > +	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK
> > +	 * and validates ACK for SYN Cookie.
> >  	 *
> > -	 * The bpf prog will be called to encode MSS into SYN Cookie with
> > -	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
> > +	 * The bpf prog will be first called to encode MSS into SYN Cookie
> > +	 * with sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.  Then, the
> > +	 * bpf prog will be called to decode MSS from SYN Cookie with
> > +	 * sock_ops->op == BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
> >  	 *
> > -	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
> > -	 * input and output.
> > +	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
> > +	 * BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB for input and output.
> >  	 */
> >  	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
> >  /* Mask of all currently supported cb flags */
> > @@ -6868,6 +6871,13 @@ enum {
> >  					 *
> >  					 * replylong[0]: ISN
> >  					 */
> > +	BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB,/* Validate SYN Cookie and set
> > +					 * MSS.
> > +					 *
> > +					 * args[0]: ISN
> > +					 *
> > +					 * replylong[0]: MSS
> > +					 */
> >  };
> >  
> >  /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> > index 514f1a4abdee..b1dd415863ff 100644
> > --- a/net/ipv4/syncookies.c
> > +++ b/net/ipv4/syncookies.c
> > @@ -317,6 +317,37 @@ struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
> >  }
> >  EXPORT_SYMBOL_GPL(cookie_tcp_reqsk_alloc);
> >  
> > +#if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
> > +int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
> > +{
> > +	struct bpf_sock_ops_kern sock_ops;
> > +
> > +	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> > +
> > +	sock_ops.op = BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB;
> > +	sock_ops.sk = req_to_sk(req);
> > +	sock_ops.args[0] = tcp_rsk(req)->snt_isn;
> > +
> > +	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
> > +
> > +	if (BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk))
> > +		goto err;
> > +
> > +	if (!sock_ops.replylong[0])
> > +		goto err;
> > +
> > +	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
> 
> I don't see LINUX_MIB_SYNCOOKIESSENT being incremented in the
> previous patch, so maybe also don't touch the mib here? The bpf
> program can do the counting if needed?
> 
> Or, alternatively, add LINUX_MIB_SYNCOOKIESSENT to
> the BPF_SOCK_OPS_GEN_SYNCOOKIE_CB path?

Good catch!

I skipped calling tcp_synq_overflow() in the previous patch but should
have incremented LINUX_MIB_SYNCOOKIESSENT.  Will fix in v2.

Thanks!


* Re: [PATCH v1 bpf-next 03/11] tcp: Clean up goto labels in cookie_v[46]_check().
  2023-10-13 22:04 ` [PATCH v1 bpf-next 03/11] tcp: Clean up goto labels " Kuniyuki Iwashima
@ 2023-10-17  0:00   ` Kui-Feng Lee
  2023-10-17  0:30     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Kui-Feng Lee @ 2023-10-17  0:00 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Ahern, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko
  Cc: Kuniyuki Iwashima, bpf, netdev



On 10/13/23 15:04, Kuniyuki Iwashima wrote:
> We will add a SOCK_OPS hook to validate SYN Cookie.
> 
> We invoke the hook after allocating reqsk.  In case it fails,
> we will respond with RST instead of just dropping the ACK.
> 
> Then, there would be more duplicated error handling patterns.
> To avoid that, let's clean up goto labels.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>   net/ipv4/syncookies.c | 22 +++++++++++-----------
>   net/ipv6/syncookies.c |  4 ++--
>   2 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 64280cf42667..b0cf6f4d66d8 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -369,11 +369,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>   	if (!cookie_timestamp_decode(net, &tcp_opt))
>   		goto out;
>   
> -	ret = NULL;
>   	req = cookie_tcp_reqsk_alloc(&tcp_request_sock_ops,
>   				     &tcp_request_sock_ipv4_ops, sk, skb);
>   	if (!req)
> -		goto out;
> +		goto out_drop;
>   
>   	ireq = inet_rsk(req);
>   	treq = tcp_rsk(req);
> @@ -405,10 +404,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>   	 */
>   	RCU_INIT_POINTER(ireq->ireq_opt, tcp_v4_save_options(net, skb));
>   
> -	if (security_inet_conn_request(sk, skb, req)) {
> -		reqsk_free(req);
> -		goto out;
> -	}
> +	if (security_inet_conn_request(sk, skb, req))
> +		goto out_free;
>   
>   	req->num_retrans = 0;
>   
> @@ -425,10 +422,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>   			   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
>   	security_req_classify_flow(req, flowi4_to_flowi_common(&fl4));
>   	rt = ip_route_output_key(net, &fl4);
> -	if (IS_ERR(rt)) {
> -		reqsk_free(req);
> -		goto out;
> -	}
> +	if (IS_ERR(rt))
> +		goto out_free;
>   
>   	/* Try to redo what tcp_v4_send_synack did. */
>   	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
> @@ -452,5 +447,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
>   	 */
>   	if (ret)
>   		inet_sk(ret)->cork.fl.u.ip4 = fl4;
> -out:	return ret;
> +out:
> +	return ret;
> +out_free:
> +	reqsk_free(req);
> +out_drop:
> +	return NULL;
>   }

It looks like patches 5 and 6 don't use out_free and out_drop at all.
Are these changes still necessary?  In particular, the line
'goto out_drop' could simply be 'return NULL'.


> diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
> index cbee2df8a006..b8ef6efbb60e 100644
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -171,11 +171,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>   	if (!cookie_timestamp_decode(net, &tcp_opt))
>   		goto out;
>   
> -	ret = NULL;
>   	req = cookie_tcp_reqsk_alloc(&tcp6_request_sock_ops,
>   				     &tcp_request_sock_ipv6_ops, sk, skb);
>   	if (!req)
> -		goto out;
> +		goto out_drop;
>   
>   	ireq = inet_rsk(req);
>   	treq = tcp_rsk(req);
> @@ -263,5 +262,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>   	return ret;
>   out_free:
>   	reqsk_free(req);
> +out_drop:
>   	return NULL;
>   }
Same here!

* Re: [PATCH v1 bpf-next 03/11] tcp: Clean up goto labels in cookie_v[46]_check().
  2023-10-17  0:00   ` Kui-Feng Lee
@ 2023-10-17  0:30     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-17  0:30 UTC (permalink / raw)
  To: sinquersw
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu,
	martin.lau, mykolal, netdev, pabeni, sdf, song, yonghong.song

From: Kui-Feng Lee <sinquersw@gmail.com>
Date: Mon, 16 Oct 2023 17:00:39 -0700
> On 10/13/23 15:04, Kuniyuki Iwashima wrote:
> > We will add a SOCK_OPS hook to validate SYN Cookie.
> > 
> > We invoke the hook after allocating reqsk.  In case it fails,
> > we will respond with RST instead of just dropping the ACK.
> > 
> > Then, there would be more duplicated error handling patterns.
> > To avoid that, let's clean up goto labels.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >   net/ipv4/syncookies.c | 22 +++++++++++-----------
> >   net/ipv6/syncookies.c |  4 ++--
> >   2 files changed, 13 insertions(+), 13 deletions(-)
> > 
> > diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> > index 64280cf42667..b0cf6f4d66d8 100644
> > --- a/net/ipv4/syncookies.c
> > +++ b/net/ipv4/syncookies.c
> > @@ -369,11 +369,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
> >   	if (!cookie_timestamp_decode(net, &tcp_opt))
> >   		goto out;
> >   
> > -	ret = NULL;
> >   	req = cookie_tcp_reqsk_alloc(&tcp_request_sock_ops,
> >   				     &tcp_request_sock_ipv4_ops, sk, skb);
> >   	if (!req)
> > -		goto out;
> > +		goto out_drop;
> >   
> >   	ireq = inet_rsk(req);
> >   	treq = tcp_rsk(req);
> > @@ -405,10 +404,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
> >   	 */
> >   	RCU_INIT_POINTER(ireq->ireq_opt, tcp_v4_save_options(net, skb));
> >   
> > -	if (security_inet_conn_request(sk, skb, req)) {
> > -		reqsk_free(req);
> > -		goto out;
> > -	}
> > +	if (security_inet_conn_request(sk, skb, req))
> > +		goto out_free;
> >   
> >   	req->num_retrans = 0;
> >   
> > @@ -425,10 +422,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
> >   			   ireq->ir_loc_addr, th->source, th->dest, sk->sk_uid);
> >   	security_req_classify_flow(req, flowi4_to_flowi_common(&fl4));
> >   	rt = ip_route_output_key(net, &fl4);
> > -	if (IS_ERR(rt)) {
> > -		reqsk_free(req);
> > -		goto out;
> > -	}
> > +	if (IS_ERR(rt))
> > +		goto out_free;
> >   
> >   	/* Try to redo what tcp_v4_send_synack did. */
> >   	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
> > @@ -452,5 +447,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
> >   	 */
> >   	if (ret)
> >   		inet_sk(ret)->cork.fl.u.ip4 = fl4;
> > -out:	return ret;
> > +out:
> > +	return ret;
> > +out_free:
> > +	reqsk_free(req);
> > +out_drop:
> > +	return NULL;
> >   }
> 
> Looks like you don't use out_free and out_drop at all
> in the patch 5 & 6. Are these changes still necessary?
> Especially, the line 'goto out_drop' can be 'return NULL' concisely.

I think it's hard to follow a function where goto and return
are mixed, so I cleaned up the labels while at it.
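
For reference, the label layout can be sketched in plain userspace C as
below.  This is only an illustration, not the kernel code: demo_req,
demo_req_free() and demo_check() are hypothetical stand-ins for the request
socket, reqsk_free() and cookie_v4_check().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a request socket; not a kernel type. */
struct demo_req {
	int id;
};

/* Counts frees so that the error path can be observed from outside. */
int demo_free_count;

void demo_req_free(struct demo_req *req)
{
	assert(req != NULL);
	demo_free_count++;
	free(req);
}

/*
 * Returns a request on success and NULL on failure.
 * fail_alloc and fail_check simulate the two failure points.
 */
struct demo_req *demo_check(int fail_alloc, int fail_check)
{
	struct demo_req *req;

	req = fail_alloc ? NULL : malloc(sizeof(*req));
	if (!req)
		goto out_drop;	/* nothing to release yet */

	req->id = 1;

	if (fail_check)
		goto out_free;	/* req must be released */

	return req;

out_free:
	demo_req_free(req);
out_drop:
	return NULL;
}
```

Every failure after the allocation jumps to out_free and falls through to
out_drop, so the free is written once and each exit is a goto.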


* Re: [PATCH v1 bpf-next 11/11] selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB.
  2023-10-13 22:04 ` [PATCH v1 bpf-next 11/11] selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB Kuniyuki Iwashima
@ 2023-10-17  5:50   ` Martin KaFai Lau
  2023-10-17 16:29     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-17  5:50 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Kuniyuki Iwashima, bpf, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko

On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> This patch adds a test for BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB hooks.
> 
> BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook generates a hash using SipHash
> based on the 4-tuple.  The hash is split into ISN and TS.  MSS, ECN, SACK,
> and WScale are encoded into the lower 8 bits of ISN.
> 
>    ISN:
>      MSB                                   LSB
>      | 31 ... 8 | 7 6 | 5   | 4    | 3 2 1 0 |
>      | Hash_1   | MSS | ECN | SACK | WScale  |
> 
>    TS:
>      MSB                LSB
>      | 31 ... 8 | 7 ... 0 |
>      | Random   | Hash_2  |
> 
> BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB hook re-calculates the hash and validates
> the cookie.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> Currently, the validator is incomplete...
> 
> If this line is changed
> 
>      skops->replylong[0] = msstab[3];
> 
> to
>      skops->replylong[0] = msstab[mssind];
> 
> , we will get the error below during make:
> 
>      GEN-SKEL [test_progs] test_tcp_syncookie.skel.h
>    ...
>    Error: failed to open BPF object file: No such file or directory

I cannot reproduce this. Is there an error earlier than this? GEN-SKEL is probably
running this (make V=1 can tell):

tools/testing/selftests/bpf/tools/sbin/bpftool gen skeleton 
tools/testing/selftests/bpf/test_tcp_syncookie.bpf.linked3.o name 
test_tcp_syncookie > tools/testing/selftests/bpf/test_tcp_syncookie.skel.h

Add a "-d" to bpftool for more debug output: bpftool -d gen skeleton....


I also cannot compile the patch as-is in my environment:

In file included from progs/test_tcp_syncookie.c:6:
In file included from 
/data/users/kafai/fb-kernel/linux/tools/include/uapi/linux/tcp.h:22:
In file included from /usr/include/asm/byteorder.h:5:
In file included from /usr/include/linux/byteorder/little_endian.h:13:
/usr/include/linux/swab.h:136:8: error: unknown type name '__always_inline'
   136 | static __always_inline unsigned long __swab(const unsigned long y)

I have to add a "#include <linux/stddef.h>".


>      GEN-SKEL [test_progs-no_alu32] test_tcp_syncookie.skel.h
>    make: *** [Makefile:603: /home/ec2-user/kernel/bpf_syncookie/tools/testing/selftests/bpf/test_tcp_syncookie.skel.h] Error 254
>    make: *** Deleting file '/home/ec2-user/kernel/bpf_syncookie/tools/testing/selftests/bpf/test_tcp_syncookie.skel.h'
>    make: *** Waiting for unfinished jobs....



* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
                   ` (12 preceding siblings ...)
  2023-10-16 14:19 ` Willem de Bruijn
@ 2023-10-17  5:53 ` Martin KaFai Lau
  2023-10-17 16:48   ` Kuniyuki Iwashima
  13 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-17  5:53 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Kuniyuki Iwashima, bpf, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko

On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> server.  Our kernel module works at Netfilter input/output hooks and first
> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> for SYN+ACK, it looks up the corresponding request socket and overwrites
> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> complete 3WHS with the original ACK as is.

Does the current kernel module also use the timestamp bits differently?
(something like what patch 8 and patch 10 are trying to do)

> 
> This way, our SYN Proxy does not manage the ISN mappings and can stay
> stateless.  It's working very well for high-bandwidth services like
> multiple Tbps, but we are looking for a way to drop the dirty hack and
> further optimise the sequences.
> 
> If we could validate an arbitrary SYN Cookie on the backend server with
> BPF, the proxy would need not restore SYN nor pass it.  After validating
> ACK, the proxy node just needs to forward it, and then the server can do
> the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
> and create a connection from the ACK.
> 
> This series adds two SOCK_OPS hooks to generate and validate arbitrary
> SYN Cookie.  Each hook is invoked if BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG is
> set to the listening socket in advance by bpf_sock_ops_cb_flags_set().
> 
> The user interface looks like this:
> 
>    BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> 
>      input
>      |- bpf_sock_ops.sk           : 4-tuple
>      |- bpf_sock_ops.skb          : TCP header
>      |- bpf_sock_ops.args[0]      : MSS
>      `- bpf_sock_ops.args[1]      : BPF_SYNCOOKIE_XXX flags
> 
>      output
>      |- bpf_sock_ops.replylong[0] : ISN (SYN Cookie) ------.
>      `- bpf_sock_ops.replylong[1] : TS value -----------.  |
>                                                         |  |
>    BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB                      |  |
>                                                         |  |
>      input                                              |  |
>      |- bpf_sock_ops.sk           : 4-tuple             |  |
>      |- bpf_sock_ops.skb          : TCP header          |  |
>      |- bpf_sock_ops.args[0]      : ISN (SYN Cookie) <-----'
>      `- bpf_sock_ops.args[1]      : TS value <----------'
> 
>      output
>      |- bpf_sock_ops.replylong[0] : MSS
>      `- bpf_sock_ops.replylong[1] : BPF_SYNCOOKIE_XXX flags
> 
> To establish a connection from SYN Cookie, BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB
> hook must set a valid MSS to bpf_sock_ops.replylong[0], meaning that
> BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook must encode MSS to ISN or TS val to be
> restored in the validation hook.
> 
> If WScale, SACK, and ECN are detected to be available in SYN packet, the
> corresponding flags are passed to args[1] of BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> so that bpf prog need not parse the TCP header.  The same flags can be set
> to replylong[1] of BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB to enable each feature
> on the connection.
> 
> For details, please see each patch.  Here's an overview:
> 
>    patch 1 - 4 : Misc cleanup
>    patch 5, 6  : Add SOCK_OPS hook (only ISN is available here)
>    patch 7, 8  : Make TS val available as the second cookie storage
>    patch 9, 10 : Make WScale, SACK, and ECN configurable from ACK
>    patch 11    : selftest, need some help from BPF experts...

I cannot reproduce the issue. I commented on patch 11.

I have only scanned through the patchset at a high level. I will take a closer look.
Thanks.


> 
> [0]: https://netdev.bots.linux.dev/netconf/2023/kuniyuki.pdf



* Re: [PATCH v1 bpf-next 11/11] selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB.
  2023-10-17  5:50   ` Martin KaFai Lau
@ 2023-10-17 16:29     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-17 16:29 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu, mykolal,
	netdev, pabeni, sdf, song, yonghong.song

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Mon, 16 Oct 2023 22:50:44 -0700
> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> > This patch adds a test for BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB hooks.
> > 
> > BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook generates a hash using SipHash
> > based on the 4-tuple.  The hash is split into ISN and TS.  MSS, ECN, SACK,
> > and WScale are encoded into the lower 8 bits of ISN.
> > 
> >    ISN:
> >      MSB                                   LSB
> >      | 31 ... 8 | 7 6 | 5   | 4    | 3 2 1 0 |
> >      | Hash_1   | MSS | ECN | SACK | WScale  |
> > 
> >    TS:
> >      MSB                LSB
> >      | 31 ... 8 | 7 ... 0 |
> >      | Random   | Hash_2  |
> > 
> > BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB hook re-calculates the hash and validates
> > the cookie.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> > Currently, the validator is incomplete...
> > 
> > If this line is changed
> > 
> >      skops->replylong[0] = msstab[3];
> > 
> > to
> >      skops->replylong[0] = msstab[mssind];
> > 
> > , we will get the error below during make:
> > 
> >      GEN-SKEL [test_progs] test_tcp_syncookie.skel.h
> >    ...
> >    Error: failed to open BPF object file: No such file or directory
> 
> I cannot reproduce this. Is there an error earlier than this? GEN-SKEL is probably
> running this (make V=1 can tell):
> 
> tools/testing/selftests/bpf/tools/sbin/bpftool gen skeleton 
> tools/testing/selftests/bpf/test_tcp_syncookie.bpf.linked3.o name 
> test_tcp_syncookie > tools/testing/selftests/bpf/test_tcp_syncookie.skel.h
> 
> Add a "-d" to bpftool for more debug output: bpftool -d gen skeleton....

Somehow .rodata was 0 bytes while generating the skeleton, and after
removing `static` from `msstab[]`, it compiled successfully.

Thank you!

---8<---
$ tools/testing/selftests/bpf/tools/sbin/bpftool -d gen skeleton tools/testing/selftests/bpf/test_tcp_syncookie.bpf.linked3.o name test_tcp_syncookie > tools/testing/selftests/bpf/test_tcp_syncookie.skel.h
libbpf: loading object 'test_tcp_syncookie' from buffer
libbpf: elf: section(2) .symtab, size 432, link 1, flags 0, type=2
libbpf: elf: section(3) .text, size 2888, link 0, flags 6, type=1
libbpf: sec '.text': found program 'cookie_hash' at insn offset 0 (0 bytes), code size 361 insns (2888 bytes)
libbpf: elf: section(4) sockops, size 864, link 0, flags 6, type=1
libbpf: sec 'sockops': found program 'syncookie' at insn offset 0 (0 bytes), code size 108 insns (864 bytes)
libbpf: elf: section(5) license, size 4, link 0, flags 3, type=1
libbpf: license of test_tcp_syncookie is GPL
libbpf: elf: section(6) .maps, size 32, link 0, flags 3, type=1
libbpf: elf: section(7) .rodata.cst8, size 8, link 0, flags 12, type=1
libbpf: elf: section(8) .relsockops, size 48, link 2, flags 40, type=9
libbpf: elf: section(9) .BTF, size 3891, link 0, flags 0, type=1
libbpf: elf: section(10) .BTF.ext, size 2648, link 0, flags 0, type=1
libbpf: looking for externs among 18 symbols...
libbpf: collected 0 externs total
libbpf: sec '.rodata': failed to determine size from ELF: size 0, err -2
Error: failed to open BPF object file: No such file or directory
---8<---

---8<---
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_syncookie.c b/tools/testing/selftests/bpf/progs/test_tcp_syncookie.c
index 5d1fc928602b..19307567cc4c 100644
--- a/tools/testing/selftests/bpf/progs/test_tcp_syncookie.c
+++ b/tools/testing/selftests/bpf/progs/test_tcp_syncookie.c
@@ -63,7 +63,7 @@ static __u32 cookie_hash(struct bpf_sock_ops *skops)
 			    &test_key_siphash);
 }
 
-static const __u16 msstab[] = {
+const __u16 msstab[] = {
 	536,
 	1300,
 	1440,
@@ -137,7 +137,7 @@ static int check_syncookie(struct bpf_sock_ops *skops)
 		return CG_ERR;
 
 	/* msstab[mssind]; does not compile ... */
-	skops->replylong[0] = msstab[3];
+	skops->replylong[0] = msstab[mssind];
 	skops->replylong[1] = skops->args[0] & (BPF_SYNCOOKIE_ECN |
 						BPF_SYNCOOKIE_SACK |
 						BPF_SYNCOOKIE_WSCALE_MASK);
---8<---


> 
> 
> I also cannot compile the patch as-is in my environment:
> 
> In file included from progs/test_tcp_syncookie.c:6:
> In file included from 
> /data/users/kafai/fb-kernel/linux/tools/include/uapi/linux/tcp.h:22:
> In file included from /usr/include/asm/byteorder.h:5:
> In file included from /usr/include/linux/byteorder/little_endian.h:13:
> /usr/include/linux/swab.h:136:8: error: unknown type name '__always_inline'
>    136 | static __always_inline unsigned long __swab(const unsigned long y)
> 
> I have to add a "#include <linux/stddef.h>".

Will add it in v2.


> 
> 
> >      GEN-SKEL [test_progs-no_alu32] test_tcp_syncookie.skel.h
> >    make: *** [Makefile:603: /home/ec2-user/kernel/bpf_syncookie/tools/testing/selftests/bpf/test_tcp_syncookie.skel.h] Error 254
> >    make: *** Deleting file '/home/ec2-user/kernel/bpf_syncookie/tools/testing/selftests/bpf/test_tcp_syncookie.skel.h'
> >    make: *** Waiting for unfinished jobs....


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-17  5:53 ` Martin KaFai Lau
@ 2023-10-17 16:48   ` Kuniyuki Iwashima
  2023-10-18  6:19     ` Martin KaFai Lau
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-17 16:48 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu, mykolal,
	netdev, pabeni, sdf, song, yonghong.song

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Mon, 16 Oct 2023 22:53:15 -0700
> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> > Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> > After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> > server.  Our kernel module works at Netfilter input/output hooks and first
> > feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> > for SYN+ACK, it looks up the corresponding request socket and overwrites
> > tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> > complete 3WHS with the original ACK as is.
> 
> Does the current kernel module also use the timestamp bits differently?
> (something like what patch 8 and patch 10 are trying to do)

Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
if TS is in SYN.

But I thought someone would suggest making TS available so that we can
mock the default behaviour at least, and it would be more acceptable.

The selftest uses TS just to strengthen security by validating a 32-bit
hash.  Dropping part of the hash makes collisions more likely, but
24 bits were sufficient for us to reduce SYN flood to a manageable
level at the backend.
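
To make the 24-bit vs 32-bit split concrete, here is a rough userspace
sketch of the ISN/TS layout described in patch 11.  All demo_* names are
made up for illustration, the hash is passed in rather than computed with
SipHash, and taking Hash_1 from the upper 24 bits and Hash_2 from the lower
8 bits of one 32-bit hash is an assumption, not necessarily what the
selftest does.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions follow the ISN/TS diagrams in patch 11; names are made up. */
#define DEMO_WSCALE_MASK	0x0fu		/* ISN bits 3..0 */
#define DEMO_SACK		(1u << 4)	/* ISN bit 4 */
#define DEMO_ECN		(1u << 5)	/* ISN bit 5 */
#define DEMO_MSS_SHIFT		6		/* ISN bits 7..6: index into msstab[] */
#define DEMO_MSS_MASK		0x03u
#define DEMO_HASH1_MASK		0xffffff00u	/* ISN bits 31..8 */
#define DEMO_HASH2_MASK		0x000000ffu	/* TS bits 7..0 */

/* Pack the upper 24 bits of the hash and the TCP options into the ISN. */
uint32_t demo_encode_isn(uint32_t hash, uint32_t mssind, int ecn,
			 int sack, uint32_t wscale)
{
	uint32_t isn = hash & DEMO_HASH1_MASK;

	assert(mssind <= DEMO_MSS_MASK);
	assert(wscale <= DEMO_WSCALE_MASK);

	isn |= mssind << DEMO_MSS_SHIFT;
	if (ecn)
		isn |= DEMO_ECN;
	if (sack)
		isn |= DEMO_SACK;
	isn |= wscale;

	return isn;
}

/* Pack 24 random bits and the lower 8 bits of the hash into TS. */
uint32_t demo_encode_ts(uint32_t hash, uint32_t rnd)
{
	return (rnd << 8) | (hash & DEMO_HASH2_MASK);
}

/* Returns 1 if both the 24 bits in ISN and the 8 bits in TS match the hash. */
int demo_cookie_valid(uint32_t isn, uint32_t ts, uint32_t hash)
{
	if ((isn ^ hash) & DEMO_HASH1_MASK)
		return 0;
	if ((ts ^ hash) & DEMO_HASH2_MASK)
		return 0;
	return 1;
}

/* Recover the msstab[] index from the ISN. */
uint32_t demo_isn_mssind(uint32_t isn)
{
	return (isn >> DEMO_MSS_SHIFT) & DEMO_MSS_MASK;
}
```

Without TS, only the first comparison in demo_cookie_valid() would apply,
which corresponds to the 24-bit case.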


> 
> > 
> > This way, our SYN Proxy does not manage the ISN mappings and can stay
> > stateless.  It's working very well for high-bandwidth services like
> > multiple Tbps, but we are looking for a way to drop the dirty hack and
> > further optimise the sequences.
> > 
> > If we could validate an arbitrary SYN Cookie on the backend server with
> > BPF, the proxy would need not restore SYN nor pass it.  After validating
> > ACK, the proxy node just needs to forward it, and then the server can do
> > the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
> > and create a connection from the ACK.
> > 
> > This series adds two SOCK_OPS hooks to generate and validate arbitrary
> > SYN Cookie.  Each hook is invoked if BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG is
> > set to the listening socket in advance by bpf_sock_ops_cb_flags_set().
> > 
> > The user interface looks like this:
> > 
> >    BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> > 
> >      input
> >      |- bpf_sock_ops.sk           : 4-tuple
> >      |- bpf_sock_ops.skb          : TCP header
> >      |- bpf_sock_ops.args[0]      : MSS
> >      `- bpf_sock_ops.args[1]      : BPF_SYNCOOKIE_XXX flags
> > 
> >      output
> >      |- bpf_sock_ops.replylong[0] : ISN (SYN Cookie) ------.
> >      `- bpf_sock_ops.replylong[1] : TS value -----------.  |
> >                                                         |  |
> >    BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB                      |  |
> >                                                         |  |
> >      input                                              |  |
> >      |- bpf_sock_ops.sk           : 4-tuple             |  |
> >      |- bpf_sock_ops.skb          : TCP header          |  |
> >      |- bpf_sock_ops.args[0]      : ISN (SYN Cookie) <-----'
> >      `- bpf_sock_ops.args[1]      : TS value <----------'
> > 
> >      output
> >      |- bpf_sock_ops.replylong[0] : MSS
> >      `- bpf_sock_ops.replylong[1] : BPF_SYNCOOKIE_XXX flags
> > 
> > To establish a connection from SYN Cookie, BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB
> > hook must set a valid MSS to bpf_sock_ops.replylong[0], meaning that
> > BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook must encode MSS to ISN or TS val to be
> > restored in the validation hook.
> > 
> > If WScale, SACK, and ECN are detected to be available in SYN packet, the
> > corresponding flags are passed to args[1] of BPF_SOCK_OPS_GEN_SYNCOOKIE_CB
> > so that bpf prog need not parse the TCP header.  The same flags can be set
> > to replylong[1] of BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB to enable each feature
> > on the connection.
> > 
> > For details, please see each patch.  Here's an overview:
> > 
> >    patch 1 - 4 : Misc cleanup
> >    patch 5, 6  : Add SOCK_OPS hook (only ISN is available here)
> >    patch 7, 8  : Make TS val available as the second cookie storage
> >    patch 9, 10 : Make WScale, SACK, and ECN configurable from ACK
> >    patch 11    : selftest, need some help from BPF experts...
> 
> I cannot reproduce the issue. I commented on patch 11.
>
> I have only scanned through the patchset at a high level. I will take a closer look.
> Thanks.

I'll wait for your review before posting v2.
Thank you!


* Re: [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation SOCK_OPS hook.
  2023-10-13 22:04 ` [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation " Kuniyuki Iwashima
  2023-10-16 20:38   ` Stanislav Fomichev
@ 2023-10-17 16:52   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-17 16:52 UTC (permalink / raw)
  To: kuniyu
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, martin.lau,
	mykolal, netdev, pabeni, sdf, song, yonghong.song

From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Fri, 13 Oct 2023 15:04:28 -0700
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 514f1a4abdee..b1dd415863ff 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -317,6 +317,37 @@ struct request_sock *cookie_tcp_reqsk_alloc(const struct request_sock_ops *ops,
>  }
>  EXPORT_SYMBOL_GPL(cookie_tcp_reqsk_alloc);
>  
> +#if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
> +int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
> +{
> +	struct bpf_sock_ops_kern sock_ops;
> +
> +	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> +
> +	sock_ops.op = BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB;
> +	sock_ops.sk = req_to_sk(req);
> +	sock_ops.args[0] = tcp_rsk(req)->snt_isn;
> +
> +	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
> +
> +	if (BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk))
> +		goto err;
> +
> +	if (!sock_ops.replylong[0])
> +		goto err;

I noticed this test is insufficient to check for a valid MSS.
I'll use msstab[0] as the minimum valid MSS in v2.

---8<---
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 22353a9af52d..4af165fd48f9 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -287,6 +287,7 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
 	struct bpf_sock_ops_kern sock_ops;
 	struct net *net = sock_net(sk);
 	u32 options;
+	u16 min_mss;
 
 	if (tcp_opt->saw_tstamp) {
 		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
@@ -307,7 +308,8 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
 	if (BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk))
 		goto err;
 
-	if (!sock_ops.replylong[0])
+	min_mss = skb->protocol == htons(ETH_P_IP) ? msstab[0] : IPV6_MIN_MTU - 60;
+	if (sock_ops.replylong[0] < min_mss)
 		goto err;
 
 	options = sock_ops.replylong[1];
---8<---
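
The same bound can be checked in isolation with the userspace sketch below.
The demo_* names are hypothetical; 536 is msstab[0] in
net/ipv4/syncookies.c, 1280 is IPV6_MIN_MTU, and an is_ipv4 flag stands in
for the skb->protocol test.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_V4_MIN_MSS		536	/* msstab[0] in net/ipv4/syncookies.c */
#define DEMO_IPV6_MIN_MTU	1280	/* IPV6_MIN_MTU */
#define DEMO_V6_MIN_MSS		(DEMO_IPV6_MIN_MTU - 60)

/* 60 = 40-byte IPv6 header + 20-byte TCP header without options. */
static_assert(DEMO_V6_MIN_MSS == 1220, "IPv6 minimum MSS");

/* Returns 1 if the MSS reported by the BPF prog is at least the minimum. */
int demo_mss_is_valid(uint32_t mss, int is_ipv4)
{
	uint32_t min_mss = is_ipv4 ? DEMO_V4_MIN_MSS : DEMO_V6_MIN_MSS;

	return mss >= min_mss;
}
```

Compared with the old !replylong[0] test, this also rejects non-zero values
that are still too small to be a usable MSS.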



> +
> +	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
> +
> +	return sock_ops.replylong[0];
> +
> +err:
> +	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(bpf_skops_cookie_check);
> +#endif
> +
>  /* On input, sk is a listener.
>   * Output is listener if incoming packet would not create a child
>   *           NULL if memory could not be allocated.


* Re: [PATCH v1 bpf-next 05/11] bpf: tcp: Add SYN Cookie generation SOCK_OPS hook.
  2023-10-13 22:04 ` [PATCH v1 bpf-next 05/11] bpf: tcp: Add SYN Cookie generation SOCK_OPS hook Kuniyuki Iwashima
@ 2023-10-18  0:54   ` Martin KaFai Lau
  2023-10-18 17:00     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-18  0:54 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Kuniyuki Iwashima, bpf, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko

On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> This patch adds a new SOCK_OPS hook to generate arbitrary SYN Cookie.
> 
> When the kernel sends SYN Cookie to a client, the hook is invoked with
> bpf_sock_ops.op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB if the listener has
> BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG set by bpf_sock_ops_cb_flags_set().
> 
> The BPF program can access the following information to encode into
> ISN:
> 
>    bpf_sock_ops.sk      : 4-tuple
>    bpf_sock_ops.skb     : TCP header
>    bpf_sock_ops.args[0] : MSS
> 
> The program must encode MSS and set it to bpf_sock_ops.replylong[0],
> which will be looped back to the paired hook added in the following
> patch.
> 
> Note that we do not call tcp_synq_overflow() so that the BPF program
> can set its own expiration period.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>   include/uapi/linux/bpf.h       | 18 +++++++++++++++-
>   net/ipv4/tcp_input.c           | 38 +++++++++++++++++++++++++++++++++-
>   tools/include/uapi/linux/bpf.h | 18 +++++++++++++++-
>   3 files changed, 71 insertions(+), 3 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 7ba61b75bc0e..d3cc530613c0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6738,8 +6738,17 @@ enum {
>   	 * options first before the BPF program does.
>   	 */
>   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> +	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
> +	 *
> +	 * The bpf prog will be called to encode MSS into SYN Cookie with
> +	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
> +	 *
> +	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
> +	 * input and output.
> +	 */
> +	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
>   /* Mask of all currently supported cb flags */
> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>   };
>   
>   /* List of known BPF sock_ops operators.
> @@ -6852,6 +6861,13 @@ enum {
>   					 * by the kernel or the
>   					 * earlier bpf-progs.
>   					 */
> +	BPF_SOCK_OPS_GEN_SYNCOOKIE_CB,	/* Generate SYN Cookie (ISN of
> +					 * SYN+ACK).
> +					 *
> +					 * args[0]: MSS
> +					 *
> +					 * replylong[0]: ISN
> +					 */
>   };
>   
>   /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 584825ddd0a0..c86a737e4fe6 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6966,6 +6966,37 @@ u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops,
>   }
>   EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
>   
> +#if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
> +static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *req,
> +					  struct sk_buff *skb, __u32 *isn)
> +{
> +	struct bpf_sock_ops_kern sock_ops;
> +	int ret;
> +
> +	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> +
> +	sock_ops.op = BPF_SOCK_OPS_GEN_SYNCOOKIE_CB;
> +	sock_ops.sk = req_to_sk(req);
> +	sock_ops.args[0] = req->mss;
> +
> +	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
> +
> +	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
> +	if (ret)
> +		return ret;
> +
> +	*isn = sock_ops.replylong[0];

sock_ops.{replylong,reply} cannot be used here. afaik, no existing sockops hook
relies on {replylong,reply}. They share a union with args[4]. There could be a few
skops bpf progs in the same cgroup, and each of them runs one after another (e.g.
two skops progs want to generate a cookie).
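
A small userspace sketch of the aliasing problem, assuming a simplified
stand-in for the context: demo_sock_ops only mirrors the
args/reply/replylong union of the uapi 'struct bpf_sock_ops', and the two
demo_prog_* functions are hypothetical progs, not real BPF code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the leading fields of struct bpf_sock_ops. */
struct demo_sock_ops {
	uint32_t op;
	union {
		uint32_t args[4];	/* inputs from the kernel */
		uint32_t reply;
		uint32_t replylong[4];	/* outputs from the prog */
	};
};

static_assert(offsetof(struct demo_sock_ops, args) ==
	      offsetof(struct demo_sock_ops, replylong),
	      "args[] and replylong[] share storage");

/* First prog: reads MSS from args[0], then writes its cookie as output. */
void demo_prog_a(struct demo_sock_ops *ctx)
{
	uint32_t mss = ctx->args[0];

	ctx->replylong[0] = 0xc0000000u | mss;	/* overwrites args[0] */
}

/* Second prog on the same ctx: expects args[0] to still hold the MSS. */
uint32_t demo_prog_b_seen_mss(const struct demo_sock_ops *ctx)
{
	return ctx->args[0];
}
```

After demo_prog_a() runs, demo_prog_b_seen_mss() no longer returns the MSS
that the kernel passed in, which is the problem when more than one prog is
attached.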

I would prefer not to extend the uapi 'struct bpf_sock_ops' and then
sock_ops_convert_ctx_access(). Adding a member to the kernel 'struct
bpf_sock_addr_kern' could still be considered if it is really needed.

One option is to add a kfunc to allow the bpf prog to directly update the value of
the kernel obj (e.g. tcp_rsk(req)->snt_isn here).

Also, we need to allow a bpf prog to selectively generate a custom cookie for one
SYN but fall through to the kernel cookie for another SYN.



* Re: [PATCH v1 bpf-next 10/11] bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie.
  2023-10-13 22:04 ` [PATCH v1 bpf-next 10/11] bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie Kuniyuki Iwashima
@ 2023-10-18  1:08   ` Martin KaFai Lau
  2023-10-18 17:02     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-18  1:08 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Kuniyuki Iwashima, bpf, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko

On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> This patch allows BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB hook to enable WScale,
> SACK, and ECN by passing corresponding flags to bpf_sock_ops.replylong[1].
> 
> The same flags are passed to BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook as
> bpf_sock_ops.args[1] so that the BPF prog need not parse the TCP header to
> check if WScale, SACK, ECN, and TS are available in SYN.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>   include/uapi/linux/bpf.h       | 18 ++++++++++++++++++
>   net/ipv4/syncookies.c          | 20 ++++++++++++++++++++
>   net/ipv4/tcp_input.c           | 11 +++++++++++
>   tools/include/uapi/linux/bpf.h | 18 ++++++++++++++++++
>   4 files changed, 67 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 24f673d88c0d..cdae4dd5d797 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -6869,6 +6869,7 @@ enum {
>   					 * option.
>   					 *
>   					 * args[0]: MSS
> +					 * args[1]: BPF_SYNCOOKIE_XXX
>   					 *
>   					 * replylong[0]: ISN
>   					 * replylong[1]: TS
> @@ -6883,6 +6884,7 @@ enum {
>   					 * args[1]: TS
>   					 *
>   					 * replylong[0]: MSS
> +					 * replylong[1]: BPF_SYNCOOKIE_XXX
>   					 */
>   };
>   
> @@ -6970,6 +6972,22 @@ enum {
>   						 */
>   };
>   
> +/* arg[1] value for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
> + * replylong[1] for BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
> + *
> + * MSB                                LSB
> + * | 31 ... | 6  | 5   | 4    | 3 2 1 0 |
> + * |    ... | TS | ECN | SACK | WScale  |
> + */
> +enum {
> +	/* 0xf is invalid thus means that SYN did not have WScale. */
> +	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
> +	BPF_SYNCOOKIE_SACK		= (1 << 4),
> +	BPF_SYNCOOKIE_ECN		= (1 << 5),
> +	/* Only available for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB to check if SYN has TS */
> +	BPF_SYNCOOKIE_TS		= (1 << 6),
> +};

These details should not be exposed to uapi (more below).

> +
>   struct bpf_perf_event_value {
>   	__u64 counter;
>   	__u64 enabled;
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index ff979cc314da..22353a9af52d 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -286,6 +286,7 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
>   {
>   	struct bpf_sock_ops_kern sock_ops;
>   	struct net *net = sock_net(sk);
> +	u32 options;
>   
>   	if (tcp_opt->saw_tstamp) {
>   		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
> @@ -309,6 +310,25 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
>   	if (!sock_ops.replylong[0])
>   		goto err;
>   
> +	options = sock_ops.replylong[1];
> +
> +	if ((options & BPF_SYNCOOKIE_WSCALE_MASK) != BPF_SYNCOOKIE_WSCALE_MASK) {
> +		if (!READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
> +			goto err;
> +
> +		tcp_opt->wscale_ok = 1;
> +		tcp_opt->snd_wscale = options & BPF_SYNCOOKIE_WSCALE_MASK;
> +	}
> +
> +	if (options & BPF_SYNCOOKIE_SACK) {
> +		if (!READ_ONCE(net->ipv4.sysctl_tcp_sack))
> +			goto err;
> +
> +		tcp_opt->sack_ok = 1;
> +	}
> +
> +	inet_rsk(req)->ecn_ok = options & BPF_SYNCOOKIE_ECN;
> +
>   	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
>   
>   	return sock_ops.replylong[0];
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index feb44bff29ef..483e2f36afe5 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6970,14 +6970,25 @@ EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
>   static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *req,
>   					  struct sk_buff *skb, __u32 *isn)
>   {
> +	struct inet_request_sock *ireq = inet_rsk(req);
>   	struct bpf_sock_ops_kern sock_ops;
> +	u32 options;
>   	int ret;
>   
> +	options = ireq->wscale_ok ? ireq->snd_wscale : BPF_SYNCOOKIE_WSCALE_MASK;
> +	if (ireq->sack_ok)
> +		options |= BPF_SYNCOOKIE_SACK;
> +	if (ireq->ecn_ok)
> +		options |= BPF_SYNCOOKIE_ECN;
> +	if (ireq->tstamp_ok)
> +		options |= BPF_SYNCOOKIE_TS;

No need to set "options" (which becomes args[1]). sock_ops.sk is available to
the bpf prog, which can read it directly. The recent AF_UNIX bpf support
could be a reference for how bpf_cast_to_kern_ctx() and bpf_rdonly_cast() are
used.

https://lore.kernel.org/bpf/20231011185113.140426-10-daan.j.demeyer@gmail.com/

> +
>   	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
>   
>   	sock_ops.op = BPF_SOCK_OPS_GEN_SYNCOOKIE_CB;
>   	sock_ops.sk = req_to_sk(req);
>   	sock_ops.args[0] = req->mss;
> +	sock_ops.args[1] = options;
>   
>   	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
>   
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 24f673d88c0d..cdae4dd5d797 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -6869,6 +6869,7 @@ enum {
>   					 * option.
>   					 *
>   					 * args[0]: MSS
> +					 * args[1]: BPF_SYNCOOKIE_XXX
>   					 *
>   					 * replylong[0]: ISN
>   					 * replylong[1]: TS
> @@ -6883,6 +6884,7 @@ enum {
>   					 * args[1]: TS
>   					 *
>   					 * replylong[0]: MSS
> +					 * replylong[1]: BPF_SYNCOOKIE_XXX
>   					 */
>   };
>   
> @@ -6970,6 +6972,22 @@ enum {
>   						 */
>   };
>   
> +/* arg[1] value for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
> + * replylong[1] for BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
> + *
> + * MSB                                LSB
> + * | 31 ... | 6  | 5   | 4    | 3 2 1 0 |
> + * |    ... | TS | ECN | SACK | WScale  |
> + */
> +enum {
> +	/* 0xf is invalid thus means that SYN did not have WScale. */
> +	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
> +	BPF_SYNCOOKIE_SACK		= (1 << 4),
> +	BPF_SYNCOOKIE_ECN		= (1 << 5),
> +	/* Only available for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB to check if SYN has TS */
> +	BPF_SYNCOOKIE_TS		= (1 << 6),
> +};
> +
>   struct bpf_perf_event_value {
>   	__u64 counter;
>   	__u64 enabled;



* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-17 16:48   ` Kuniyuki Iwashima
@ 2023-10-18  6:19     ` Martin KaFai Lau
  2023-10-18  8:02       ` Eric Dumazet
  0 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-18  6:19 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, mykolal, netdev,
	pabeni, sdf, song, yonghong.song

On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Mon, 16 Oct 2023 22:53:15 -0700
>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
>>> server.  Our kernel module works at Netfilter input/output hooks and first
>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
>>> complete 3WHS with the original ACK as is.
>>
>> Does the current kernel module also use the timestamp bits differently?
>> (something like patch 8 and patch 10 trying to do)
> 
> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
> if TS is in SYN.
> 
> But I thought someone would suggest making TS available so that we can
> mock the default behaviour at least, and it would be more acceptable.
> 
> The selftest uses TS just to strengthen security by validating 32-bits
> hash.  Dropping a part of hash makes collision easier to happen, but
> 24-bits were sufficient for us to reduce SYN flood to the manageable
> level at the backend.

While enabling bpf to customize the syncookie (and timestamp), I want to explore
where else this could be done besides the tcp layer.

Have you thought about directly sending the SYNACK back at a lower layer like
tc/xdp after receiving the SYN? There are already bpf_tcp_{gen,check}_syncookie
helpers that allow doing this for performance reasons, to absorb a synflood. It
would be natural to extend them to handle the customized syncookie as well.

I think it should already be doable to send a SYNACK back with customized 
syncookie (and timestamp) at tc/xdp today.

When the ack is received, the prog@tc/xdp can verify the cookie. It will probably
need some new kfuncs to create the ireq and queue the child socket. The bpf prog
can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
kfuncs need some more thought. I think most of the bpf-side infra is ready,
e.g. acquire/release/ref-tracking...etc.






* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-18  6:19     ` Martin KaFai Lau
@ 2023-10-18  8:02       ` Eric Dumazet
  2023-10-18 17:20         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2023-10-18  8:02 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Kuniyuki Iwashima, andrii, ast, bpf, daniel, davem, dsahern,
	haoluo, john.fastabend, jolsa, kpsingh, kuba, kuni1840, mykolal,
	netdev, pabeni, sdf, song, yonghong.song

On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <martin.lau@linux.dev>
> > Date: Mon, 16 Oct 2023 22:53:15 -0700
> >> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> >>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> >>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> >>> server.  Our kernel module works at Netfilter input/output hooks and first
> >>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> >>> for SYN+ACK, it looks up the corresponding request socket and overwrites
> >>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> >>> complete 3WHS with the original ACK as is.
> >>
> >> Does the current kernel module also use the timestamp bits differently?
> >> (something like patch 8 and patch 10 trying to do)
> >
> > Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
> > if TS is in SYN.
> >
> > But I thought someone would suggest making TS available so that we can
> > mock the default behaviour at least, and it would be more acceptable.
> >
> > The selftest uses TS just to strengthen security by validating 32-bits
> > hash.  Dropping a part of hash makes collision easier to happen, but
> > 24-bits were sufficient for us to reduce SYN flood to the manageable
> > level at the backend.
>
> While enabling bpf to customize the syncookie (and timestamp), I want to explore
> where can this also be done other than at the tcp layer.
>
> Have you thought about directly sending the SYNACK back at a lower layer like
> tc/xdp after receiving the SYN? There are already bpf_tcp_{gen,check}_syncookie
> helper that allows to do this for the performance reason to absorb synflood. It
> will be natural to extend it to handle the customized syncookie also.
>
> I think it should already be doable to send a SYNACK back with customized
> syncookie (and timestamp) at tc/xdp today.
>
> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
> e.g. acquire/release/ref-tracking...etc.
>

I think I mostly agree with this.

I am rebasing a patch adding usec resolution to TCP TS, which we have used
for about 10 years at Google, because it is time to upstream it.

I am worried about more changes/conflicts caused by Kuniyuki's patch set...


* Re: [PATCH v1 bpf-next 05/11] bpf: tcp: Add SYN Cookie generation SOCK_OPS hook.
  2023-10-18  0:54   ` Martin KaFai Lau
@ 2023-10-18 17:00     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-18 17:00 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu, mykolal,
	netdev, pabeni, sdf, song, yonghong.song

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Tue, 17 Oct 2023 17:54:53 -0700
> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> > This patch adds a new SOCK_OPS hook to generate arbitrary SYN Cookie.
> > 
> > When the kernel sends SYN Cookie to a client, the hook is invoked with
> > bpf_sock_ops.op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB if the listener has
> > BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG set by bpf_sock_ops_cb_flags_set().
> > 
> > The BPF program can access the following information to encode into
> > ISN:
> > 
> >    bpf_sock_ops.sk      : 4-tuple
> >    bpf_sock_ops.skb     : TCP header
> >    bpf_sock_ops.args[0] : MSS
> > 
> > The program must encode MSS and set it to bpf_sock_ops.replylong[0],
> > which will be looped back to the paired hook added in the following
> > patch.
> > 
> > Note that we do not call tcp_synq_overflow() so that the BPF program
> > can set its own expiration period.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >   include/uapi/linux/bpf.h       | 18 +++++++++++++++-
> >   net/ipv4/tcp_input.c           | 38 +++++++++++++++++++++++++++++++++-
> >   tools/include/uapi/linux/bpf.h | 18 +++++++++++++++-
> >   3 files changed, 71 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 7ba61b75bc0e..d3cc530613c0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -6738,8 +6738,17 @@ enum {
> >   	 * options first before the BPF program does.
> >   	 */
> >   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> > +	/* Call bpf when the kernel generates SYN Cookie (ISN) for SYN+ACK.
> > +	 *
> > +	 * The bpf prog will be called to encode MSS into SYN Cookie with
> > +	 * sock_ops->op == BPF_SOCK_OPS_GEN_SYNCOOKIE_CB.
> > +	 *
> > +	 * Please refer to the comment in BPF_SOCK_OPS_GEN_SYNCOOKIE_CB for
> > +	 * input and output.
> > +	 */
> > +	BPF_SOCK_OPS_SYNCOOKIE_CB_FLAG = (1<<7),
> >   /* Mask of all currently supported cb flags */
> > -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> > +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
> >   };
> >   
> >   /* List of known BPF sock_ops operators.
> > @@ -6852,6 +6861,13 @@ enum {
> >   					 * by the kernel or the
> >   					 * earlier bpf-progs.
> >   					 */
> > +	BPF_SOCK_OPS_GEN_SYNCOOKIE_CB,	/* Generate SYN Cookie (ISN of
> > +					 * SYN+ACK).
> > +					 *
> > +					 * args[0]: MSS
> > +					 *
> > +					 * replylong[0]: ISN
> > +					 */
> >   };
> >   
> >   /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 584825ddd0a0..c86a737e4fe6 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -6966,6 +6966,37 @@ u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops,
> >   }
> >   EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
> >   
> > +#if IS_ENABLED(CONFIG_CGROUP_BPF) && IS_ENABLED(CONFIG_SYN_COOKIES)
> > +static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *req,
> > +					  struct sk_buff *skb, __u32 *isn)
> > +{
> > +	struct bpf_sock_ops_kern sock_ops;
> > +	int ret;
> > +
> > +	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> > +
> > +	sock_ops.op = BPF_SOCK_OPS_GEN_SYNCOOKIE_CB;
> > +	sock_ops.sk = req_to_sk(req);
> > +	sock_ops.args[0] = req->mss;
> > +
> > +	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
> > +
> > +	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
> > +	if (ret)
> > +		return ret;
> > +
> > +	*isn = sock_ops.replylong[0];
> 
> sock_ops.{replylong,reply} cannot be used. afaik, no existing sockops hook 
> relies on {replylong,reply}. It is a union of args[4]. There could be a few 
> skops bpf in the same cgrp and each of them will be run one after another. (eg. 
> two skops progs want to generate cookie).

Ah, I missed that case.  Looking at bpf_prog_run_array_cg(), multiple
SOCK_OPS progs can be attached and args[] is reused across them.  So we
cannot use replylong[] as the interface from the bpf prog.
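
To make the aliasing concrete, here is a minimal userspace sketch.  It is
not kernel code: struct skops_ctx, prog_a(), prog_b() and run_both() are
hypothetical stand-ins, and only the args/reply/replylong union layout is
taken from struct bpf_sock_ops_kern.  It assumes two progs run back to back
on the same context, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the union inside struct bpf_sock_ops_kern:
 * args[], reply and replylong[] all share the same storage.
 */
struct skops_ctx {
	union {
		uint32_t args[4];
		uint32_t reply;
		uint32_t replylong[4];
	};
};

/* Hypothetical prog A: reads MSS from args[0], then writes its ISN. */
static uint32_t prog_a(struct skops_ctx *ctx)
{
	uint32_t mss = ctx->args[0];

	ctx->replylong[0] = 0xdeadbeef;	/* this also overwrites args[0] */
	return mss;
}

/* Hypothetical prog B: runs next on the same ctx and expects MSS too. */
static uint32_t prog_b(struct skops_ctx *ctx)
{
	return ctx->args[0];
}

/* Run both progs in order; return what prog B observed as "MSS". */
static uint32_t run_both(uint32_t mss)
{
	struct skops_ctx ctx = { .args = { mss } };
	uint32_t seen_by_a = prog_a(&ctx);

	assert(seen_by_a == mss);	/* the first prog still sees the input */
	return prog_b(&ctx);
}
```

With an input MSS of 1460, the second prog reads back the first prog's ISN
instead of the MSS, which is why the reply cannot live in the same union as
the input.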


> 
> I don't prefer to extend the uapi 'struct bpf_sock_ops' and then the 
> sock_ops_convert_ctx_access(). Adding member to the kernel 'struct 
> bpf_sock_addr_kern' could still be considered if it is really needed.
> 
> One option is to add kfunc to allow the bpf prog to directly update the value of 
> the kernel obj (e.g. tcp_rsk(req)->snt_isn here).

Yes, we need to set snt_isn, mss, sack_ok etc based on _CB (if we
continue with SOCK_OPS).


> 
> Also, we need to allow a bpf prog to selectively generate custom cookie for one 
> SYN but fall-through to the kernel cookie for another SYN.

Initially I implemented the fallback, but the validation hook looked a bit
ugly (because of the reqsk allocation), so I removed the fallback flow.

Also, I thought it could be done with other hooks so that such a SYN would
be distributed to another listener.


* Re: [PATCH v1 bpf-next 10/11] bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie.
  2023-10-18  1:08   ` Martin KaFai Lau
@ 2023-10-18 17:02     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-18 17:02 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu, mykolal,
	netdev, pabeni, sdf, song, yonghong.song

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Tue, 17 Oct 2023 18:08:34 -0700
> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> > This patch allows BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB hook to enable WScale,
> > SACK, and ECN by passing corresponding flags to bpf_sock_ops.replylong[1].
> > 
> > The same flags are passed to BPF_SOCK_OPS_GEN_SYNCOOKIE_CB hook as
> > bpf_sock_ops.args[1] so that the BPF prog need not parse the TCP header to
> > check if WScale, SACK, ECN, and TS are available in SYN.
> > 
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >   include/uapi/linux/bpf.h       | 18 ++++++++++++++++++
> >   net/ipv4/syncookies.c          | 20 ++++++++++++++++++++
> >   net/ipv4/tcp_input.c           | 11 +++++++++++
> >   tools/include/uapi/linux/bpf.h | 18 ++++++++++++++++++
> >   4 files changed, 67 insertions(+)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 24f673d88c0d..cdae4dd5d797 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -6869,6 +6869,7 @@ enum {
> >   					 * option.
> >   					 *
> >   					 * args[0]: MSS
> > +					 * args[1]: BPF_SYNCOOKIE_XXX
> >   					 *
> >   					 * replylong[0]: ISN
> >   					 * replylong[1]: TS
> > @@ -6883,6 +6884,7 @@ enum {
> >   					 * args[1]: TS
> >   					 *
> >   					 * replylong[0]: MSS
> > +					 * replylong[1]: BPF_SYNCOOKIE_XXX
> >   					 */
> >   };
> >   
> > @@ -6970,6 +6972,22 @@ enum {
> >   						 */
> >   };
> >   
> > +/* arg[1] value for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB and
> > + * replylong[1] for BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB.
> > + *
> > + * MSB                                LSB
> > + * | 31 ... | 6  | 5   | 4    | 3 2 1 0 |
> > + * |    ... | TS | ECN | SACK | WScale  |
> > + */
> > +enum {
> > +	/* 0xf is invalid thus means that SYN did not have WScale. */
> > +	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
> > +	BPF_SYNCOOKIE_SACK		= (1 << 4),
> > +	BPF_SYNCOOKIE_ECN		= (1 << 5),
> > +	/* Only available for BPF_SOCK_OPS_GEN_SYNCOOKIE_CB to check if SYN has TS */
> > +	BPF_SYNCOOKIE_TS		= (1 << 6),
> > +};
> 
> These details should not be exposed to uapi (more below).
> 
> > +
> >   struct bpf_perf_event_value {
> >   	__u64 counter;
> >   	__u64 enabled;
> > diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> > index ff979cc314da..22353a9af52d 100644
> > --- a/net/ipv4/syncookies.c
> > +++ b/net/ipv4/syncookies.c
> > @@ -286,6 +286,7 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
> >   {
> >   	struct bpf_sock_ops_kern sock_ops;
> >   	struct net *net = sock_net(sk);
> > +	u32 options;
> >   
> >   	if (tcp_opt->saw_tstamp) {
> >   		if (!READ_ONCE(net->ipv4.sysctl_tcp_timestamps))
> > @@ -309,6 +310,25 @@ int bpf_skops_cookie_check(struct sock *sk, struct request_sock *req, struct sk_
> >   	if (!sock_ops.replylong[0])
> >   		goto err;
> >   
> > +	options = sock_ops.replylong[1];
> > +
> > +	if ((options & BPF_SYNCOOKIE_WSCALE_MASK) != BPF_SYNCOOKIE_WSCALE_MASK) {
> > +		if (!READ_ONCE(net->ipv4.sysctl_tcp_window_scaling))
> > +			goto err;
> > +
> > +		tcp_opt->wscale_ok = 1;
> > +		tcp_opt->snd_wscale = options & BPF_SYNCOOKIE_WSCALE_MASK;
> > +	}
> > +
> > +	if (options & BPF_SYNCOOKIE_SACK) {
> > +		if (!READ_ONCE(net->ipv4.sysctl_tcp_sack))
> > +			goto err;
> > +
> > +		tcp_opt->sack_ok = 1;
> > +	}
> > +
> > +	inet_rsk(req)->ecn_ok = options & BPF_SYNCOOKIE_ECN;
> > +
> >   	__NET_INC_STATS(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
> >   
> >   	return sock_ops.replylong[0];
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index feb44bff29ef..483e2f36afe5 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -6970,14 +6970,25 @@ EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss);
> >   static int bpf_skops_cookie_init_sequence(struct sock *sk, struct request_sock *req,
> >   					  struct sk_buff *skb, __u32 *isn)
> >   {
> > +	struct inet_request_sock *ireq = inet_rsk(req);
> >   	struct bpf_sock_ops_kern sock_ops;
> > +	u32 options;
> >   	int ret;
> >   
> > +	options = ireq->wscale_ok ? ireq->snd_wscale : BPF_SYNCOOKIE_WSCALE_MASK;
> > +	if (ireq->sack_ok)
> > +		options |= BPF_SYNCOOKIE_SACK;
> > +	if (ireq->ecn_ok)
> > +		options |= BPF_SYNCOOKIE_ECN;
> > +	if (ireq->tstamp_ok)
> > +		options |= BPF_SYNCOOKIE_TS;
> 
> No need to set "options" (which becomes args[1]). sock_ops.sk is available to 
> the bpf prog. The bpf prog can directly read it. The recent AF_UNIX bpf support 
> could be a reference on how the bpf_cast_to_kern_ctx() and bpf_rdonly_cast() are 
> used.
> 
> https://lore.kernel.org/bpf/20231011185113.140426-10-daan.j.demeyer@gmail.com/

I just tried bpf_cast_to_kern_ctx() and bpf_rdonly_cast() and found
it's quite useful, thanks!

If we want to set {sack_ok,ecn_ok,snd_wscale} in one shot, it would
be good to expose such flags and a helper.
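
As a rough userspace sketch of what "one shot" could look like, the bit
layout from the quoted patch can be packed and unpacked as below.  The enum
values are copied from the patch; struct syn_opts, syn_opts_pack() and
syn_opts_unpack() are hypothetical names for illustration only and are not
a proposed kernel or BPF API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout from the quoted patch. */
enum {
	/* 0xf is invalid as a shift count, so it means "no WScale in SYN". */
	BPF_SYNCOOKIE_WSCALE_MASK	= (1 << 4) - 1,
	BPF_SYNCOOKIE_SACK		= (1 << 4),
	BPF_SYNCOOKIE_ECN		= (1 << 5),
	BPF_SYNCOOKIE_TS		= (1 << 6),
};

/* Hypothetical container mirroring the ireq fields used in the patch. */
struct syn_opts {
	bool wscale_ok;
	uint8_t snd_wscale;
	bool sack_ok;
	bool ecn_ok;
	bool tstamp_ok;
};

/* Pack the options into one u32, as the GEN hook does in the patch. */
static uint32_t syn_opts_pack(const struct syn_opts *o)
{
	uint32_t options;

	/* TCP window scale shift counts are limited to 14. */
	assert(!o->wscale_ok || o->snd_wscale <= 14);

	options = o->wscale_ok ? o->snd_wscale : BPF_SYNCOOKIE_WSCALE_MASK;
	if (o->sack_ok)
		options |= BPF_SYNCOOKIE_SACK;
	if (o->ecn_ok)
		options |= BPF_SYNCOOKIE_ECN;
	if (o->tstamp_ok)
		options |= BPF_SYNCOOKIE_TS;

	return options;
}

/* Reverse of syn_opts_pack().  In the patch the TS bit is only defined
 * for the GEN hook; it is decoded here just to keep the pair symmetric.
 */
static struct syn_opts syn_opts_unpack(uint32_t options)
{
	uint32_t wscale = options & BPF_SYNCOOKIE_WSCALE_MASK;
	struct syn_opts o = {
		.wscale_ok = wscale != BPF_SYNCOOKIE_WSCALE_MASK,
		.sack_ok = !!(options & BPF_SYNCOOKIE_SACK),
		.ecn_ok = !!(options & BPF_SYNCOOKIE_ECN),
		.tstamp_ok = !!(options & BPF_SYNCOOKIE_TS),
	};

	o.snd_wscale = o.wscale_ok ? wscale : 0;
	return o;
}
```

For example, WScale 7 with SACK and TS but no ECN packs to 7 | 16 | 64,
and a SYN without any of the options packs to 0xf.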


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-18  8:02       ` Eric Dumazet
@ 2023-10-18 17:20         ` Kuniyuki Iwashima
  2023-10-18 21:47           ` Kui-Feng Lee
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-18 17:20 UTC (permalink / raw)
  To: edumazet
  Cc: andrii, ast, bpf, daniel, davem, dsahern, haoluo, john.fastabend,
	jolsa, kpsingh, kuba, kuni1840, kuniyu, martin.lau, mykolal,
	netdev, pabeni, sdf, song, yonghong.song

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 18 Oct 2023 10:02:51 +0200
> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
> > > From: Martin KaFai Lau <martin.lau@linux.dev>
> > > Date: Mon, 16 Oct 2023 22:53:15 -0700
> > >> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> > >>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> > >>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> > >>> server.  Our kernel module works at Netfilter input/output hooks and first
> > >>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> > >>> for SYN+ACK, it looks up the corresponding request socket and overwrites
> > >>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> > >>> complete 3WHS with the original ACK as is.
> > >>
> > >> Does the current kernel module also use the timestamp bits differently?
> > >> (something like patch 8 and patch 10 trying to do)
> > >
> > > Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
> > > if TS is in SYN.
> > >
> > > But I thought someone would suggest making TS available so that we can
> > > mock the default behaviour at least, and it would be more acceptable.
> > >
> > > The selftest uses TS just to strengthen security by validating 32-bits
> > > hash.  Dropping a part of hash makes collision easier to happen, but
> > > 24-bits were sufficient for us to reduce SYN flood to the manageable
> > > level at the backend.
> >
> > While enabling bpf to customize the syncookie (and timestamp), I want to explore
> > where can this also be done other than at the tcp layer.
> >
> > Have you thought about directly sending the SYNACK back at a lower layer like
> > tc/xdp after receiving the SYN?

Yes.  Actually, at netconf I mentioned the cookie generation hook will not
be necessary and should be replaced with XDP.


> > There are already bpf_tcp_{gen,check}_syncookie
> > helper that allows to do this for the performance reason to absorb synflood. It
> > will be natural to extend it to handle the customized syncookie also.

Maybe we need not even extend it and can use XDP, as discussed below.


> >
> > I think it should already be doable to send a SYNACK back with customized
> > syncookie (and timestamp) at tc/xdp today.
> >
> > When ack is received, the prog@tc/xdp can verify the cookie. It will probably
> > need some new kfuncs to create the ireq and queue the child socket. The bpf prog
> > can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
> > kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
> > e.g. acquire/release/ref-tracking...etc.
> >
> 
> I think I mostly agree with this.

I hadn't thought of a kfunc to create the ireq and queue it to the listener,
so cookie_v[46]_check() was the easiest place for me to extend, but now it
sounds like a kfunc would be the way to go.

Maybe we can move the core part of cookie_v[46]_check(), except for the
kernel cookie's validation, to __cookie_v[46]_check() and expose a wrapper
of it as a kfunc?

Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
etc) to the kfunc.  (It could still introduce some conflicts with Eric's
patch though...)


> 
> I am rebasing  a patch adding usec resolution to TCP TS,
> that we used for about 10 years at Google, because it is time to upstream it.
> 
> I am worried about more changes/conflicts caused by Kuniyuki patch set...


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-18 17:20         ` Kuniyuki Iwashima
@ 2023-10-18 21:47           ` Kui-Feng Lee
  2023-10-18 22:31             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Kui-Feng Lee @ 2023-10-18 21:47 UTC (permalink / raw)
  To: Kuniyuki Iwashima, edumazet
  Cc: andrii, ast, bpf, daniel, davem, dsahern, haoluo, john.fastabend,
	jolsa, kpsingh, kuba, kuni1840, martin.lau, mykolal, netdev,
	pabeni, sdf, song, yonghong.song



On 10/18/23 10:20, Kuniyuki Iwashima wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Wed, 18 Oct 2023 10:02:51 +0200
>> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
>>>> From: Martin KaFai Lau <martin.lau@linux.dev>
>>>> Date: Mon, 16 Oct 2023 22:53:15 -0700
>>>>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
>>>>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
>>>>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
>>>>>> server.  Our kernel module works at Netfilter input/output hooks and first
>>>>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
>>>>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
>>>>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
>>>>>> complete 3WHS with the original ACK as is.
>>>>>
>>>>> Does the current kernel module also use the timestamp bits differently?
>>>>> (something like patch 8 and patch 10 trying to do)
>>>>
>>>> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
>>>> if TS is in SYN.
>>>>
>>>> But I thought someone would suggest making TS available so that we can
>>>> mock the default behaviour at least, and it would be more acceptable.
>>>>
>>>> The selftest uses TS just to strengthen security by validating 32-bits
>>>> hash.  Dropping a part of hash makes collision easier to happen, but
>>>> 24-bits were sufficient for us to reduce SYN flood to the manageable
>>>> level at the backend.
>>>
>>> While enabling bpf to customize the syncookie (and timestamp), I want to explore
>>> where can this also be done other than at the tcp layer.
>>>
>>> Have you thought about directly sending the SYNACK back at a lower layer like
>>> tc/xdp after receiving the SYN?
> 
> Yes.  Actually, at netconf I mentioned the cookie generation hook will not
> be necessary and should be replaced with XDP.
> 
> 
>>> There are already bpf_tcp_{gen,check}_syncookie
>>> helper that allows to do this for the performance reason to absorb synflood. It
>>> will be natural to extend it to handle the customized syncookie also.
> 
> Maybe we even need not extend it and can use XDP as said below.
> 
> 
>>>
>>> I think it should already be doable to send a SYNACK back with customized
>>> syncookie (and timestamp) at tc/xdp today.
>>>
>>> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
>>> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
>>> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
>>> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
>>> e.g. acquire/release/ref-tracking...etc.
>>>
>>
>> I think I mostly agree with this.
> 
> I didn't come up with kfunc to create ireq and queue it to listener, so
> cookie_v[46]_check() were best place for me to extend easily, but now it
> sounds like kfunc would be the way to go.
> 
> Maybe we can move the core part of cookie_v[46]_check() except for kernel
> cookie's validation to __cookie_v[46]_check() and expose a wrapper of it
> as kfunc ?
> 
> Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
> etc) to the kfunc.  (It could still introduce some conflicts with Eric's
> patch though...)

Does that mean the packets handled in this way (in XDP) will skip
netfilter entirely?


> 
> 
>>
>> I am rebasing  a patch adding usec resolution to TCP TS,
>> that we used for about 10 years at Google, because it is time to upstream it.
>>
>> I am worried about more changes/conflicts caused by Kuniyuki patch set...
> 


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-18 21:47           ` Kui-Feng Lee
@ 2023-10-18 22:31             ` Kuniyuki Iwashima
  2023-10-19  7:25               ` Martin KaFai Lau
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-18 22:31 UTC (permalink / raw)
  To: sinquersw
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu,
	martin.lau, mykolal, netdev, pabeni, sdf, song, yonghong.song

From: Kui-Feng Lee <sinquersw@gmail.com>
Date: Wed, 18 Oct 2023 14:47:43 -0700
> On 10/18/23 10:20, Kuniyuki Iwashima wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Wed, 18 Oct 2023 10:02:51 +0200
> >> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>
> >>> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
> >>>> From: Martin KaFai Lau <martin.lau@linux.dev>
> >>>> Date: Mon, 16 Oct 2023 22:53:15 -0700
> >>>>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> >>>>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> >>>>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> >>>>>> server.  Our kernel module works at Netfilter input/output hooks and first
> >>>>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> >>>>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
> >>>>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> >>>>>> complete 3WHS with the original ACK as is.
> >>>>>
> >>>>> Does the current kernel module also use the timestamp bits differently?
> >>>>> (something like patch 8 and patch 10 trying to do)
> >>>>
> >>>> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
> >>>> if TS is in SYN.
> >>>>
> >>>> But I thought someone would suggest making TS available so that we can
> >>>> mock the default behaviour at least, and it would be more acceptable.
> >>>>
> >>>> The selftest uses TS just to strengthen security by validating 32-bits
> >>>> hash.  Dropping a part of hash makes collision easier to happen, but
> >>>> 24-bits were sufficient for us to reduce SYN flood to the manageable
> >>>> level at the backend.
> >>>
> >>> While enabling bpf to customize the syncookie (and timestamp), I want to explore
> >>> where can this also be done other than at the tcp layer.
> >>>
> >>> Have you thought about directly sending the SYNACK back at a lower layer like
> >>> tc/xdp after receiving the SYN?
> > 
> > Yes.  Actually, at netconf I mentioned the cookie generation hook will not
> > be necessary and should be replaced with XDP.
> > 
> > 
> >>> There are already bpf_tcp_{gen,check}_syncookie
> >>> helper that allows to do this for the performance reason to absorb synflood. It
> >>> will be natural to extend it to handle the customized syncookie also.
> > 
> > Maybe we even need not extend it and can use XDP as said below.
> > 
> > 
> >>>
> >>> I think it should already be doable to send a SYNACK back with customized
> >>> syncookie (and timestamp) at tc/xdp today.
> >>>
> >>> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
> >>> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
> >>> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
> >>> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
> >>> e.g. acquire/release/ref-tracking...etc.
> >>>
> >>
> >> I think I mostly agree with this.
> > 
> > I didn't come up with kfunc to create ireq and queue it to listener, so
> > cookie_v[46]_check() were best place for me to extend easily, but now it
> > sounds like kfunc would be the way to go.
> > 
> > Maybe we can move the core part of cookie_v[46]_check() except for kernel
> > cookie's validation to __cookie_v[46]_check() and expose a wrapper of it
> > as kfunc ?
> > 
> > Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
> > etc) to the kfunc.  (It could still introduce some conflicts with Eric's
> > patch though...)
> 
> Does that mean the packets handled in this way (in XDP) will skip all 
> netfilter at all?

Good point.

If we don't want to skip other layers, maybe we can use tc?

1) allocate ireq and set sack_ok etc with a kfunc
2) bpf_sk_assign() to set ireq to skb (this could be done in the kfunc above)
3) let inet_steal_sock() return req->sk_listener if not sk_fullsock(sk)
4) if skb->sk is a reqsk in cookie_v[46]_check(), skip validation and
   req allocation and create the full sk


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-18 22:31             ` Kuniyuki Iwashima
@ 2023-10-19  7:25               ` Martin KaFai Lau
  2023-10-19 18:01                 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-19  7:25 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, mykolal, netdev,
	pabeni, sdf, song, yonghong.song, sinquersw

On 10/18/23 3:31 PM, Kuniyuki Iwashima wrote:
> From: Kui-Feng Lee <sinquersw@gmail.com>
> Date: Wed, 18 Oct 2023 14:47:43 -0700
>> On 10/18/23 10:20, Kuniyuki Iwashima wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>> Date: Wed, 18 Oct 2023 10:02:51 +0200
>>>> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>
>>>>> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
>>>>>> From: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>> Date: Mon, 16 Oct 2023 22:53:15 -0700
>>>>>>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
>>>>>>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
>>>>>>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
>>>>>>>> server.  Our kernel module works at Netfilter input/output hooks and first
>>>>>>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
>>>>>>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
>>>>>>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
>>>>>>>> complete 3WHS with the original ACK as is.
>>>>>>>
>>>>>>> Does the current kernel module also use the timestamp bits differently?
>>>>>>> (something like patch 8 and patch 10 trying to do)
>>>>>>
>>>>>> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
>>>>>> if TS is in SYN.
>>>>>>
>>>>>> But I thought someone would suggest making TS available so that we can
>>>>>> mock the default behaviour at least, and it would be more acceptable.
>>>>>>
>>>>>> The selftest uses TS just to strengthen security by validating 32-bits
>>>>>> hash.  Dropping a part of hash makes collision easier to happen, but
>>>>>> 24-bits were sufficient for us to reduce SYN flood to the manageable
>>>>>> level at the backend.
>>>>>
>>>>> While enabling bpf to customize the syncookie (and timestamp), I want to explore
>>>>> where can this also be done other than at the tcp layer.
>>>>>
>>>>> Have you thought about directly sending the SYNACK back at a lower layer like
>>>>> tc/xdp after receiving the SYN?
>>>
>>> Yes.  Actually, at netconf I mentioned the cookie generation hook will not
>>> be necessary and should be replaced with XDP.

Right, it is also what I have been thinking when seeing the 
BPF_SOCK_OPS_GEN_SYNCOOKIE_CB carrying the bpf generated timestamp to the 
tcp_make_synack. It feels like trying hard to work with the tcp want_cookie 
logic while there is an existing better alternative in tc/xdp to deal with synflood.

>>>
>>>
>>>>> There are already bpf_tcp_{gen,check}_syncookie
>>>>> helper that allows to do this for the performance reason to absorb synflood. It
>>>>> will be natural to extend it to handle the customized syncookie also.
>>>
>>> Maybe we even need not extend it and can use XDP as said below.
>>>
>>>
>>>>>
>>>>> I think it should already be doable to send a SYNACK back with customized
>>>>> syncookie (and timestamp) at tc/xdp today.
>>>>>
>>>>> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
>>>>> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
>>>>> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
>>>>> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
>>>>> e.g. acquire/release/ref-tracking...etc.
>>>>>
>>>>
>>>> I think I mostly agree with this.
>>>
>>> I didn't come up with kfunc to create ireq and queue it to listener, so
>>> cookie_v[46]_check() were best place for me to extend easily, but now it
>>> sounds like kfunc would be the way to go.
>>>
>>> Maybe we can move the core part of cookie_v[46]_check() except for kernel
>>> cookie's validation to __cookie_v[46]_check() and expose a wrapper of it
>>> as kfunc ?
>>>
>>> Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
>>> etc) to the kfunc.  (It could still introduce some conflicts with Eric's
>>> patch though...)
>>
>> Does that mean the packets handled in this way (in XDP) will skip all
>> netfilter at all?
> 
> Good point.
> 
> If we want not to skip other layers, maybe we can use tc ?
> 
> 1) allocate ireq and set sack_ok etc with kfunc
> 2) bpf_sk_assign() to set ireq to skb (this could be done in kfunc above)
> 3) let inet_steal_sock() return req->sk_listener if not sk_fullsock(sk)
> 4) if skb->sk is reqsk in cookie_v[46]_check(), skip validation and
>     req allocation and create full sk

I haven't looked at the details.  The above feels reasonable, and it would be
nice if it works out.  I don't know if the skb at tc can be used in
cookie_v[46]_check() as is.  It probably needs more thought.
[ note, xdp does not have skb. ]

Regarding the "allocate ireq and set sack_ok etc with kfunc", do you think it
would be useful (and potentially cleaner) even for
BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB, if we need to go back to considering skops?
Then only the BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB is needed, and xdp/tc can
generate the SYNACK.  xdp/tc can still do the check and drop the bad ACK
earlier in the stack.



* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-19  7:25               ` Martin KaFai Lau
@ 2023-10-19 18:01                 ` Kuniyuki Iwashima
  2023-10-20 19:59                   ` Martin KaFai Lau
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-19 18:01 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu, mykolal,
	netdev, pabeni, sdf, sinquersw, song, yonghong.song

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Thu, 19 Oct 2023 00:25:00 -0700
> On 10/18/23 3:31 PM, Kuniyuki Iwashima wrote:
> > From: Kui-Feng Lee <sinquersw@gmail.com>
> > Date: Wed, 18 Oct 2023 14:47:43 -0700
> >> On 10/18/23 10:20, Kuniyuki Iwashima wrote:
> >>> From: Eric Dumazet <edumazet@google.com>
> >>> Date: Wed, 18 Oct 2023 10:02:51 +0200
> >>>> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>>
> >>>>> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
> >>>>>> From: Martin KaFai Lau <martin.lau@linux.dev>
> >>>>>> Date: Mon, 16 Oct 2023 22:53:15 -0700
> >>>>>>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> >>>>>>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> >>>>>>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> >>>>>>>> server.  Our kernel module works at Netfilter input/output hooks and first
> >>>>>>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> >>>>>>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
> >>>>>>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> >>>>>>>> complete 3WHS with the original ACK as is.
> >>>>>>>
> >>>>>>> Does the current kernel module also use the timestamp bits differently?
> >>>>>>> (something like patch 8 and patch 10 trying to do)
> >>>>>>
> >>>>>> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
> >>>>>> if TS is in SYN.
> >>>>>>
> >>>>>> But I thought someone would suggest making TS available so that we can
> >>>>>> mock the default behaviour at least, and it would be more acceptable.
> >>>>>>
> >>>>>> The selftest uses TS just to strengthen security by validating 32-bits
> >>>>>> hash.  Dropping a part of hash makes collision easier to happen, but
> >>>>>> 24-bits were sufficient for us to reduce SYN flood to the manageable
> >>>>>> level at the backend.
> >>>>>
> >>>>> While enabling bpf to customize the syncookie (and timestamp), I want to explore
> >>>>> where can this also be done other than at the tcp layer.
> >>>>>
> >>>>> Have you thought about directly sending the SYNACK back at a lower layer like
> >>>>> tc/xdp after receiving the SYN?
> >>>
> >>> Yes.  Actually, at netconf I mentioned the cookie generation hook will not
> >>> be necessary and should be replaced with XDP.
> 
> Right, it is also what I have been thinking when seeing the 
> BPF_SOCK_OPS_GEN_SYNCOOKIE_CB carrying the bpf generated timestamp to the 
> tcp_make_synack. It feels like trying hard to work with the tcp want_cookie 
> logic while there is an existing better alternative in tc/xdp to deal with synflood.
> 
> >>>
> >>>
> >>>>> There are already bpf_tcp_{gen,check}_syncookie
> >>>>> helper that allows to do this for the performance reason to absorb synflood. It
> >>>>> will be natural to extend it to handle the customized syncookie also.
> >>>
> >>> Maybe we even need not extend it and can use XDP as said below.
> >>>
> >>>
> >>>>>
> >>>>> I think it should already be doable to send a SYNACK back with customized
> >>>>> syncookie (and timestamp) at tc/xdp today.
> >>>>>
> >>>>> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
> >>>>> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
> >>>>> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
> >>>>> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
> >>>>> e.g. acquire/release/ref-tracking...etc.
> >>>>>
> >>>>
> >>>> I think I mostly agree with this.
> >>>
> >>> I didn't come up with kfunc to create ireq and queue it to listener, so
> >>> cookie_v[46]_check() were best place for me to extend easily, but now it
> >>> sounds like kfunc would be the way to go.
> >>>
> >>> Maybe we can move the core part of cookie_v[46]_check() except for kernel
> >>> cookie's validation to __cookie_v[46]_check() and expose a wrapper of it
> >>> as kfunc ?
> >>>
> >>> Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
> >>> etc) to the kfunc.  (It could still introduce some conflicts with Eric's
> >>> patch though...)
> >>
> >> Does that mean the packets handled in this way (in XDP) will skip all
> >> netfilter at all?
> > 
> > Good point.
> > 
> > If we want not to skip other layers, maybe we can use tc ?
> > 
> > 1) allocate ireq and set sack_ok etc with kfunc
> > 2) bpf_sk_assign() to set ireq to skb (this could be done in kfunc above)
> > 3) let inet_steal_sock() return req->sk_listener if not sk_fullsock(sk)
> > 4) if skb->sk is reqsk in cookie_v[46]_check(), skip validation and
> >     req allocation and create full sk
> 
> Haven't looked at the details. The above feels reasonable and would be nice if 
> it works out. don't know if the skb at tc can be used in cookie_v[46]_check() as 
> is. It probably needs more thoughts.  [ note, xdp does not have skb. ]
> 
> Regarding the "allocate ireq and set sack_ok etc with kfunc", do you think it 
> will be useful (and potentially cleaner) even for the 
> BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB if it needed to go back to consider skops? Then 
> only do the BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB and the xdp/tc can generate SYNACK. 
> The xdp/tc can still do the check and drop the bad ACK earlier in the stack.

kfunc would be useful if we want to fall back to the default
validation, but I think we should not allocate ireq in kfunc.

The SOCK_OPS prog only returns a binary value.  If we decide whether
we skip validation or not based on kfunc call (ireq allocation), the
flow would be like :

  1. CG_OK & ireq is allocated -> skip validation and req allocation
  2. CG_OK & no ireq           -> default validation
  3. CG_ERR                    -> RST

The problem here is that if kfunc fails with -ENOMEM and cookie
is valid, we need a way to tell the kernel to drop the ACK instead
of sending RST.  (I hope the prog could return CG_DROP...)

If we allocate ireq first, it would be cleaner as bpf need not care
about the drop path.

  1. CG_OK & mss is set -> skip validation
  2. CG_OK & no mss set -> default validation
  3. CG_ERR             -> RST
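
To compare the two flows above side by side, here is a small userspace
model.  It is only a sketch for discussion: the enum and function names
are hypothetical and not kernel API, and CG_OK / CG_ERR simply stand for
the binary SOCK_OPS verdict.

```c
#include <stdbool.h>

/* Hypothetical model of the discussion above, not kernel code. */
enum verdict { CG_ERR = 0, CG_OK = 1 };

enum action {
	ACT_SKIP_VALIDATION,	/* trust the BPF prog's cookie check */
	ACT_DEFAULT_VALIDATION,	/* fall back to the kernel's cookie check */
	ACT_RST,		/* reject the ACK with a reset */
};

/* First flow: the prog allocates ireq through a kfunc. */
static enum action flow_kfunc_alloc(enum verdict v, bool ireq_allocated)
{
	if (v == CG_ERR)
		return ACT_RST;
	return ireq_allocated ? ACT_SKIP_VALIDATION : ACT_DEFAULT_VALIDATION;
}

/* Second flow: the kernel pre-allocates ireq, the prog only sets mss. */
static enum action flow_prealloc(enum verdict v, bool mss_set)
{
	if (v == CG_ERR)
		return ACT_RST;
	return mss_set ? ACT_SKIP_VALIDATION : ACT_DEFAULT_VALIDATION;
}
```

In this model, neither return path of flow_kfunc_alloc() can express
"drop the ACK silently", which is the gap when the kfunc fails with
-ENOMEM for a valid cookie.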


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-19 18:01                 ` Kuniyuki Iwashima
@ 2023-10-20 19:59                   ` Martin KaFai Lau
  2023-10-20 23:10                     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-20 19:59 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, mykolal, netdev,
	pabeni, sdf, sinquersw, song, yonghong.song

On 10/19/23 11:01 AM, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Thu, 19 Oct 2023 00:25:00 -0700
>> On 10/18/23 3:31 PM, Kuniyuki Iwashima wrote:
>>> From: Kui-Feng Lee <sinquersw@gmail.com>
>>> Date: Wed, 18 Oct 2023 14:47:43 -0700
>>>> On 10/18/23 10:20, Kuniyuki Iwashima wrote:
>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>> Date: Wed, 18 Oct 2023 10:02:51 +0200
>>>>>> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>>>
>>>>>>> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
>>>>>>>> From: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>>>> Date: Mon, 16 Oct 2023 22:53:15 -0700
>>>>>>>>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
>>>>>>>>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
>>>>>>>>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
>>>>>>>>>> server.  Our kernel module works at Netfilter input/output hooks and first
>>>>>>>>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
>>>>>>>>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
>>>>>>>>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
>>>>>>>>>> complete 3WHS with the original ACK as is.
>>>>>>>>>
>>>>>>>>> Does the current kernel module also use the timestamp bits differently?
>>>>>>>>> (something like patch 8 and patch 10 trying to do)
>>>>>>>>
>>>>>>>> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
>>>>>>>> if TS is in SYN.
>>>>>>>>
>>>>>>>> But I thought someone would suggest making TS available so that we can
>>>>>>>> mock the default behaviour at least, and it would be more acceptable.
>>>>>>>>
>>>>>>>> The selftest uses TS just to strengthen security by validating 32-bits
>>>>>>>> hash.  Dropping a part of hash makes collision easier to happen, but
>>>>>>>> 24-bits were sufficient for us to reduce SYN flood to the manageable
>>>>>>>> level at the backend.
>>>>>>>
>>>>>>> While enabling bpf to customize the syncookie (and timestamp), I want to explore
>>>>>>> where can this also be done other than at the tcp layer.
>>>>>>>
>>>>>>> Have you thought about directly sending the SYNACK back at a lower layer like
>>>>>>> tc/xdp after receiving the SYN?
>>>>>
>>>>> Yes.  Actually, at netconf I mentioned the cookie generation hook will not
>>>>> be necessary and should be replaced with XDP.
>>
>> Right, it is also what I have been thinking when seeing the
>> BPF_SOCK_OPS_GEN_SYNCOOKIE_CB carrying the bpf generated timestamp to the
>> tcp_make_synack. It feels like trying hard to work with the tcp want_cookie
>> logic while there is an existing better alternative in tc/xdp to deal with synflood.
>>
>>>>>
>>>>>
>>>>>>> There are already bpf_tcp_{gen,check}_syncookie
>>>>>>> helper that allows to do this for the performance reason to absorb synflood. It
>>>>>>> will be natural to extend it to handle the customized syncookie also.
>>>>>
>>>>> Maybe we even need not extend it and can use XDP as said below.
>>>>>
>>>>>
>>>>>>>
>>>>>>> I think it should already be doable to send a SYNACK back with customized
>>>>>>> syncookie (and timestamp) at tc/xdp today.
>>>>>>>
>>>>>>> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
>>>>>>> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
>>>>>>> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
>>>>>>> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
>>>>>>> e.g. acquire/release/ref-tracking...etc.
>>>>>>>
>>>>>>
>>>>>> I think I mostly agree with this.
>>>>>
>>>>> I didn't come up with kfunc to create ireq and queue it to listener, so
>>>>> cookie_v[46]_check() were best place for me to extend easily, but now it
>>>>> sounds like kfunc would be the way to go.
>>>>>
>>>>> Maybe we can move the core part of cookie_v[46]_check() except for kernel
>>>>> cookie's validation to __cookie_v[46]_check() and expose a wrapper of it
>>>>> as kfunc ?
>>>>>
>>>>> Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
>>>>> etc) to the kfunc.  (It could still introduce some conflicts with Eric's
>>>>> patch though...)
>>>>
>>>> Does that mean the packets handled in this way (in XDP) will skip all
>>>> netfilter at all?
>>>
>>> Good point.
>>>
>>> If we want not to skip other layers, maybe we can use tc ?
>>>
>>> 1) allocate ireq and set sack_ok etc with kfunc
>>> 2) bpf_sk_assign() to set ireq to skb (this could be done in kfunc above)
>>> 3) let inet_steal_sock() return req->sk_listener if not sk_fullsock(sk)
>>> 4) if skb->sk is reqsk in cookie_v[46]_check(), skip validation and
>>>      req allocation and create full sk
>>
>> Haven't looked at the details. The above feels reasonable and would be nice if
>> it works out. don't know if the skb at tc can be used in cookie_v[46]_check() as
>> is. It probably needs more thoughts.  [ note, xdp does not have skb. ]
>>
>> Regarding the "allocate ireq and set sack_ok etc with kfunc", do you think it
>> will be useful (and potentially cleaner) even for the
>> BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB if it needed to go back to consider skops? Then
>> only do the BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB and the xdp/tc can generate SYNACK.
>> The xdp/tc can still do the check and drop the bad ACK earlier in the stack.
> 
> kfunc would be useful if we want to fall back to the default
> validation, but I think we should not allocate ireq in kfunc.
> 
> The SOCK_OPS prog only returns a binary value.  If we decide whether
> we skip validation or not based on kfunc call (ireq allocation), the
> flow would be like :
> 
>    1. CG_OK & ireq is allocated -> skip validation and req allocation
>    2. CG_OK & no ireq           -> default validation
>    3. CG_ERR                    -> RST
> 
> The problem here is that if kfunc fails with -ENOMEM and cookie
> is valid, we need a way to tell the kernel to drop the ACK instead
> of sending RST.  (I hope the prog could return CG_DROP...)

The bpf_set_retval() helper allows the cgroup bpf prog to return -ENOMEM.  Take
a look at how __cgroup_bpf_run_filter_getsockopt is using the return value of
bpf_prog_run_array_cg() and an example in progs/cgroup_getset_retval_getsockopt.c.
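
As a rough userspace illustration of why a settable return value helps:
with more than two outcomes, the caller can tell "drop" apart from
"reject".  The mapping below is purely an assumption for discussion.  The
enum and function names are hypothetical, and nothing in the kernel
defines this mapping for a syncookie hook; only the errno values come
from <errno.h>.

```c
#include <errno.h>

/* Hypothetical actions the cookie-check caller could take. */
enum ack_action { ACK_CONTINUE, ACK_DROP, ACK_RST };

/*
 * Assumed mapping from the value a cgroup prog left via bpf_set_retval()
 * to an action: 0 continues processing, -ENOMEM drops the ACK without
 * sending RST, and any other error (e.g. -EPERM) results in RST.
 */
static enum ack_action ack_action_from_retval(int retval)
{
	if (retval == 0)
		return ACK_CONTINUE;
	if (retval == -ENOMEM)
		return ACK_DROP;
	return ACK_RST;
}
```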


> 
> If we allocate ireq first, it would be cleaner as bpf need not care
> about the drop path.
> 
>    1. CG_OK & mss is set -> skip validation
>    2. CG_OK & no mss set -> default validation
>    3. CG_ERR             -> RST

Even if it uses the mss set/not-set like above to decide drop/rst, does it
really need to pre-allocate ireq?  Looking at the test, the bpf prog is not
using the skops->sk either.

It would be nice to allow bpf prog to check the cookie first before creating
ireq.  The kernel also checks the cookie first before tcp_parse_option and ireq
creation.  Besides, I suspect the multiple "if ([!]bpf_cookie)" checks in
cookie_v[46]_check() are due to the pre-alloc ireq requirement.

What does it take to create an ireq? sk, skb, tcp_opt, and mss? Potentially, it 
could have a "bpf_skops_parse_tcp_options(struct bpf_sock_ops_kern *skops, 
struct tcp_options_received *opt_rx, u32 opt_rx__sz)" to initialize the tcp_opt. 
I think the bpf prog should be able to parse the tcp options by itself also and 
directly initialize the tcp_opt.

The "bpf_skops_alloc_tcp_req(struct bpf_sock_ops_kern *skops, struct 
tcp_options_received *opt_rx, u32 opt_rx__size, int mss,...)" could directly 
save the "ireq" in skops->ireq (new member). If skops->ireq is available, the 
kernel could then skip most of the ireq initialization and directly continue the 
remaining processing (e.g. directly to security_inet_conn_request()?).  Would
that work?



* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-20 19:59                   ` Martin KaFai Lau
@ 2023-10-20 23:10                     ` Kuniyuki Iwashima
  2023-10-21  6:48                       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-20 23:10 UTC (permalink / raw)
  To: martin.lau
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu, mykolal,
	netdev, pabeni, sdf, sinquersw, song, yonghong.song

From: Martin KaFai Lau <martin.lau@linux.dev>
Date: Fri, 20 Oct 2023 12:59:00 -0700
> On 10/19/23 11:01 AM, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <martin.lau@linux.dev>
> > Date: Thu, 19 Oct 2023 00:25:00 -0700
> >> On 10/18/23 3:31 PM, Kuniyuki Iwashima wrote:
> >>> From: Kui-Feng Lee <sinquersw@gmail.com>
> >>> Date: Wed, 18 Oct 2023 14:47:43 -0700
> >>>> On 10/18/23 10:20, Kuniyuki Iwashima wrote:
> >>>>> From: Eric Dumazet <edumazet@google.com>
> >>>>> Date: Wed, 18 Oct 2023 10:02:51 +0200
> >>>>>> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>>>>>>
> >>>>>>> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
> >>>>>>>> From: Martin KaFai Lau <martin.lau@linux.dev>
> >>>>>>>> Date: Mon, 16 Oct 2023 22:53:15 -0700
> >>>>>>>>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> >>>>>>>>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> >>>>>>>>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> >>>>>>>>>> server.  Our kernel module works at Netfilter input/output hooks and first
> >>>>>>>>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> >>>>>>>>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
> >>>>>>>>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> >>>>>>>>>> complete 3WHS with the original ACK as is.
> >>>>>>>>>
> >>>>>>>>> Does the current kernel module also use the timestamp bits differently?
> >>>>>>>>> (something like patch 8 and patch 10 trying to do)
> >>>>>>>>
> >>>>>>>> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
> >>>>>>>> if TS is in SYN.
> >>>>>>>>
> >>>>>>>> But I thought someone would suggest making TS available so that we can
> >>>>>>>> mock the default behaviour at least, and it would be more acceptable.
> >>>>>>>>
> >>>>>>>> The selftest uses TS just to strengthen security by validating 32-bits
> >>>>>>>> hash.  Dropping a part of hash makes collision easier to happen, but
> >>>>>>>> 24-bits were sufficient for us to reduce SYN flood to the manageable
> >>>>>>>> level at the backend.
> >>>>>>>
> >>>>>>> While enabling bpf to customize the syncookie (and timestamp), I want to explore
> >>>>>>> where can this also be done other than at the tcp layer.
> >>>>>>>
> >>>>>>> Have you thought about directly sending the SYNACK back at a lower layer like
> >>>>>>> tc/xdp after receiving the SYN?
> >>>>>
> >>>>> Yes.  Actually, at netconf I mentioned the cookie generation hook will not
> >>>>> be necessary and should be replaced with XDP.
> >>
> >> Right, it is also what I have been thinking when seeing the
> >> BPF_SOCK_OPS_GEN_SYNCOOKIE_CB carrying the bpf generated timestamp to the
> >> tcp_make_synack. It feels like trying hard to work with the tcp want_cookie
> >> logic while there is an existing better alternative in tc/xdp to deal with synflood.
> >>
> >>>>>
> >>>>>
> >>>>>>> There are already bpf_tcp_{gen,check}_syncookie
> >>>>>>> helper that allows to do this for the performance reason to absorb synflood. It
> >>>>>>> will be natural to extend it to handle the customized syncookie also.
> >>>>>
> >>>>> Maybe we even need not extend it and can use XDP as said below.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> I think it should already be doable to send a SYNACK back with customized
> >>>>>>> syncookie (and timestamp) at tc/xdp today.
> >>>>>>>
> >>>>>>> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
> >>>>>>> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
> >>>>>>> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
> >>>>>>> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
> >>>>>>> e.g. acquire/release/ref-tracking...etc.
> >>>>>>>
> >>>>>>
> >>>>>> I think I mostly agree with this.
> >>>>>
> >>>>> I didn't come up with kfunc to create ireq and queue it to listener, so
> >>>>> cookie_v[46]_check() were best place for me to extend easily, but now it
> >>>>> sounds like kfunc would be the way to go.
> >>>>>
> >>>>> Maybe we can move the core part of cookie_v[46]_check() except for kernel
> >>>>> cookie's validation to __cookie_v[46]_check() and expose a wrapper of it
> >>>>> as kfunc ?
> >>>>>
> >>>>> Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
> >>>>> etc) to the kfunc.  (It could still introduce some conflicts with Eric's
> >>>>> patch though...)
> >>>>
> >>>> Does that mean the packets handled in this way (in XDP) will skip all
> >>>> netfilter at all?
> >>>
> >>> Good point.
> >>>
> >>> If we want not to skip other layers, maybe we can use tc ?
> >>>
> >>> 1) allocate ireq and set sack_ok etc with kfunc
> >>> 2) bpf_sk_assign() to set ireq to skb (this could be done in kfunc above)
> >>> 3) let inet_steal_sock() return req->sk_listener if not sk_fullsock(sk)
> >>> 4) if skb->sk is reqsk in cookie_v[46]_check(), skip validation and
> >>>      req allocation and create full sk
> >>
> >> Haven't looked at the details. The above feels reasonable and would be nice if
> >> it works out. don't know if the skb at tc can be used in cookie_v[46]_check() as
> >> is. It probably needs more thoughts.  [ note, xdp does not have skb. ]
> >>
> >> Regarding the "allocate ireq and set sack_ok etc with kfunc", do you think it
> >> will be useful (and potentially cleaner) even for the
> >> BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB if it needed to go back to consider skops? Then
> >> only do the BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB and the xdp/tc can generate SYNACK.
> >> The xdp/tc can still do the check and drop the bad ACK earlier in the stack.
> > 
> > kfunc would be useful if we want to fall back to the default
> > validation, but I think we should not allocate ireq in kfunc.
> > 
> > The SOCK_OPS prog only returns a binary value.  If we decide whether
> > we skip validation or not based on kfunc call (ireq allocation), the
> > flow would be like :
> > 
> >    1. CG_OK & ireq is allocated -> skip validation and req allocation
> >    2. CG_OK & no ireq           -> default validation
> >    3. CG_ERR                    -> RST
> > 
> > The problem here is that if kfunc fails with -ENOMEM and cookie
> > is valid, we need a way to tell the kernel to drop the ACK instead
> > of sending RST.  (I hope the prog could return CG_DROP...)
> 
> bpf_set_retval() helper allows the cgrp bpf prog to return -ENOMEM. Take a look 
> at how __cgroup_bpf_run_filter_getsockopt is using the return value of 
> bpf_prog_run_array_cg() and an example in progs/cgroup_getset_retval_getsockopt.c.

Oh, this is nice, I assumed -EPERM was always returned.


> > If we allocate ireq first, it would be cleaner as bpf need not care
> > about the drop path.
> > 
> >    1. CG_OK & mss is set -> skip validation
> >    2. CG_OK & no mss set -> default validation
> >    3. CG_ERR             -> RST
> 
> Even if it uses the mss set/not-set like above to decide drop/rst. Does it 
> really need to pre-allocate ireq? Looking at the test, the bpf prog is not using 
> the skops->sk either.

It uses skops->remote_ip4 etc., which may be another reason why
I chose pre-alloc, but yes, it's not needed.  The same value can
be extracted from skb with bpf_skb_load_bytes_relative(BPF_HDR_START_NET).
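
For illustration, the bytes a prog would read relative to the network
header can be modelled in userspace.  The sketch below assumes a plain
IPv4 header; ipv4_saddr() and IPV4_SADDR_OFF are hypothetical names
chosen here, and a real BPF prog would fetch the same 4 bytes with
bpf_skb_load_bytes_relative() rather than memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the source address in an IPv4 header (RFC 791). */
#define IPV4_SADDR_OFF 12

/*
 * Copy the 4-byte source address (network byte order) out of a raw
 * IPv4 header.  Returns 0 on success, -1 if the buffer is shorter than
 * a minimal header or the version nibble is not 4.
 */
static int ipv4_saddr(const uint8_t *nh, size_t len, uint8_t saddr[4])
{
	if (len < 20 || (nh[0] >> 4) != 4)
		return -1;
	memcpy(saddr, nh + IPV4_SADDR_OFF, 4);
	return 0;
}
```

For an incoming ACK, the source address is the peer's address, i.e. the
value that skops->remote_ip4 exposes.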


> It would be nice to allow bpf prog to check the cookie first before creating 
> ireq. The kernel also checks the cookie first before tcp_parse_option and ireq 
> creation. Beside, I suspect the multiple "if ([!]bpf_cookie)" checks in 
> cookie_v[46]_check() is due to the pre-alloc ireq requirement.
> 
> What does it take to create an ireq? sk, skb, tcp_opt, and mss? Potentially, it 
> could have a "bpf_skops_parse_tcp_options(struct bpf_sock_ops_kern *skops, 
> struct tcp_options_received *opt_rx, u32 opt_rx__sz)" to initialize the tcp_opt. 
> I think the bpf prog should be able to parse the tcp options by itself also and 
> directly initialize the tcp_opt.

Yes, and the prog will not need to parse all the options unless
the validation algorithm needs to, because SACK_PERMITTED, WSCALE,
MSS (and ECN bits) are only available in SYN.

So, the prog will just need to parse the timestamps option with
bpf_load_hdr_opt() and can initialise tcp_opt based on ISN
(and/or TS).
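
What parsing the timestamps option amounts to can be sketched in plain C.
This is a userspace model, not BPF code: find_tcp_ts() and get_be32() are
hypothetical helpers that walk the option bytes following the fixed TCP
header, whereas a real prog would ask bpf_load_hdr_opt() for option
kind 8.  Option kinds and lengths follow RFC 9293 and RFC 7323.

```c
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL        0
#define TCPOPT_NOP        1
#define TCPOPT_TIMESTAMP  8
#define TCPOLEN_TIMESTAMP 10

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Walk the TCP options and extract TSval/TSecr.
 * Returns 0 if the option is found, -1 if it is absent or malformed.
 */
static int find_tcp_ts(const uint8_t *opt, size_t len,
		       uint32_t *tsval, uint32_t *tsecr)
{
	size_t i = 0;

	while (i < len) {
		uint8_t kind = opt[i];

		if (kind == TCPOPT_EOL)
			break;
		if (kind == TCPOPT_NOP) {
			i++;
			continue;
		}
		/* Every other option carries a length byte. */
		if (i + 1 >= len || opt[i + 1] < 2 || i + opt[i + 1] > len)
			return -1;
		if (kind == TCPOPT_TIMESTAMP) {
			if (opt[i + 1] != TCPOLEN_TIMESTAMP)
				return -1;
			*tsval = get_be32(opt + i + 2);
			*tsecr = get_be32(opt + i + 6);
			return 0;
		}
		i += opt[i + 1];
	}
	return -1;
}
```

In the final ACK of 3WHS, TSecr echoes the TSval that was sent in
SYN+ACK, so that is the field a validation algorithm using TS would
inspect.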


> The "bpf_skops_alloc_tcp_req(struct bpf_sock_ops_kern *skops, struct 
> tcp_options_received *opt_rx, u32 opt_rx__size, int mss,...)" could directly 
> save the "ireq" in skops->ireq (new member). If skops->ireq is available, the 
> kernel could then skip most of the ireq initialization and directly continue the 
> remaining processing (e.g. directly to security_inet_conn_request() ?). would 
> that work?

Yes, that will work.


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-20 23:10                     ` Kuniyuki Iwashima
@ 2023-10-21  6:48                       ` Kuniyuki Iwashima
  2023-10-23 21:35                         ` Martin KaFai Lau
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-21  6:48 UTC (permalink / raw)
  To: kuniyu
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, martin.lau,
	mykolal, netdev, pabeni, sdf, sinquersw, song, yonghong.song

From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Fri, 20 Oct 2023 16:10:03 -0700
> From: Martin KaFai Lau <martin.lau@linux.dev>
> Date: Fri, 20 Oct 2023 12:59:00 -0700
> > On 10/19/23 11:01 AM, Kuniyuki Iwashima wrote:
> > > From: Martin KaFai Lau <martin.lau@linux.dev>
> > > Date: Thu, 19 Oct 2023 00:25:00 -0700
> > >> On 10/18/23 3:31 PM, Kuniyuki Iwashima wrote:
> > >>> From: Kui-Feng Lee <sinquersw@gmail.com>
> > >>> Date: Wed, 18 Oct 2023 14:47:43 -0700
> > >>>> On 10/18/23 10:20, Kuniyuki Iwashima wrote:
> > >>>>> From: Eric Dumazet <edumazet@google.com>
> > >>>>> Date: Wed, 18 Oct 2023 10:02:51 +0200
> > >>>>>> On Wed, Oct 18, 2023 at 8:19 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >>>>>>>
> > >>>>>>> On 10/17/23 9:48 AM, Kuniyuki Iwashima wrote:
> > >>>>>>>> From: Martin KaFai Lau <martin.lau@linux.dev>
> > >>>>>>>> Date: Mon, 16 Oct 2023 22:53:15 -0700
> > >>>>>>>>> On 10/13/23 3:04 PM, Kuniyuki Iwashima wrote:
> > >>>>>>>>>> Under SYN Flood, the TCP stack generates SYN Cookie to remain stateless
> > >>>>>>>>>> After 3WHS, the proxy restores SYN and forwards it and ACK to the backend
> > >>>>>>>>>> server.  Our kernel module works at Netfilter input/output hooks and first
> > >>>>>>>>>> feeds SYN to the TCP stack to initiate 3WHS.  When the module is triggered
> > >>>>>>>>>> for SYN+ACK, it looks up the corresponding request socket and overwrites
> > >>>>>>>>>> tcp_rsk(req)->snt_isn with the proxy's cookie.  Then, the module can
> > >>>>>>>>>> complete 3WHS with the original ACK as is.
> > >>>>>>>>>
> > >>>>>>>>> Does the current kernel module also use the timestamp bits differently?
> > >>>>>>>>> (something like patch 8 and patch 10 trying to do)
> > >>>>>>>>
> > >>>>>>>> Our SYN Proxy uses TS as is.  The proxy nodes generate a random number
> > >>>>>>>> if TS is in SYN.
> > >>>>>>>>
> > >>>>>>>> But I thought someone would suggest making TS available so that we can
> > >>>>>>>> mock the default behaviour at least, and it would be more acceptable.
> > >>>>>>>>
> > >>>>>>>> The selftest uses TS just to strengthen security by validating 32-bits
> > >>>>>>>> hash.  Dropping a part of hash makes collision easier to happen, but
> > >>>>>>>> 24-bits were sufficient for us to reduce SYN flood to the manageable
> > >>>>>>>> level at the backend.
> > >>>>>>>
> > >>>>>>> While enabling bpf to customize the syncookie (and timestamp), I want to explore
> > >>>>>>> where can this also be done other than at the tcp layer.
> > >>>>>>>
> > >>>>>>> Have you thought about directly sending the SYNACK back at a lower layer like
> > >>>>>>> tc/xdp after receiving the SYN?
> > >>>>>
> > >>>>> Yes.  Actually, at netconf I mentioned the cookie generation hook will not
> > >>>>> be necessary and should be replaced with XDP.
> > >>
> > >> Right, it is also what I have been thinking when seeing the
> > >> BPF_SOCK_OPS_GEN_SYNCOOKIE_CB carrying the bpf generated timestamp to the
> > >> tcp_make_synack. It feels like trying hard to work with the tcp want_cookie
> > >> logic while there is an existing better alternative in tc/xdp to deal with synflood.
> > >>
> > >>>>>
> > >>>>>
> > >>>>>>> There are already bpf_tcp_{gen,check}_syncookie
> > >>>>>>> helper that allows to do this for the performance reason to absorb synflood. It
> > >>>>>>> will be natural to extend it to handle the customized syncookie also.
> > >>>>>
> > >>>>> Maybe we even need not extend it and can use XDP as said below.
> > >>>>>
> > >>>>>
> > >>>>>>>
> > >>>>>>> I think it should already be doable to send a SYNACK back with customized
> > >>>>>>> syncookie (and timestamp) at tc/xdp today.
> > >>>>>>>
> > >>>>>>> When ack is received, the prog@tc/xdp can verify the cookie. It will probably
> > >>>>>>> need some new kfuncs to create the ireq and queue the child socket. The bpf prog
> > >>>>>>> can change the ireq->{snd_wscale, sack_ok...} if needed. The details of the
> > >>>>>>> kfuncs need some more thoughts. I think most of the bpf-side infra is ready,
> > >>>>>>> e.g. acquire/release/ref-tracking...etc.
> > >>>>>>>
> > >>>>>>
> > >>>>>> I think I mostly agree with this.
> > >>>>>
> > >>>>> I didn't come up with kfunc to create ireq and queue it to listener, so
> > >>>>> cookie_v[46]_check() were best place for me to extend easily, but now it
> > >>>>> sounds like kfunc would be the way to go.
> > >>>>>
> > >>>>> Maybe we can move the core part of cookie_v[46]_check() except for kernel
> > >>>>> cookie's validation to __cookie_v[46]_check() and expose a wrapper of it
> > >>>>> as kfunc ?
> > >>>>>
> > >>>>> Then, we can look up sk and pass the listener, skb, and flags (for sack_ok,
> > >>>>> etc) to the kfunc.  (It could still introduce some conflicts with Eric's
> > >>>>> patch though...)
> > >>>>
> > >>>> Does that mean the packets handled in this way (in XDP) will skip all
> > >>>> netfilter at all?
> > >>>
> > >>> Good point.
> > >>>
> > >>> If we want not to skip other layers, maybe we can use tc ?
> > >>>
> > >>> 1) allocate ireq and set sack_ok etc with kfunc
> > >>> 2) bpf_sk_assign() to set ireq to skb (this could be done in kfunc above)
> > >>> 3) let inet_steal_sock() return req->sk_listener if not sk_fullsock(sk)
> > >>> 4) if skb->sk is reqsk in cookie_v[46]_check(), skip validation and
> > >>>      req allocation and create full sk

I think this was doable.  With the diff below, I was able to skip
validation in cookie_v[46]_check() when skb->sk is not NULL.

The kfunc allocates req and sets req->syncookie to 1, which is usually
set in TX path, so if it's 1 in RX (inet_steal_sock()), we can see
that req is allocated by kfunc (at least, req->syncookie &&
req->rsk_listener is never true in the current TCP stack).
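
To make the marker check concrete, here is a minimal userspace sketch of
the decision the modified inet_steal_sock() makes.  It is not kernel code:
struct sock_model, struct skb_model and steal_sock_model() are simplified
stand-ins invented for illustration, and only the fields the check needs
are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for kernel types; only the fields used here. */
enum sk_state { ST_LISTEN, ST_NEW_SYN_RECV, ST_ESTABLISHED };

struct sock_model {
	enum sk_state state;
	bool fullsock;               /* false for request socks */
	bool syncookie;              /* models req->syncookie */
	struct sock_model *listener; /* models req->rsk_listener */
};

struct skb_model {
	struct sock_model *sk;       /* models skb->sk */
};

/*
 * Return the socket the receive path should process.  If the prefetched
 * socket is a request sock marked with syncookie, leave it attached to
 * the skb and hand back its listener instead.
 */
static struct sock_model *steal_sock_model(struct skb_model *skb)
{
	struct sock_model *sk = skb->sk;

	skb->sk = NULL;              /* the socket is "stolen" from the skb */
	if (!sk)
		return NULL;

	if (!sk->fullsock) {
		if (sk->state == ST_NEW_SYN_RECV && sk->syncookie) {
			assert(sk->listener != NULL);
			skb->sk = sk;        /* keep req for the cookie check */
			sk = sk->listener;
		}
		return sk;
	}

	return sk;
}
```

A request sock without the syncookie mark is returned unchanged, which
matches the pre-patch behaviour for !sk_fullsock(sk).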

The difference here is that req allocated by kfunc holds refcnt of
rsk_listener (passing true to inet_reqsk_alloc()) to prevent freeing
the listener until req reaches cookie_v[46]_check().

The cookie generation at least should be done at tc/xdp.  The
validation can be done earlier as well on tc/xdp, but it could
add another complexity, the listener's life cycle, if we allocate
req there.

I'm wondering which place to add the validation capability, and
I think SOCK_OPS is simpler than tc.

  #1 validate cookie and allocate req at tc, and skip validation

  #2 validate cookie (and update bpf map at xdp/tc, and look up bpf
     map) and allocate req at SOCK_OPS hook

Given SYN proxy is usually on the other node and incoming cookie
is almost always valid, we might not need to validate it at an
early stage in the stack.

What do you think ?

---8<---
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 3ecfeadbfa06..e5e4627bf270 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -462,9 +462,19 @@ struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
 	if (!sk)
 		return NULL;
 
-	if (!prefetched || !sk_fullsock(sk))
+	if (!prefetched)
 		return sk;
 
+	if (!sk_fullsock(sk)) {
+		if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
+			skb->sk = sk;
+			skb->destructor = sock_pfree;
+			sk = inet_reqsk(sk)->rsk_listener;
+		}
+
+		return sk;
+	}
+
 	if (sk->sk_protocol == IPPROTO_TCP) {
 		if (sk->sk_state != TCP_LISTEN)
 			return sk;
diff --git a/net/core/filter.c b/net/core/filter.c
index cc2e4babc85f..bca491ddf42c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11800,6 +11800,71 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
 
 	return 0;
 }
+
+__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
+					struct tcp_options_received *tcp_opt,
+					int tcp_opt__sz, u16 mss)
+{
+	const struct tcp_request_sock_ops *af_ops;
+	const struct request_sock_ops *ops;
+	struct inet_request_sock *ireq;
+	struct tcp_request_sock *treq;
+	struct request_sock *req;
+
+	if (!sk)
+		return -EINVAL;
+
+	if (!skb_at_tc_ingress(skb))
+		return -EINVAL;
+
+	if (dev_net(skb->dev) != sock_net(sk))
+		return -ENETUNREACH;
+
+	switch (sk->sk_family) {
+	case AF_INET:  /* TODO: MPTCP */
+		ops = &tcp_request_sock_ops;
+		af_ops = &tcp_request_sock_ipv4_ops;
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		ops = &tcp6_request_sock_ops;
+		af_ops = &tcp_request_sock_ipv6_ops;
+		break;
+#endif
+	default:
+		return -EINVAL;
+	}
+
+	if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN)
+		return -EINVAL;
+
+	req = inet_reqsk_alloc(ops, sk, true);
+	if (!req)
+		return -ENOMEM;
+
+	ireq = inet_rsk(req);
+	treq = tcp_rsk(req);
+
+	refcount_set(&req->rsk_refcnt, 1);
+	req->syncookie = 1;
+	req->mss = mss;
+	req->ts_recent = tcp_opt->saw_tstamp ? tcp_opt->rcv_tsval : 0;
+
+	ireq->snd_wscale = tcp_opt->snd_wscale;
+	ireq->sack_ok = tcp_opt->sack_ok;
+	ireq->wscale_ok = tcp_opt->wscale_ok;
+	ireq->tstamp_ok	= tcp_opt->saw_tstamp;
+
+	tcp_rsk(req)->af_specific = af_ops;
+	tcp_rsk(req)->ts_off = tcp_opt->rcv_tsecr - tcp_ns_to_ts(tcp_clock_ns());
+
+	skb_orphan(skb);
+	skb->sk = req_to_sk(req);
+	skb->destructor = sock_pfree;
+
+	return 0;
+}
+
 __diag_pop();
 
 int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
@@ -11828,6 +11893,10 @@ BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
 BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
 BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
 
+BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
+BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
+BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -11843,6 +11912,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
 	.set = &bpf_kfunc_check_set_sock_addr,
 };
 
+static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_tcp_reqsk,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -11858,8 +11932,10 @@ static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
-	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
-						&bpf_kfunc_set_sock_addr);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+					       &bpf_kfunc_set_sock_addr);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+	return ret;
 }
 late_initcall(bpf_kfunc_init);
 
---8<---
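
In the diff above, ts_off is computed as rcv_tsecr minus the current
timestamp clock.  The sketch below shows the unsigned 32-bit arithmetic
behind that line in plain userspace C.  It assumes millisecond timestamp
ticks, and all names ending in _model are hypothetical stand-ins, not
kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_MODEL 1000000000ULL
#define TS_HZ_MODEL        1000ULL  /* assumption: 1 ms timestamp ticks */

/* Convert a nanosecond clock to timestamp ticks, truncated to 32 bits. */
static uint32_t ns_to_ts_model(uint64_t ns)
{
	return (uint32_t)(ns / (NSEC_PER_SEC_MODEL / TS_HZ_MODEL));
}

/* Offset such that (clock ticks + offset) equals the echoed timestamp. */
static uint32_t ts_off_model(uint32_t rcv_tsecr, uint64_t now_ns)
{
	return rcv_tsecr - ns_to_ts_model(now_ns);
}

/* Timestamp value that would be sent at time now_ns with this offset. */
static uint32_t tsval_model(uint32_t ts_off, uint64_t now_ns)
{
	return ns_to_ts_model(now_ns) + ts_off;
}

/*
 * Self-check: at the moment the ACK arrives the next TSval equals the
 * echoed TSecr, and 5 ms later it is 5 ticks ahead, even though the
 * subtraction wrapped around.
 */
static void ts_off_demo(void)
{
	uint64_t now = 123456789000ULL; /* arbitrary clock reading in ns */
	uint32_t tsecr = 42;            /* smaller than the clock ticks */
	uint32_t off = ts_off_model(tsecr, now);

	assert(tsval_model(off, now) == tsecr);
	assert(tsval_model(off, now + 5000000ULL) == tsecr + 5);
}
```

The point is only that the wraparound is harmless: the offset is applied
modulo 2^32, so the timestamp sequence continues from the value the peer
echoed.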


> > >>
> > >> Haven't looked at the details. The above feels reasonable and would be nice if
> > >> it works out. don't know if the skb at tc can be used in cookie_v[46]_check() as
> > >> is. It probably needs more thoughts.  [ note, xdp does not have skb. ]
> > >>
> > >> Regarding the "allocate ireq and set sack_ok etc with kfunc", do you think it
> > >> will be useful (and potentially cleaner) even for the
> > >> BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB if it needed to go back to consider skops? Then
> > >> only do the BPF_SOCK_OPS_CHECK_SYNCOOKIE_CB and the xdp/tc can generate SYNACK.
> > >> The xdp/tc can still do the check and drop the bad ACK earlier in the stack.
> > > 
> > > kfunc would be useful if we want to fall back to the default
> > > validation, but I think we should not allocate ireq in kfunc.
> > > 
> > > The SOCK_OPS prog only returns a binary value.  If we decide whether
> > > we skip validation or not based on kfunc call (ireq allocation), the
> > > flow would be like :
> > > 
> > >    1. CG_OK & ireq is allocated -> skip validation and req allocation
> > >    2. CG_OK & no ireq           -> default validation
> > >    3. CG_ERR                    -> RST
> > > 
> > > The problem here is that if kfunc fails with -ENOMEM and cookie
> > > is valid, we need a way to tell the kernel to drop the ACK instead
> > > of sending RST.  (I hope the prog could return CG_DROP...)
> > 
> > bpf_set_retval() helper allows the cgrp bpf prog to return -ENOMEM. Take a look 
> > at how __cgroup_bpf_run_filter_getsockopt is using the return value of 
> > bpf_prog_run_array_cg() and an example in progs/cgroup_getset_retval_getsockopt.c.
> 
> Oh, this is nice, I assumed -EPERM was always returned.
> 
> 
> > > If we allocate ireq first, it would be cleaner as bpf need not care
> > > about the drop path.
> > > 
> > >    1. CG_OK & mss is set -> skip validation
> > >    2. CG_OK & no mss set -> default validation
> > >    3. CG_ERR             -> RST
> > 
> > Even if it uses the mss set/not-set like above to decide drop/rst. Does it 
> > really need to pre-allocate ireq? Looking at the test, the bpf prog is not using 
> > the skops->sk either.
> 
> It uses skops->remote_ip4 etc, maybe this was another reason why
> I chose pre-alloc, but yes, it's not needed.  The same value can
> be extracted from skb with bpf_skb_load_bytes_relative(BPF_HDR_START_NET).
> 
> 
> > It would be nice to allow bpf prog to check the cookie first before creating 
> > ireq. The kernel also checks the cookie first before tcp_parse_option and ireq 
> > creation. Beside, I suspect the multiple "if ([!]bpf_cookie)" checks in 
> > cookie_v[46]_check() is due to the pre-alloc ireq requirement.
> > 
> > What does it take to create an ireq? sk, skb, tcp_opt, and mss? Potentially, it 
> > could have a "bpf_skops_parse_tcp_options(struct bpf_sock_ops_kern *skops, 
> > struct tcp_options_received *opt_rx, u32 opt_rx__sz)" to initialize the tcp_opt. 
> > I think the bpf prog should be able to parse the tcp options by itself also and 
> > directly initialize the tcp_opt.
> 
> Yes, also the prog will not need to parse all the options unless
> the validation algorithm needs to, because SACK_PERMITTED, WSCALE,
> MSS (and ECN bits) are only available in SYN.
> 
> So, the prog will just need to parse timestamps option with
> bpf_load_hdr_opt() and can initialise tcp_opt based on ISN
> (and/or TS).
> 
> 
> > The "bpf_skops_alloc_tcp_req(struct bpf_sock_ops_kern *skops, struct 
> > tcp_options_received *opt_rx, u32 opt_rx__size, int mss,...)" could directly 
> > save the "ireq" in skops->ireq (new member). If skops->ireq is available, the 
> > kernel could then skip most of the ireq initialization and directly continue the 
> > remaining processing (e.g. directly to security_inet_conn_request() ?). would 
> > that work?
> 
> Yes, that will work.


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-21  6:48                       ` Kuniyuki Iwashima
@ 2023-10-23 21:35                         ` Martin KaFai Lau
  2023-10-24  0:37                           ` Kui-Feng Lee
  0 siblings, 1 reply; 44+ messages in thread
From: Martin KaFai Lau @ 2023-10-23 21:35 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, mykolal, netdev,
	pabeni, sdf, sinquersw, song, yonghong.song

On 10/20/23 11:48 PM, Kuniyuki Iwashima wrote:
> I think this was doable.  With the diff below, I was able to skip
> validation in cookie_v[46]_check() when skb->sk is not NULL.
> 
> The kfunc allocates req and sets req->syncookie to 1, which is usually
> set in TX path, so if it's 1 in RX (inet_steal_sock()), we can see
> that req is allocated by kfunc (at least, req->syncookie &&
> req->rsk_listener is never true in the current TCP stack).
> 
> The difference here is that req allocated by kfunc holds refcnt of
> rsk_listener (passing true to inet_reqsk_alloc()) to prevent freeing
> the listener until req reaches cookie_v[46]_check().

The cookie_v[46]_check() holds the listener sk refcnt now?

> 
> The cookie generation at least should be done at tc/xdp.  The
> validation can be done earlier as well on tc/xdp, but it could
> add another complexity, the listener's life cycle, if we allocate
> req there.

I think your code below looks pretty close already.

It seems the only concern/complexity is the extra rsk_listener refcnt (btw, is
the concern about the performance cost of the extra refcnt, or is there a
correctness issue?).

Asking because bpf_sk_assign() can already assign a listener to skb->sk, and it
does not take a refcnt on the listener.  Doing the same for req->rsk_listener,
with no refcnt taken, should be doable also.  sock_pfree() may need to be
smarter and check req->syncookie.  What else may need to change?

> 
> I'm wondering which place to add the validation capability, and
> I think SOCK_OPS is simpler than tc.
> 
>    #1 validate cookie and allocate req at tc, and skip validation
> 
>    #2 validate cookie (and update bpf map at xdp/tc, and look up bpf
>       map) and allocate req at SOCK_OPS hook
> 
> Given SYN proxy is usually on the other node and incoming cookie
> is almost always valid, we might not need to validate it at an
> early stage in the stack.
> 
> What do you think ?

Yeah, supporting validation in sock_ops is an open option if the tc side is too 
hard but I feel you are pretty close on the tc side.




* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-23 21:35                         ` Martin KaFai Lau
@ 2023-10-24  0:37                           ` Kui-Feng Lee
  2023-10-24  1:22                             ` Kuniyuki Iwashima
  0 siblings, 1 reply; 44+ messages in thread
From: Kui-Feng Lee @ 2023-10-24  0:37 UTC (permalink / raw)
  To: Martin KaFai Lau, Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, mykolal, netdev,
	pabeni, sdf, song, yonghong.song



On 10/23/23 14:35, Martin KaFai Lau wrote:
> On 10/20/23 11:48 PM, Kuniyuki Iwashima wrote:
>> I think this was doable.  With the diff below, I was able to skip
>> validation in cookie_v[46]_check() when skb->sk is not NULL.
>>
>> The kfunc allocates req and sets req->syncookie to 1, which is usually
>> set in TX path, so if it's 1 in RX (inet_steal_sock()), we can see
>> that req is allocated by kfunc (at least, req->syncookie &&
>> req->rsk_listener is never true in the current TCP stack).
>>
>> The difference here is that req allocated by kfunc holds refcnt of
>> rsk_listener (passing true to inet_reqsk_alloc()) to prevent freeing
>> the listener until req reaches cookie_v[46]_check().
> 
> The cookie_v[46]_check() holds the listener sk refcnt now?


The caller of cookie_v[46]_check() should hold a refcnt of the listener.
If the listener is destroyed, the callers of cookie_v[46]_check() should
fail to look up a sock for the skb. However, in this case, the kfunc sets
a sock to skb->sk, and the lookup function
(__inet_lookup_skb()) steals the sock from the skb. So, there is no
guarantee that the listener is still alive.

One solution is to let the stealing function look up the listener if
inet_reqsk(skb->sk)->syncookie is true.


* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-24  0:37                           ` Kui-Feng Lee
@ 2023-10-24  1:22                             ` Kuniyuki Iwashima
  2023-10-24 17:55                               ` Kui-Feng Lee
  0 siblings, 1 reply; 44+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-24  1:22 UTC (permalink / raw)
  To: sinquersw
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, kuniyu,
	martin.lau, mykolal, netdev, pabeni, sdf, song, yonghong.song

> On 10/23/23 14:35, Martin KaFai Lau wrote:
> > On 10/20/23 11:48 PM, Kuniyuki Iwashima wrote:
> > > I think this was doable.  With the diff below, I was able to skip
> > > validation in cookie_v[46]_check() when skb->sk is not NULL.
> > > 
> > > The kfunc allocates req and sets req->syncookie to 1, which is usually
> > > set in TX path, so if it's 1 in RX (inet_steal_sock()), we can see
> > > that req is allocated by kfunc (at least, req->syncookie &&
> > > req->rsk_listener is never true in the current TCP stack).
> > > 
> > > The difference here is that req allocated by kfunc holds refcnt of
> > > rsk_listener (passing true to inet_reqsk_alloc()) to prevent freeing
> > > the listener until req reaches cookie_v[46]_check().
> > 
> > The cookie_v[46]_check() holds the listener sk refcnt now?
> 
> The caller of cookie_v[46]_check() should hold a refcnt of the listener.

No, it need not.

When we handle the default syn cookie, cookie_tcp_reqsk_alloc() passes
false to inet_reqsk_alloc(), then reqsk does not hold refcnt of the
listener.

If inet_csk_reqsk_queue_add() in tcp_get_cookie_sock() succeeds, we know
the listener is still alive.


> If the listener is destroyed, the callers of cookie_v[46]_check() should
> fail to look up a sock for the skb. However, in this case, the kfunc sets
> a sock to skb->sk, and the lookup function
> (__inet_lookup_skb()) steals the sock from the skb. So, there is no
> guarantee that the listener is still alive.
> 
> One solution is to let the stealing function look up the listener if
> inet_reqsk(skb->sk)->syncookie is true.

kfunc at least guarantees that the listener is not freed until req
is freed.  There are two cases where the listener could be close()d
after kfunc:

  1. close()d before lookup
     -> kfree_skb(skb) calls reqsk_put() and releases the last
        refcnt of the listener

  2. close()d between lookup and inet_csk_reqsk_queue_add()
     -> inet_csk_reqsk_queue_add() fails and __reqsk_free()
        releases the last refcnt of the listener.

So, we need not look up the listener again in inet_steal_sock().
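
The two cases above can be walked through with a small userspace model
of the reference counting.  This is only a sketch: listener_model,
req_model and the helpers are hypothetical names, and queue_add() models
just the "fails once the listener is closed" behaviour described above,
not the real inet_csk_reqsk_queue_add().

```c
#include <assert.h>
#include <stdbool.h>

struct listener_model {
	int refcnt;
	bool closed;  /* close() was called */
	bool freed;   /* last reference was dropped */
};

struct req_model {
	struct listener_model *listener;
	bool freed;
};

static void listener_put(struct listener_model *l)
{
	assert(l->refcnt > 0 && !l->freed);
	if (--l->refcnt == 0)
		l->freed = true;
}

/* kfunc-style allocation: the request takes a reference on the listener. */
static void req_alloc(struct req_model *req, struct listener_model *l)
{
	l->refcnt++;
	req->listener = l;
	req->freed = false;
}

/* Freeing the request drops the reference taken at allocation. */
static void req_free(struct req_model *req)
{
	req->freed = true;
	listener_put(req->listener);
}

/* close(): mark the listener closed and drop the owner's reference. */
static void listener_close(struct listener_model *l)
{
	l->closed = true;
	listener_put(l);
}

/* Queueing fails once the listener is closed; the request is then freed. */
static bool queue_add(struct req_model *req)
{
	assert(!req->listener->freed); /* still valid thanks to req's ref */
	if (req->listener->closed) {
		req_free(req);
		return false;
	}
	return true;
}
```

In both cases the listener memory stays valid for as long as the request
exists, and the final put happens when the request itself is released.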


> > 
> > >
> > > The cookie generation at least should be done at tc/xdp.  The
> > > validation can be done earlier as well on tc/xdp, but it could
> > > add another complexity, the listener's life cycle, if we allocate
> > > req there.
> > 
> > I think your code below looks pretty close already.
> > 
> > It seems the only concern/complexity is the extra rsk_listener refcnt (btw, is
> > the concern about the performance cost of the extra refcnt, or is there a
> > correctness issue?).

Yes, that's the only concern and I think it's all ok now.

[ I was seeing a weird refcnt warning, but I had missed that *refcounted
  was true in inet_steal_sock() for reqsk and forgot to flip it to false :S ]


> > 
> > Asking because bpf_sk_assign() can already assign a listener to skb->sk, and it
> > does not take a refcnt on the listener.  Doing the same for req->rsk_listener,
> > with no refcnt taken, should be doable also.  sock_pfree() may need to be
> > smarter and check req->syncookie.  What else may need to change?

I was wondering if we are in the same RCU period between tc and
cookie_v[46]_check(), but yeah, probably sock_pfree() can check
req->syncookie and set rsk_listener to NULL so that reqsk_put()
will not touch the listener.
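
A rough userspace sketch of that idea, assuming the request only borrows
the listener pointer and never takes a refcnt on it.  The _model names
are hypothetical, and reqsk_put_model() stands in for "release the
listener reference if a listener is recorded"; this is not the actual
sock_pfree() or reqsk_put() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct lsk_model {
	int refcnt;
};

struct reqsk_model {
	bool syncookie;                 /* models req->syncookie */
	struct lsk_model *rsk_listener; /* models req->rsk_listener */
};

/* Generic release: drops the listener reference if one is recorded. */
static void reqsk_put_model(struct reqsk_model *req)
{
	if (req->rsk_listener) {
		assert(req->rsk_listener->refcnt > 0);
		req->rsk_listener->refcnt--;
	}
}

/*
 * skb destructor: a syncookie request only borrowed the listener, so
 * forget the pointer before the generic release runs.
 */
static void sock_pfree_model(struct reqsk_model *req)
{
	if (req->syncookie)
		req->rsk_listener = NULL;
	reqsk_put_model(req);
}
```

With this shape the extra refcnt on the listener is avoided, at the cost
of relying on the skb staying within one RCU read-side section between
tc and the cookie check.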


> > 
> > > 
> > > I'm wondering which place to add the validation capability, and
> > > I think SOCK_OPS is simpler than tc.
> > > 
> > >    #1 validate cookie and allocate req at tc, and skip validation
> > > 
> > >    #2 validate cookie (and update bpf map at xdp/tc, and look up bpf
> > >       map) and allocate req at SOCK_OPS hook
> > > 
> > > Given SYN proxy is usually on the other node and incoming cookie
> > > is almost always valid, we might not need to validate it at an
> > > early stage in the stack.
> > > 
> > > What do you think ?
> > 
> > Yeah, supporting validation in sock_ops is an open option if the tc side is too 
> > hard but I feel you are pretty close on the tc side.

Now I think I can go v2 with tc.

Thanks for your guide!




* Re: [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks.
  2023-10-24  1:22                             ` Kuniyuki Iwashima
@ 2023-10-24 17:55                               ` Kui-Feng Lee
  0 siblings, 0 replies; 44+ messages in thread
From: Kui-Feng Lee @ 2023-10-24 17:55 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: andrii, ast, bpf, daniel, davem, dsahern, edumazet, haoluo,
	john.fastabend, jolsa, kpsingh, kuba, kuni1840, martin.lau,
	mykolal, netdev, pabeni, sdf, song, yonghong.song



On 10/23/23 18:22, Kuniyuki Iwashima wrote:
>> On 10/23/23 14:35, Martin KaFai Lau wrote:
>>> On 10/20/23 11:48 PM, Kuniyuki Iwashima wrote:
>>>> I think this was doable.  With the diff below, I was able to skip
>>>> validation in cookie_v[46]_check() when skb->sk is not NULL.
>>>>
>>>> The kfunc allocates req and sets req->syncookie to 1, which is usually
>>>> set in TX path, so if it's 1 in RX (inet_steal_sock()), we can see
>>>> that req is allocated by kfunc (at least, req->syncookie &&
>>>> req->rsk_listener is never true in the current TCP stack).
>>>>
>>>> The difference here is that req allocated by kfunc holds refcnt of
>>>> rsk_listener (passing true to inet_reqsk_alloc()) to prevent freeing
>>>> the listener until req reaches cookie_v[46]_check().
>>>
>>> The cookie_v[46]_check() holds the listener sk refcnt now?
>>
>> The caller of cookie_v[46]_check() should hold a refcnt of the listener.
> 
> No, it need not.
> 
> When we handle the default syn cookie, cookie_tcp_reqsk_alloc() passes
> false to inet_reqsk_alloc(), then reqsk does not hold refcnt of the
> listener.
> 
> If inet_csk_reqsk_queue_add() in tcp_get_cookie_sock() succeeds, we know
> the listener is still alive

What I was referring to is the callers of cookie_v[46]_check().
For example, tcp_v4_rcv() makes sure that the sk it passes to
tcp_v4_do_rcv() -> tcp_v4_cookie_check() exists.  tcp_v4_rcv() gets
the sk from __inet_lookup_skb().  The sk may or may not be refcounted.
If it is not refcounted, it should be RCU protected (SOCK_RCU_FREE).

AFAIK, tcp_v4_rcv() is called inside an rcu_read_lock() section
(taken as far up as ip_local_deliver_finish(), or even
netif_receive_skb_core()).

tcp_v4_rcv() and cookie_v4_check() also access the contents of sk
without increasing its refcount.  That also indicates that these
functions assume the sk returned by __inet_lookup_skb() is either
refcounted or protected in some way (by RCU here).

What I mean by protection is that the sk may be closed but not destroyed.
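
To make the "refcounted or RCU protected" point concrete, here is a
small userspace sketch.  It is only a toy model, not kernel code: the
toy_* names and struct fields are made up for illustration, and the
rcu_free flag merely stands in for SOCK_RCU_FREE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct sock; all names here are illustrative. */
struct toy_sock {
	int refcnt;     /* models sk_refcnt */
	bool rcu_free;  /* models SOCK_RCU_FREE: memory freed after a grace period */
	bool closed;    /* close() has run, but the memory is still valid */
};

/* Models the lookup: an RCU-freed listener is returned without taking a
 * reference; any other socket is returned with one. */
static struct toy_sock *toy_lookup(struct toy_sock *sk, bool *refcounted)
{
	if (!sk)
		return NULL;

	*refcounted = !sk->rcu_free;
	if (*refcounted)
		sk->refcnt++;

	return sk;
}

/* Models the end of the receive path: drop only what the lookup took. */
static void toy_release(struct toy_sock *sk, bool refcounted)
{
	if (refcounted) {
		assert(sk->refcnt > 0);
		sk->refcnt--;
	}
}

/* Models a cookie check that reads sk without touching its refcount.
 * A closed listener is still safe to read; it just yields no child. */
static bool toy_cookie_check(const struct toy_sock *sk)
{
	return !sk->closed;
}
```

In this model the listener's refcnt is the same before and after the
receive path, which is the "closed but not destroyed" guarantee the
callers rely on.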


> 
> 
>> If the listener is destroyed, the callers of cookie_v[46]_check() should
>> fail to lookup a sock for the skb. However, in this case, the kfunc sets
>> a sock to skb->sk, and the lookup function
>> (__inet_lookup_skb()) steals sock from skb. So, there is no guarantee
>> ensuring the listener is still alive.
>>
>> One solution is let the stealing function to lookup the listener if
>> inet_reqsk(skb->sk)->syncookie is true.
> 
> kfunc at least guarantees that the listener is not freed until req
> is freed.  There are two cases where the listener could be close()d
> after kfunc:
> 
>    1. close()d before lookup
>       -> kfree_skb(skb) calls reqsk_put() and releases the last
>          refcnt of the listener
> 
>    2. close()d between lookup and inet_csk_reqsk_queue_add()
>       -> inet_csk_reqsk_queue_add() fails and __reqsk_free()
>          releases the last refcnt of the listener.
> 
> So, we need not look up the listener again in inet_steal_sock().

After thinking about this again, increasing the refcount of the listener
in the kfunc is not necessary.  Since the caller of a BPF program should
already hold a refcount on the sk, or the sk should be RCU protected,
we can let inet_csk_reqsk_queue_add() handle it, just like what you
mentioned earlier.
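
A rough userspace sketch of the two options, in case it helps the
discussion.  Everything here is a toy model under my own assumptions:
the toy_* names are hypothetical, attach_listener mirrors the bool
passed to inet_reqsk_alloc(), and toy_queue_add() only imitates the
state check that makes inet_csk_reqsk_queue_add() fail after close().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_LISTEN, TOY_CLOSE };

struct toy_listener {
	int refcnt;
	enum toy_state state;
};

struct toy_req {
	struct toy_listener *listener; /* set only when a reference is held */
};

/* Models request allocation with an attach_listener flag. */
static bool toy_req_init(struct toy_req *req, struct toy_listener *lsk,
			 bool attach_listener)
{
	req->listener = NULL;

	if (attach_listener) {
		if (lsk->refcnt == 0)
			return false;	/* listener is already going away */
		lsk->refcnt++;
		req->listener = lsk;
	}

	return true;
}

/* Models freeing the request: drop the listener reference if one is held. */
static void toy_req_free(struct toy_req *req)
{
	if (req->listener) {
		assert(req->listener->refcnt > 0);
		req->listener->refcnt--;
		req->listener = NULL;
	}
}

/* Models the accept-queue add: it fails once the listener has left the
 * LISTEN state.  Simplification: the request is freed on both paths. */
static bool toy_queue_add(struct toy_listener *lsk, struct toy_req *req)
{
	bool queued = lsk->state == TOY_LISTEN;

	toy_req_free(req);
	return queued;
}
```

With attach_listener set to false the listener's refcnt is never
touched, and a close() between lookup and queue add is still caught by
the state check, provided the caller keeps the listener memory valid.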

WDYT?


> 
> 
>>>
>>>   >
>>>> The cookie generation at least should be done at tc/xdp.  The
>>>> validation can be done earlier as well on tc/xdp, but it could
>>>> add another complexity, listener's life cycle if we allocate
>>>> req there.
>>>
>>> I think your code below looks pretty close already.
>>>
>>> It seems the only concern/complexity is the extra rsk_listener refcnt (btw the
>>> concern is on performance for the extra refcnt? or there is correctness issue?).
> 
> Yes, that's the only concern and I think it's all ok now.
> 
> [ I was seeing a weird refcnt warning, but I missed *refcounted was true
>    in inet_steal_sock() for reqsk and forgot to flip it to false :S ]
> 
> 
>>>
>>> Asking because bpf_sk_assign() can already assign a listener to skb->sk and it
>>> also does not take a refcnt on the listener. The same no refcnt needed on
>>> req->rsk_listener should be doable also. sock_pfree may need to be smarter to
>>> check req->syncookie. What else may need to change?
> 
> I was wondering if we are in the same RCU period between tc and
> cookie_v[46]_check(), but yeah, probably sock_pfree() can check
> req->syncookie and set NULL to rsk_listener so that reqsk_put()
> will not touch the listener.
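
The destructor idea above could look roughly like the following
userspace sketch.  It is a toy model only: toy_pfree() and
toy_req_free() are hypothetical stand-ins for sock_pfree() and the
request free path, and it assumes a syncookie request merely borrows
the listener pointer without holding a reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_listener {
	int refcnt;
};

struct toy_req {
	struct toy_listener *listener; /* models req->rsk_listener */
	bool syncookie;                /* models req->syncookie */
	bool freed;
};

/* Models the normal request free path: a non-NULL listener pointer is
 * assumed to carry a reference, so it is dropped here. */
static void toy_req_free(struct toy_req *req)
{
	if (req->listener) {
		assert(req->listener->refcnt > 0);
		req->listener->refcnt--;
		req->listener = NULL;
	}
	req->freed = true;
}

/* Models an skb destructor: a syncookie request only borrowed the
 * listener, so forget the pointer before freeing the request. */
static void toy_pfree(struct toy_req *req)
{
	if (req->syncookie)
		req->listener = NULL;

	toy_req_free(req);
}
```

Clearing the pointer first keeps the free path from dropping a
reference that was never taken.
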
> 
> 
>>>
>>>>
>>>> I'm wondering which place to add the validation capability, and
>>>> I think SOCK_OPS is simpler than tc.
>>>>
>>>>     #1 validate cookie and allocate req at tc, and skip validation
>>>>
>>>>     #2 validate cookie (and update bpf map at xdp/tc, and look up bpf
>>>>        map) and allocate req at SOCK_OPS hook
>>>>
>>>> Given SYN proxy is usually on the other node and incoming cookie
>>>> is almost always valid, we might not need to validate it in the early
>>>> stage in the stack.
>>>>
>>>> What do you think ?
>>>
>>> Yeah, supporting validation in sock_ops is an open option if the tc side is too
>>> hard but I feel you are pretty close on the tc side.
> 
> Now I think I can go v2 with tc.
> 
> Thanks for your guide!
> 
> 
>>>
>>>>
>>>> ---8<---
>>>> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
>>>> index 3ecfeadbfa06..e5e4627bf270 100644
>>>> --- a/include/net/inet_hashtables.h
>>>> +++ b/include/net/inet_hashtables.h
>>>> @@ -462,9 +462,19 @@ struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
>>>>    	if (!sk)
>>>>    		return NULL;
>>>>    
>>>> -	if (!prefetched || !sk_fullsock(sk))
>>>> +	if (!prefetched)
>>>>    		return sk;
>>>>    
>>>> +	if (!sk_fullsock(sk)) {
>>>> +		if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
>>>> +			skb->sk = sk;
>>>> +			skb->destructor = sock_pfree;
>>>> +			sk = inet_reqsk(sk)->rsk_listener;
>>>> +		}
>>>> +
>>>> +		return sk;
>>>> +	}
>>>> +
>>>>    	if (sk->sk_protocol == IPPROTO_TCP) {
>>>>    		if (sk->sk_state != TCP_LISTEN)
>>>>    			return sk;
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index cc2e4babc85f..bca491ddf42c 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -11800,6 +11800,71 @@ __bpf_kfunc int bpf_sock_addr_set_sun_path(struct bpf_sock_addr_kern *sa_kern,
>>>>    
>>>>    	return 0;
>>>>    }
>>>> +
>>>> +__bpf_kfunc int bpf_sk_assign_tcp_reqsk(struct sk_buff *skb, struct sock *sk,
>>>> +					struct tcp_options_received *tcp_opt,
>>>> +					int tcp_opt__sz, u16 mss)
>>>> +{
>>>> +	const struct tcp_request_sock_ops *af_ops;
>>>> +	const struct request_sock_ops *ops;
>>>> +	struct inet_request_sock *ireq;
>>>> +	struct tcp_request_sock *treq;
>>>> +	struct request_sock *req;
>>>> +
>>>> +	if (!sk)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (!skb_at_tc_ingress(skb))
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (dev_net(skb->dev) != sock_net(sk))
>>>> +		return -ENETUNREACH;
>>>> +
>>>> +	switch (sk->sk_family) {
>>>> +	case AF_INET:  /* TODO: MPTCP */
>>>> +		ops = &tcp_request_sock_ops;
>>>> +		af_ops = &tcp_request_sock_ipv4_ops;
>>>> +		break;
>>>> +#if IS_ENABLED(CONFIG_IPV6)
>>>> +	case AF_INET6:
>>>> +		ops = &tcp6_request_sock_ops;
>>>> +		af_ops = &tcp_request_sock_ipv6_ops;
>>>> +		break;
>>>> +#endif
>>>> +	default:
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	if (sk->sk_type != SOCK_STREAM || sk->sk_state != TCP_LISTEN)
>>>> +		return -EINVAL;
>>>> +
>>>> +	req = inet_reqsk_alloc(ops, sk, true);
>>>> +	if (!req)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	ireq = inet_rsk(req);
>>>> +	treq = tcp_rsk(req);
>>>> +
>>>> +	refcount_set(&req->rsk_refcnt, 1);
>>>> +	req->syncookie = 1;
>>>> +	req->mss = mss;
>>>> +	req->ts_recent = tcp_opt->saw_tstamp ? tcp_opt->rcv_tsval : 0;
>>>> +
>>>> +	ireq->snd_wscale = tcp_opt->snd_wscale;
>>>> +	ireq->sack_ok = tcp_opt->sack_ok;
>>>> +	ireq->wscale_ok = tcp_opt->wscale_ok;
>>>> +	ireq->tstamp_ok	= tcp_opt->saw_tstamp;
>>>> +
>>>> +	tcp_rsk(req)->af_specific = af_ops;
>>>> +	tcp_rsk(req)->ts_off = tcp_opt->rcv_tsecr - tcp_ns_to_ts(tcp_clock_ns());
>>>> +
>>>> +	skb_orphan(skb);
>>>> +	skb->sk = req_to_sk(req);
>>>> +	skb->destructor = sock_pfree;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>    __diag_pop();
>>>>    
>>>>    int bpf_dynptr_from_skb_rdonly(struct sk_buff *skb, u64 flags,
>>>> @@ -11828,6 +11893,10 @@ BTF_SET8_START(bpf_kfunc_check_set_sock_addr)
>>>>    BTF_ID_FLAGS(func, bpf_sock_addr_set_sun_path)
>>>>    BTF_SET8_END(bpf_kfunc_check_set_sock_addr)
>>>>    
>>>> +BTF_SET8_START(bpf_kfunc_check_set_tcp_reqsk)
>>>> +BTF_ID_FLAGS(func, bpf_sk_assign_tcp_reqsk)
>>>> +BTF_SET8_END(bpf_kfunc_check_set_tcp_reqsk)
>>>> +
>>>>    static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
>>>>    	.owner = THIS_MODULE,
>>>>    	.set = &bpf_kfunc_check_set_skb,
>>>> @@ -11843,6 +11912,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_addr = {
>>>>    	.set = &bpf_kfunc_check_set_sock_addr,
>>>>    };
>>>>    
>>>> +static const struct btf_kfunc_id_set bpf_kfunc_set_tcp_reqsk = {
>>>> +	.owner = THIS_MODULE,
>>>> +	.set = &bpf_kfunc_check_set_tcp_reqsk,
>>>> +};
>>>> +
>>>>    static int __init bpf_kfunc_init(void)
>>>>    {
>>>>    	int ret;
>>>> @@ -11858,8 +11932,10 @@ static int __init bpf_kfunc_init(void)
>>>>    	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
>>>>    	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_NETFILTER, &bpf_kfunc_set_skb);
>>>>    	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
>>>> -	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
>>>> -						&bpf_kfunc_set_sock_addr);
>>>> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
>>>> +					       &bpf_kfunc_set_sock_addr);
>>>> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
>>>> +	return ret;
>>>>    }
>>>>    late_initcall(bpf_kfunc_init);
>>>>    
>>>> ---8<---




Thread overview: 44+ messages
2023-10-13 22:04 [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 01/11] tcp: Clean up reverse xmas tree in cookie_v[46]_check() Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 02/11] tcp: Cache sock_net(sk) " Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 03/11] tcp: Clean up goto labels " Kuniyuki Iwashima
2023-10-17  0:00   ` Kui-Feng Lee
2023-10-17  0:30     ` Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 04/11] tcp: Don't initialise tp->tsoffset in tcp_get_cookie_sock() Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 05/11] bpf: tcp: Add SYN Cookie generation SOCK_OPS hook Kuniyuki Iwashima
2023-10-18  0:54   ` Martin KaFai Lau
2023-10-18 17:00     ` Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 06/11] bpf: tcp: Add SYN Cookie validation " Kuniyuki Iwashima
2023-10-16 20:38   ` Stanislav Fomichev
2023-10-16 22:02     ` Kuniyuki Iwashima
2023-10-17 16:52   ` Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 07/11] bpf: Make bpf_sock_ops.replylong[1] writable Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 08/11] bpf: tcp: Make TS available for SYN Cookie storage Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 09/11] tcp: Split cookie_ecn_ok() Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 10/11] bpf: tcp: Make WS, SACK, ECN configurable from BPF SYN Cookie Kuniyuki Iwashima
2023-10-18  1:08   ` Martin KaFai Lau
2023-10-18 17:02     ` Kuniyuki Iwashima
2023-10-13 22:04 ` [PATCH v1 bpf-next 11/11] selftest: bpf: Test BPF_SOCK_OPS_(GEN|CHECK)_SYNCOOKIE_CB Kuniyuki Iwashima
2023-10-17  5:50   ` Martin KaFai Lau
2023-10-17 16:29     ` Kuniyuki Iwashima
2023-10-16 13:05 ` [PATCH v1 bpf-next 00/11] bpf: tcp: Add SYN Cookie generation/validation SOCK_OPS hooks Daniel Borkmann
2023-10-16 16:11   ` Kuniyuki Iwashima
2023-10-16 14:19 ` Willem de Bruijn
2023-10-16 16:46   ` Kuniyuki Iwashima
2023-10-16 18:41     ` Willem de Bruijn
2023-10-17  5:53 ` Martin KaFai Lau
2023-10-17 16:48   ` Kuniyuki Iwashima
2023-10-18  6:19     ` Martin KaFai Lau
2023-10-18  8:02       ` Eric Dumazet
2023-10-18 17:20         ` Kuniyuki Iwashima
2023-10-18 21:47           ` Kui-Feng Lee
2023-10-18 22:31             ` Kuniyuki Iwashima
2023-10-19  7:25               ` Martin KaFai Lau
2023-10-19 18:01                 ` Kuniyuki Iwashima
2023-10-20 19:59                   ` Martin KaFai Lau
2023-10-20 23:10                     ` Kuniyuki Iwashima
2023-10-21  6:48                       ` Kuniyuki Iwashima
2023-10-23 21:35                         ` Martin KaFai Lau
2023-10-24  0:37                           ` Kui-Feng Lee
2023-10-24  1:22                             ` Kuniyuki Iwashima
2023-10-24 17:55                               ` Kui-Feng Lee
