* [PATCH bpf-next 00/10] New BPF helpers to accelerate synproxy
@ 2021-10-19 14:46 Maxim Mikityanskiy
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux,
	Maxim Mikityanskiy

This series starts with some cleanup and bugfixing in the existing BPF
helpers for SYN cookies. The second half adds new functionality that
allows XDP to accelerate iptables synproxy.

struct nf_conn is exposed to BPF, and new helpers are added to query
conntrack info by 5-tuple. The only field exposed for now is status,
but the struct can be extended easily in the future.

New helpers are added to issue SYN and timestamp cookies and to check
SYN cookies without binding to a socket, which is useful in the synproxy
scenario.

Finally, sample XDP and userspace programs are added that show how all
the components work together. The XDP program uses socketless SYN
cookie helpers and queries conntrack status instead of socket status. A
demo script shows how to deploy the synproxy+XDP solution.

The draft of the new functionality was presented on Netdev 0x15:

https://netdevconf.info/0x15/session.html?Accelerating-synproxy-with-XDP

Maxim Mikityanskiy (10):
  bpf: Use ipv6_only_sock in bpf_tcp_gen_syncookie
  bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
  bpf: Use EOPNOTSUPP in bpf_tcp_check_syncookie
  bpf: Make errors of bpf_tcp_check_syncookie distinguishable
  bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
  bpf: Expose struct nf_conn to BPF
  bpf: Add helpers to query conntrack info
  bpf: Add helpers to issue and check SYN cookies in XDP
  bpf: Add a helper to issue timestamp cookies in XDP
  bpf: Add sample for raw syncookie helpers

 include/linux/bpf.h            |  46 +++
 include/net/tcp.h              |   2 +
 include/uapi/linux/bpf.h       | 193 ++++++++++-
 kernel/bpf/verifier.c          | 104 +++++-
 net/core/filter.c              | 433 +++++++++++++++++++++++-
 net/ipv4/syncookies.c          |  60 ++++
 net/ipv4/tcp_input.c           |   3 +-
 samples/bpf/.gitignore         |   1 +
 samples/bpf/Makefile           |   3 +
 samples/bpf/syncookie_kern.c   | 591 +++++++++++++++++++++++++++++++++
 samples/bpf/syncookie_test.sh  |  55 +++
 samples/bpf/syncookie_user.c   | 388 ++++++++++++++++++++++
 scripts/bpf_doc.py             |   1 +
 tools/include/uapi/linux/bpf.h | 193 ++++++++++-
 14 files changed, 2047 insertions(+), 26 deletions(-)
 create mode 100644 samples/bpf/syncookie_kern.c
 create mode 100755 samples/bpf/syncookie_test.sh
 create mode 100644 samples/bpf/syncookie_user.c

-- 
2.30.2



* [PATCH bpf-next 01/10] bpf: Use ipv6_only_sock in bpf_tcp_gen_syncookie
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)

Instead of querying the sk_ipv6only field directly, use the dedicated
ipv6_only_sock helper.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/core/filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 4bace37a6a44..d830055d477c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6788,7 +6788,7 @@ BPF_CALL_5(bpf_tcp_gen_syncookie, struct sock *, sk, void *, iph, u32, iph_len,
 	 */
 	switch (((struct iphdr *)iph)->version) {
 	case 4:
-		if (sk->sk_family == AF_INET6 && sk->sk_ipv6only)
+		if (sk->sk_family == AF_INET6 && ipv6_only_sock(sk))
 			return -EINVAL;
 
 		mss = tcp_v4_get_syncookie(sk, iph, th, &cookie);
-- 
2.30.2



* [PATCH bpf-next 02/10] bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)

bpf_tcp_gen_syncookie looks at the IP version in the IP header and
validates the address family of the socket. It supports IPv4 packets in
AF_INET6 dual-stack sockets.

On the other hand, bpf_tcp_check_syncookie looks only at the address
family of the socket, ignoring the real IP version in headers, and
validates only the packet size. This implementation has some drawbacks:

1. Packets are not validated properly, allowing a BPF program to trick
   bpf_tcp_check_syncookie into handling an IPv6 packet on an IPv4
   socket.

2. Dual-stack sockets fail the checks on IPv4 packets. IPv4 clients end
   up receiving a SYNACK with the cookie, but the following ACK gets
   dropped.

This patch fixes these issues by changing the checks in
bpf_tcp_check_syncookie to match the ones in bpf_tcp_gen_syncookie. The
IP version from the header is now taken into account and validated
properly against the socket's address family.

Fixes: 399040847084 ("bpf: add helper to check for a valid SYN cookie")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/core/filter.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d830055d477c..6cfb676e1adb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6714,24 +6714,33 @@ BPF_CALL_5(bpf_tcp_check_syncookie, struct sock *, sk, void *, iph, u32, iph_len
 	if (!th->ack || th->rst || th->syn)
 		return -ENOENT;
 
+	if (unlikely(iph_len < sizeof(struct iphdr)))
+		return -EINVAL;
+
 	if (tcp_synq_no_recent_overflow(sk))
 		return -ENOENT;
 
 	cookie = ntohl(th->ack_seq) - 1;
 
-	switch (sk->sk_family) {
-	case AF_INET:
-		if (unlikely(iph_len < sizeof(struct iphdr)))
+	/* Both struct iphdr and struct ipv6hdr have the version field at the
+	 * same offset so we can cast to the shorter header (struct iphdr).
+	 */
+	switch (((struct iphdr *)iph)->version) {
+	case 4:
+		if (sk->sk_family == AF_INET6 && ipv6_only_sock(sk))
 			return -EINVAL;
 
 		ret = __cookie_v4_check((struct iphdr *)iph, th, cookie);
 		break;
 
 #if IS_BUILTIN(CONFIG_IPV6)
-	case AF_INET6:
+	case 6:
 		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
 			return -EINVAL;
 
+		if (sk->sk_family != AF_INET6)
+			return -EINVAL;
+
 		ret = __cookie_v6_check((struct ipv6hdr *)iph, th, cookie);
 		break;
 #endif /* CONFIG_IPV6 */
-- 
2.30.2



* [PATCH bpf-next 03/10] bpf: Use EOPNOTSUPP in bpf_tcp_check_syncookie
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)

When CONFIG_SYN_COOKIES is off, bpf_tcp_check_syncookie returns
ENOTSUPP, a non-standard and deprecated error code. The related
function bpf_tcp_gen_syncookie, like most other helpers, returns
EOPNOTSUPP when a feature is not available. This patch changes ENOTSUPP
to EOPNOTSUPP in bpf_tcp_check_syncookie.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 net/core/filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 6cfb676e1adb..2c5877b775d9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6754,7 +6754,7 @@ BPF_CALL_5(bpf_tcp_check_syncookie, struct sock *, sk, void *, iph, u32, iph_len
 
 	return -ENOENT;
 #else
-	return -ENOTSUPP;
+	return -EOPNOTSUPP;
 #endif
 }
 
-- 
2.30.2



* [PATCH bpf-next 04/10] bpf: Make errors of bpf_tcp_check_syncookie distinguishable
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)

bpf_tcp_check_syncookie returns errors when SYN cookie generation is
disabled (EINVAL) or when no cookies were recently generated (ENOENT).
The same error codes are used for other kinds of errors: invalid
parameters (EINVAL), invalid packet (EINVAL, ENOENT), bad cookie
(ENOENT). Such an overlap makes it impossible for a BPF program to
distinguish different cases that may require different handling.

For a BPF program that accelerates generating and checking SYN cookies,
typical logic looks like this (with current error codes annotated):

1. Drop invalid packets (EINVAL, ENOENT).

2. Drop packets with bad cookies (ENOENT).

3. Pass packets with good cookies (0).

4. Pass all packets when cookies are not in use (EINVAL, ENOENT).

The last point also matches the behavior of cookie_v4_check and
cookie_v6_check that skip all checks if cookie generation is disabled or
no cookies were recently generated. Overlapping error codes, however,
make it impossible to distinguish case 4 from cases 1 and 2.

The original commit message of commit 399040847084 ("bpf: add helper to
check for a valid SYN cookie") mentions another use case, though:
traffic classification, where it's important to distinguish new
connections from existing ones, and case 4 should be distinguishable
from case 3.

To match the requirements of both use cases, this patch reassigns error
codes of bpf_tcp_check_syncookie and adds missing documentation:

1. EINVAL: Invalid packets.

2. EACCES: Packets with bad cookies.

3. 0: Packets with good cookies.

4. ENOENT: Cookies are not in use.

This way all four cases are easily distinguishable.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/uapi/linux/bpf.h       | 18 ++++++++++++++++--
 net/core/filter.c              |  6 +++---
 tools/include/uapi/linux/bpf.h | 18 ++++++++++++++++--
 3 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6fc59d61937a..2f12b11f1259 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3545,8 +3545,22 @@ union bpf_attr {
  * 		*th* points to the start of the TCP header, while *th_len*
  * 		contains **sizeof**\ (**struct tcphdr**).
  * 	Return
- * 		0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
- * 		error otherwise.
+ *		0 if *iph* and *th* are a valid SYN cookie ACK.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EACCES** if the SYN cookie is not valid.
+ *
+ *		**-EINVAL** if the packet or input arguments are invalid.
+ *
+ *		**-ENOENT** if SYN cookies are not issued (no SYN flood, or SYN
+ *		cookies are disabled in sysctl).
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
+ *
+ *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
+ *		CONFIG_IPV6 is disabled).
  *
  * long bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
  *	Description
diff --git a/net/core/filter.c b/net/core/filter.c
index 2c5877b775d9..d04988e67640 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6709,10 +6709,10 @@ BPF_CALL_5(bpf_tcp_check_syncookie, struct sock *, sk, void *, iph, u32, iph_len
 		return -EINVAL;
 
 	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies)
-		return -EINVAL;
+		return -ENOENT;
 
 	if (!th->ack || th->rst || th->syn)
-		return -ENOENT;
+		return -EINVAL;
 
 	if (unlikely(iph_len < sizeof(struct iphdr)))
 		return -EINVAL;
@@ -6752,7 +6752,7 @@ BPF_CALL_5(bpf_tcp_check_syncookie, struct sock *, sk, void *, iph, u32, iph_len
 	if (ret > 0)
 		return 0;
 
-	return -ENOENT;
+	return -EACCES;
 #else
 	return -EOPNOTSUPP;
 #endif
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6fc59d61937a..2f12b11f1259 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3545,8 +3545,22 @@ union bpf_attr {
  * 		*th* points to the start of the TCP header, while *th_len*
  * 		contains **sizeof**\ (**struct tcphdr**).
  * 	Return
- * 		0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
- * 		error otherwise.
+ *		0 if *iph* and *th* are a valid SYN cookie ACK.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EACCES** if the SYN cookie is not valid.
+ *
+ *		**-EINVAL** if the packet or input arguments are invalid.
+ *
+ *		**-ENOENT** if SYN cookies are not issued (no SYN flood, or SYN
+ *		cookies are disabled in sysctl).
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
+ *
+ *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
+ *		CONFIG_IPV6 is disabled).
  *
  * long bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
  *	Description
-- 
2.30.2



* [PATCH bpf-next 05/10] bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)

bpf_tcp_gen_syncookie and bpf_tcp_check_syncookie expect the full
length of the TCP header (including all options). Fix the documentation
that says it should be sizeof(struct tcphdr).

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/uapi/linux/bpf.h       | 6 ++++--
 tools/include/uapi/linux/bpf.h | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2f12b11f1259..efb2750f39c6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3543,7 +3543,8 @@ union bpf_attr {
  * 		**sizeof**\ (**struct ip6hdr**).
  *
  * 		*th* points to the start of the TCP header, while *th_len*
- * 		contains **sizeof**\ (**struct tcphdr**).
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
  * 	Return
  *		0 if *iph* and *th* are a valid SYN cookie ACK.
  *
@@ -3743,7 +3744,8 @@ union bpf_attr {
  *		**sizeof**\ (**struct ip6hdr**).
  *
  *		*th* points to the start of the TCP header, while *th_len*
- *		contains the length of the TCP header.
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
  *	Return
  *		On success, lower 32 bits hold the generated SYN cookie in
  *		followed by 16 bits which hold the MSS value for that cookie,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2f12b11f1259..efb2750f39c6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3543,7 +3543,8 @@ union bpf_attr {
  * 		**sizeof**\ (**struct ip6hdr**).
  *
  * 		*th* points to the start of the TCP header, while *th_len*
- * 		contains **sizeof**\ (**struct tcphdr**).
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
  * 	Return
  *		0 if *iph* and *th* are a valid SYN cookie ACK.
  *
@@ -3743,7 +3744,8 @@ union bpf_attr {
  *		**sizeof**\ (**struct ip6hdr**).
  *
  *		*th* points to the start of the TCP header, while *th_len*
- *		contains the length of the TCP header.
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
  *	Return
  *		On success, lower 32 bits hold the generated SYN cookie in
  *		followed by 16 bits which hold the MSS value for that cookie,
-- 
2.30.2



* [PATCH bpf-next 06/10] bpf: Expose struct nf_conn to BPF
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)

This commit adds struct nf_conn as a new type to BPF. For now, only the
status field is mapped. This will allow adding helpers that expose
conntrack information to BPF programs.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/linux/bpf.h            | 46 ++++++++++++++++
 include/uapi/linux/bpf.h       |  4 ++
 kernel/bpf/verifier.c          | 95 ++++++++++++++++++++++++++++++++--
 net/core/filter.c              | 41 +++++++++++++++
 scripts/bpf_doc.py             |  1 +
 tools/include/uapi/linux/bpf.h |  4 ++
 6 files changed, 186 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d604c8251d88..21ca6e1f0f7a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -341,6 +341,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_STACK_OR_NULL,	/* pointer to stack or NULL */
 	ARG_PTR_TO_CONST_STR,	/* pointer to a null terminated read-only string */
 	ARG_PTR_TO_TIMER,	/* pointer to bpf_timer */
+	ARG_PTR_TO_NF_CONN,	/* pointer to nf_conn */
 	__BPF_ARG_TYPE_MAX,
 };
 
@@ -358,6 +359,7 @@ enum bpf_return_type {
 	RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, /* returns a pointer to a valid memory or a btf_id or NULL */
 	RET_PTR_TO_MEM_OR_BTF_ID,	/* returns a pointer to a valid memory or a btf_id */
 	RET_PTR_TO_BTF_ID,		/* returns a pointer to a btf_id */
+	RET_PTR_TO_NF_CONN_OR_NULL,	/* returns a pointer to a nf_conn or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -459,6 +461,8 @@ enum bpf_reg_type {
 	PTR_TO_PERCPU_BTF_ID,	 /* reg points to a percpu kernel variable */
 	PTR_TO_FUNC,		 /* reg points to a bpf program function */
 	PTR_TO_MAP_KEY,		 /* reg points to a map element key */
+	PTR_TO_NF_CONN,		 /* reg points to struct nf_conn */
+	PTR_TO_NF_CONN_OR_NULL,	 /* reg points to struct nf_conn or NULL */
 	__BPF_REG_TYPE_MAX,
 };
 
@@ -2127,6 +2131,32 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 				struct bpf_insn *insn_buf,
 				struct bpf_prog *prog,
 				u32 *target_size);
+
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+bool bpf_ct_is_valid_access(int off, int size, enum bpf_access_type type,
+			    struct bpf_insn_access_aux *info);
+u32 bpf_ct_convert_ctx_access(enum bpf_access_type type,
+			      const struct bpf_insn *si,
+			      struct bpf_insn *insn_buf,
+			      struct bpf_prog *prog, u32 *target_size);
+#else
+static inline bool bpf_ct_is_valid_access(int off, int size,
+					  enum bpf_access_type type,
+					  struct bpf_insn_access_aux *info)
+{
+	return false;
+}
+
+static inline u32 bpf_ct_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	return 0;
+}
+#endif
+
 #else
 static inline bool bpf_sock_common_is_valid_access(int off, int size,
 						   enum bpf_access_type type,
@@ -2148,6 +2178,22 @@ static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 {
 	return 0;
 }
+
+static inline bool bpf_ct_is_valid_access(int off, int size,
+					  enum bpf_access_type type,
+					  struct bpf_insn_access_aux *info)
+{
+	return false;
+}
+
+static inline u32 bpf_ct_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_INET
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index efb2750f39c6..a10a44c4f79b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5472,6 +5472,10 @@ struct bpf_xdp_sock {
 	__u32 queue_id;
 };
 
+struct bpf_nf_conn {
+	__u64 status;
+};
+
 #define XDP_PACKET_HEADROOM 256
 
 /* User return codes for XDP prog type.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 20900a1bac12..6eafef35e247 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -436,13 +436,19 @@ static bool type_is_sk_pointer(enum bpf_reg_type type)
 		type == PTR_TO_XDP_SOCK;
 }
 
+static bool type_is_ct_pointer(enum bpf_reg_type type)
+{
+	return type == PTR_TO_NF_CONN;
+}
+
 static bool reg_type_not_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_SOCKET ||
 		type == PTR_TO_TCP_SOCK ||
 		type == PTR_TO_MAP_VALUE ||
 		type == PTR_TO_MAP_KEY ||
-		type == PTR_TO_SOCK_COMMON;
+		type == PTR_TO_SOCK_COMMON ||
+		type == PTR_TO_NF_CONN;
 }
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
@@ -454,7 +460,8 @@ static bool reg_type_may_be_null(enum bpf_reg_type type)
 	       type == PTR_TO_BTF_ID_OR_NULL ||
 	       type == PTR_TO_MEM_OR_NULL ||
 	       type == PTR_TO_RDONLY_BUF_OR_NULL ||
-	       type == PTR_TO_RDWR_BUF_OR_NULL;
+	       type == PTR_TO_RDWR_BUF_OR_NULL ||
+	       type == PTR_TO_NF_CONN_OR_NULL;
 }
 
 static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
@@ -470,12 +477,15 @@ static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
 		type == PTR_TO_TCP_SOCK ||
 		type == PTR_TO_TCP_SOCK_OR_NULL ||
 		type == PTR_TO_MEM ||
-		type == PTR_TO_MEM_OR_NULL;
+		type == PTR_TO_MEM_OR_NULL ||
+		type == PTR_TO_NF_CONN ||
+		type == PTR_TO_NF_CONN_OR_NULL;
 }
 
 static bool arg_type_may_be_refcounted(enum bpf_arg_type type)
 {
-	return type == ARG_PTR_TO_SOCK_COMMON;
+	return type == ARG_PTR_TO_SOCK_COMMON ||
+		type == ARG_PTR_TO_NF_CONN;
 }
 
 static bool arg_type_may_be_null(enum bpf_arg_type type)
@@ -577,6 +587,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_RDWR_BUF_OR_NULL] = "rdwr_buf_or_null",
 	[PTR_TO_FUNC]		= "func",
 	[PTR_TO_MAP_KEY]	= "map_key",
+	[PTR_TO_NF_CONN]	= "nf_conn",
+	[PTR_TO_NF_CONN_OR_NULL] = "nf_conn_or_null",
 };
 
 static char slot_type_char[] = {
@@ -1189,6 +1201,9 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
 	case PTR_TO_RDWR_BUF_OR_NULL:
 		reg->type = PTR_TO_RDWR_BUF;
 		break;
+	case PTR_TO_NF_CONN_OR_NULL:
+		reg->type = PTR_TO_NF_CONN;
+		break;
 	default:
 		WARN_ONCE(1, "unknown nullable register type");
 	}
@@ -2748,6 +2763,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_MEM_OR_NULL:
 	case PTR_TO_FUNC:
 	case PTR_TO_MAP_KEY:
+	case PTR_TO_NF_CONN:
+	case PTR_TO_NF_CONN_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -3665,6 +3682,40 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx,
 	return -EACCES;
 }
 
+static int check_ct_access(struct bpf_verifier_env *env, int insn_idx,
+			   u32 regno, int off, int size, enum bpf_access_type t)
+{
+	struct bpf_reg_state *regs = cur_regs(env);
+	struct bpf_reg_state *reg = &regs[regno];
+	struct bpf_insn_access_aux info = {};
+	bool valid;
+
+	if (reg->smin_value < 0) {
+		verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
+			regno);
+		return -EACCES;
+	}
+
+	switch (reg->type) {
+	case PTR_TO_NF_CONN:
+		valid = bpf_ct_is_valid_access(off, size, t, &info);
+		break;
+	default:
+		valid = false;
+	}
+
+	if (valid) {
+		env->insn_aux_data[insn_idx].ctx_field_size =
+			info.ctx_field_size;
+		return 0;
+	}
+
+	verbose(env, "R%d invalid %s access off=%d size=%d\n",
+		regno, reg_type_str[reg->type], off, size);
+
+	return -EACCES;
+}
+
 static bool is_pointer_value(struct bpf_verifier_env *env, int regno)
 {
 	return __is_pointer_value(env->allow_ptr_leaks, reg_state(env, regno));
@@ -3684,6 +3735,13 @@ static bool is_sk_reg(struct bpf_verifier_env *env, int regno)
 	return type_is_sk_pointer(reg->type);
 }
 
+static bool is_ct_reg(struct bpf_verifier_env *env, int regno)
+{
+	const struct bpf_reg_state *reg = reg_state(env, regno);
+
+	return type_is_ct_pointer(reg->type);
+}
+
 static bool is_pkt_reg(struct bpf_verifier_env *env, int regno)
 {
 	const struct bpf_reg_state *reg = reg_state(env, regno);
@@ -3804,6 +3862,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_XDP_SOCK:
 		pointer_desc = "xdp_sock ";
 		break;
+	case PTR_TO_NF_CONN:
+		pointer_desc = "nf_conn ";
+		break;
 	default:
 		break;
 	}
@@ -4478,6 +4539,15 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		err = check_sock_access(env, insn_idx, regno, off, size, t);
 		if (!err && value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
+	} else if (type_is_ct_pointer(reg->type)) {
+		if (t == BPF_WRITE) {
+			verbose(env, "R%d cannot write into %s\n",
+				regno, reg_type_str[reg->type]);
+			return -EACCES;
+		}
+		err = check_ct_access(env, insn_idx, regno, off, size, t);
+		if (!err && value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
 	} else if (reg->type == PTR_TO_TP_BUFFER) {
 		err = check_tp_buffer_access(env, reg, regno, off, size);
 		if (!err && t == BPF_READ && value_regno >= 0)
@@ -4571,7 +4641,8 @@ static int check_atomic(struct bpf_verifier_env *env, int insn_idx, struct bpf_i
 	if (is_ctx_reg(env, insn->dst_reg) ||
 	    is_pkt_reg(env, insn->dst_reg) ||
 	    is_flow_key_reg(env, insn->dst_reg) ||
-	    is_sk_reg(env, insn->dst_reg)) {
+	    is_sk_reg(env, insn->dst_reg) ||
+	    is_ct_reg(env, insn->dst_reg)) {
 		verbose(env, "BPF_ATOMIC stores into R%d %s is not allowed\n",
 			insn->dst_reg,
 			reg_type_str[reg_state(env, insn->dst_reg)->type]);
@@ -5086,6 +5157,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
 static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
 static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
 static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types nf_conn_types = { .types = { PTR_TO_NF_CONN } };
 
 static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_MAP_KEY]		= &map_key_value_types,
@@ -5118,6 +5190,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
 	[ARG_PTR_TO_STACK_OR_NULL]	= &stack_ptr_types,
 	[ARG_PTR_TO_CONST_STR]		= &const_str_ptr_types,
 	[ARG_PTR_TO_TIMER]		= &timer_types,
+	[ARG_PTR_TO_NF_CONN]		= &nf_conn_types,
 };
 
 static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
@@ -6586,6 +6659,9 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		 */
 		regs[BPF_REG_0].btf = btf_vmlinux;
 		regs[BPF_REG_0].btf_id = ret_btf_id;
+	} else if (fn->ret_type == RET_PTR_TO_NF_CONN_OR_NULL) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_NF_CONN_OR_NULL;
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -7214,6 +7290,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
+	case PTR_TO_NF_CONN:
+	case PTR_TO_NF_CONN_OR_NULL:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -10505,6 +10583,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
 	case PTR_TO_TCP_SOCK:
 	case PTR_TO_TCP_SOCK_OR_NULL:
 	case PTR_TO_XDP_SOCK:
+	case PTR_TO_NF_CONN:
+	case PTR_TO_NF_CONN_OR_NULL:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
@@ -11040,6 +11120,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_XDP_SOCK:
 	case PTR_TO_BTF_ID:
 	case PTR_TO_BTF_ID_OR_NULL:
+	case PTR_TO_NF_CONN:
+	case PTR_TO_NF_CONN_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -12462,6 +12544,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 				return -EINVAL;
 			}
 			continue;
+		case PTR_TO_NF_CONN:
+			convert_ctx_access = bpf_ct_convert_ctx_access;
+			break;
 		default:
 			continue;
 		}
diff --git a/net/core/filter.c b/net/core/filter.c
index d04988e67640..d2d07ccae599 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,7 @@
 #include <linux/btf_ids.h>
 #include <net/tls.h>
 #include <net/xdp.h>
+#include <net/netfilter/nf_conntrack.h>
 
 static const struct bpf_func_proto *
 bpf_sk_base_func_proto(enum bpf_func_id func_id);
@@ -8002,6 +8003,24 @@ bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 	return size == size_default;
 }
 
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+bool bpf_ct_is_valid_access(int off, int size, enum bpf_access_type type,
+			    struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off > sizeof(struct bpf_nf_conn))
+		return false;
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_nf_conn, status):
+		return size == sizeof_field(struct bpf_nf_conn, status);
+	}
+
+	return false;
+}
+#endif
+
 static bool sock_filter_is_valid_access(int off, int size,
 					enum bpf_access_type type,
 					const struct bpf_prog *prog,
@@ -9094,6 +9113,28 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+u32 bpf_ct_convert_ctx_access(enum bpf_access_type type,
+			      const struct bpf_insn *si,
+			      struct bpf_insn *insn_buf,
+			      struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_nf_conn, status):
+		BUILD_BUG_ON(sizeof_field(struct nf_conn, status) >
+			     sizeof_field(struct bpf_nf_conn, status));
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct nf_conn, status),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct nf_conn, status));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+#endif
+
 static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type,
 					 const struct bpf_insn *si,
 					 struct bpf_insn *insn_buf,
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 00ac7b79cddb..0c2cd955f5e0 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -596,6 +596,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct bpf_nf_conn',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index efb2750f39c6..a10a44c4f79b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5472,6 +5472,10 @@ struct bpf_xdp_sock {
 	__u32 queue_id;
 };
 
+struct bpf_nf_conn {
+	__u64 status;
+};
+
 #define XDP_PACKET_HEADROOM 256
 
 /* User return codes for XDP prog type.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-19 14:46 [PATCH bpf-next 00/10] New BPF helpers to accelerate synproxy Maxim Mikityanskiy
                   ` (5 preceding siblings ...)
  2021-10-19 14:46 ` [PATCH bpf-next 06/10] bpf: Expose struct nf_conn to BPF Maxim Mikityanskiy
@ 2021-10-19 14:46 ` Maxim Mikityanskiy
  2021-10-20  3:56   ` Kumar Kartikeya Dwivedi
  2021-10-20  9:46   ` Toke Høiland-Jørgensen
  2021-10-19 14:46 ` [PATCH bpf-next 08/10] bpf: Add helpers to issue and check SYN cookies in XDP Maxim Mikityanskiy
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux,
	Maxim Mikityanskiy

The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
connection tracking information for TCP and UDP connections, based on
the source and destination IP addresses and ports. On success, each
helper returns a pointer to struct nf_conn (if the conntrack entry was
found), which must be released with bpf_ct_release.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/uapi/linux/bpf.h       |  81 +++++++++++++
 kernel/bpf/verifier.c          |   9 +-
 net/core/filter.c              | 205 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  81 +++++++++++++
 4 files changed, 373 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a10a44c4f79b..883de3f1bb8b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4925,6 +4925,79 @@ union bpf_attr {
  *	Return
  *		The number of bytes written to the buffer, or a negative error
  *		in case of failure.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a TCP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct bpf_nf_conn** whose **status** field
+ *		holds the connection tracking status (see **enum
+ *		ip_conntrack_status**), or **NULL** in case of failure or if
+ *		there is no conntrack entry for this tuple.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a UDP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct bpf_nf_conn** whose **status** field
+ *		holds the connection tracking status (see **enum
+ *		ip_conntrack_status**), or **NULL** in case of failure or if
+ *		there is no conntrack entry for this tuple.
+ *
+ * long bpf_ct_release(void *ct)
+ *	Description
+ *		Release the reference held by *ct*. *ct* must be a non-**NULL**
+ *		pointer that was returned from **bpf_ct_lookup_xxx**\ ().
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5105,6 +5178,9 @@ union bpf_attr {
 	FN(task_pt_regs),		\
 	FN(get_branch_snapshot),	\
 	FN(trace_vprintk),		\
+	FN(ct_lookup_tcp),		\
+	FN(ct_lookup_udp),		\
+	FN(ct_release),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5288,6 +5364,11 @@ enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_ct_lookup_{tcp,udp} helpers. */
+enum {
+	BPF_F_CT_DIR_REPLY	= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6eafef35e247..23e2a23ca9c4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -506,7 +506,8 @@ static bool is_release_function(enum bpf_func_id func_id)
 {
 	return func_id == BPF_FUNC_sk_release ||
 	       func_id == BPF_FUNC_ringbuf_submit ||
-	       func_id == BPF_FUNC_ringbuf_discard;
+	       func_id == BPF_FUNC_ringbuf_discard ||
+	       func_id == BPF_FUNC_ct_release;
 }
 
 static bool may_be_acquire_function(enum bpf_func_id func_id)
@@ -515,7 +516,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
 		func_id == BPF_FUNC_sk_lookup_udp ||
 		func_id == BPF_FUNC_skc_lookup_tcp ||
 		func_id == BPF_FUNC_map_lookup_elem ||
-	        func_id == BPF_FUNC_ringbuf_reserve;
+		func_id == BPF_FUNC_ringbuf_reserve ||
+		func_id == BPF_FUNC_ct_lookup_tcp;
 }
 
 static bool is_acquire_function(enum bpf_func_id func_id,
@@ -526,7 +528,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
 	if (func_id == BPF_FUNC_sk_lookup_tcp ||
 	    func_id == BPF_FUNC_sk_lookup_udp ||
 	    func_id == BPF_FUNC_skc_lookup_tcp ||
-	    func_id == BPF_FUNC_ringbuf_reserve)
+	    func_id == BPF_FUNC_ringbuf_reserve ||
+	    func_id == BPF_FUNC_ct_lookup_tcp)
 		return true;
 
 	if (func_id == BPF_FUNC_map_lookup_elem &&
diff --git a/net/core/filter.c b/net/core/filter.c
index d2d07ccae599..f913851c97f7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -79,6 +79,7 @@
 #include <net/tls.h>
 #include <net/xdp.h>
 #include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
 
 static const struct bpf_func_proto *
 bpf_sk_base_func_proto(enum bpf_func_id func_id);
@@ -7096,6 +7097,194 @@ static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+static struct nf_conn *bpf_ct_lookup(struct net *caller_net,
+				     struct bpf_sock_tuple *tuple,
+				     u32 tuple_len,
+				     u8 protonum,
+				     u64 netns_id,
+				     u64 flags)
+{
+	struct nf_conntrack_tuple ct_tuple = {};
+	struct nf_conntrack_tuple_hash *found;
+	struct net *net;
+	u8 direction;
+
+	direction = flags & BPF_F_CT_DIR_REPLY ? IP_CT_DIR_REPLY :
+						 IP_CT_DIR_ORIGINAL;
+
+	if (flags & ~BPF_F_CT_DIR_REPLY)
+		return ERR_PTR(-EINVAL);
+
+	if (tuple_len == sizeof(tuple->ipv4)) {
+		ct_tuple.src.l3num = AF_INET;
+		ct_tuple.src.u3.ip = tuple->ipv4.saddr;
+		ct_tuple.src.u.tcp.port = tuple->ipv4.sport;
+		ct_tuple.dst.u3.ip = tuple->ipv4.daddr;
+		ct_tuple.dst.u.tcp.port = tuple->ipv4.dport;
+#if IS_ENABLED(CONFIG_IPV6)
+	} else if (tuple_len == sizeof(tuple->ipv6)) {
+		ct_tuple.src.l3num = AF_INET6;
+		memcpy(ct_tuple.src.u3.ip6, tuple->ipv6.saddr,
+		       sizeof(tuple->ipv6.saddr));
+		ct_tuple.src.u.tcp.port = tuple->ipv6.sport;
+		memcpy(ct_tuple.dst.u3.ip6, tuple->ipv6.daddr,
+		       sizeof(tuple->ipv6.daddr));
+		ct_tuple.dst.u.tcp.port = tuple->ipv6.dport;
+#endif
+	} else {
+		return ERR_PTR(-EAFNOSUPPORT);
+	}
+
+	ct_tuple.dst.protonum = protonum;
+	ct_tuple.dst.dir = direction;
+
+	net = caller_net;
+	if ((s32)netns_id >= 0) {
+		if (unlikely(netns_id > S32_MAX))
+			return ERR_PTR(-EINVAL);
+		net = get_net_ns_by_id(net, netns_id);
+		if (!net)
+			return ERR_PTR(-EINVAL);
+	}
+
+	found = nf_conntrack_find_get(net, &nf_ct_zone_dflt, &ct_tuple);
+
+	if ((s32)netns_id >= 0)
+		put_net(net);
+
+	if (!found)
+		return ERR_PTR(-ENOENT);
+	return nf_ct_tuplehash_to_ctrack(found);
+}
+
+BPF_CALL_5(bpf_xdp_ct_lookup_tcp, struct xdp_buff *, ctx,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct nf_conn *ct;
+
+	ct = bpf_ct_lookup(dev_net(ctx->rxq->dev), tuple, tuple_len,
+			   IPPROTO_TCP, netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_xdp_ct_lookup_tcp_proto = {
+	.func		= bpf_xdp_ct_lookup_tcp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_5(bpf_xdp_ct_lookup_udp, struct xdp_buff *, ctx,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct nf_conn *ct;
+
+	ct = bpf_ct_lookup(dev_net(ctx->rxq->dev), tuple, tuple_len,
+			   IPPROTO_UDP, netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_xdp_ct_lookup_udp_proto = {
+	.func		= bpf_xdp_ct_lookup_udp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_5(bpf_skb_ct_lookup_tcp, struct sk_buff *, skb,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct net *caller_net;
+	struct nf_conn *ct;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	ct = bpf_ct_lookup(caller_net, tuple, tuple_len, IPPROTO_TCP,
+			   netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_skb_ct_lookup_tcp_proto = {
+	.func		= bpf_skb_ct_lookup_tcp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_5(bpf_skb_ct_lookup_udp, struct sk_buff *, skb,
+	   struct bpf_sock_tuple *, tuple, u32, tuple_len,
+	   u64, netns_id, u64 *, flags_err)
+{
+	struct net *caller_net;
+	struct nf_conn *ct;
+
+	caller_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+	ct = bpf_ct_lookup(caller_net, tuple, tuple_len, IPPROTO_UDP,
+			   netns_id, *flags_err);
+	if (IS_ERR(ct)) {
+		*flags_err = PTR_ERR(ct);
+		return (unsigned long)NULL;
+	}
+	return (unsigned long)ct;
+}
+
+static const struct bpf_func_proto bpf_skb_ct_lookup_udp_proto = {
+	.func		= bpf_skb_ct_lookup_udp,
+	.gpl_only	= true, /* nf_conntrack_find_get is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_PTR_TO_NF_CONN_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_1(bpf_ct_release, struct nf_conn *, ct)
+{
+	nf_ct_put(ct);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_ct_release_proto = {
+	.func		= bpf_ct_release,
+	.gpl_only	= false,
+	.pkt_access	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_NF_CONN,
+};
+#endif
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -7455,6 +7644,14 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_gen_syncookie_proto;
 	case BPF_FUNC_sk_assign:
 		return &bpf_sk_assign_proto;
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+	case BPF_FUNC_ct_lookup_tcp:
+		return &bpf_skb_ct_lookup_tcp_proto;
+	case BPF_FUNC_ct_lookup_udp:
+		return &bpf_skb_ct_lookup_udp_proto;
+	case BPF_FUNC_ct_release:
+		return &bpf_ct_release_proto;
+#endif
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
@@ -7498,6 +7695,14 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_check_syncookie_proto;
 	case BPF_FUNC_tcp_gen_syncookie:
 		return &bpf_tcp_gen_syncookie_proto;
+#if IS_BUILTIN(CONFIG_NF_CONNTRACK)
+	case BPF_FUNC_ct_lookup_tcp:
+		return &bpf_xdp_ct_lookup_tcp_proto;
+	case BPF_FUNC_ct_lookup_udp:
+		return &bpf_xdp_ct_lookup_udp_proto;
+	case BPF_FUNC_ct_release:
+		return &bpf_ct_release_proto;
+#endif
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a10a44c4f79b..883de3f1bb8b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4925,6 +4925,79 @@ union bpf_attr {
  *	Return
  *		The number of bytes written to the buffer, or a negative error
  *		in case of failure.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a TCP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct bpf_nf_conn** whose **status** field
+ *		holds the connection tracking status (see **enum
+ *		ip_conntrack_status**), or **NULL** in case of failure or if
+ *		there is no conntrack entry for this tuple.
+ *
+ * struct bpf_nf_conn *bpf_ct_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 *flags_err)
+ *	Description
+ *		Look for conntrack info for a UDP connection matching *tuple*,
+ *		optionally in a child network namespace *netns*.
+ *
+ *		The *flags_err* argument is used as an input parameter for flags
+ *		and output parameter for the error code. The flags can be a
+ *		combination of one or more of the following values:
+ *
+ *		**BPF_F_CT_DIR_REPLY**
+ *			When set, the conntrack direction is IP_CT_DIR_REPLY,
+ *			otherwise IP_CT_DIR_ORIGINAL.
+ *
+ *		If the function returns **NULL**, *flags_err* will indicate the
+ *		error code:
+ *
+ *		**EAFNOSUPPORT**
+ *			*tuple_size* doesn't match supported address families
+ *			(AF_INET; AF_INET6 when CONFIG_IPV6 is enabled).
+ *
+ *		**EINVAL**
+ *			Input arguments are not valid.
+ *
+ *		**ENOENT**
+ *			Connection tracking entry for *tuple* wasn't found.
+ *
+ *		This helper is available only if the kernel was compiled with
+ *		**CONFIG_NF_CONNTRACK** configuration option as built-in.
+ *	Return
+ *		A pointer to **struct bpf_nf_conn** whose **status** field
+ *		holds the connection tracking status (see **enum
+ *		ip_conntrack_status**), or **NULL** in case of failure or if
+ *		there is no conntrack entry for this tuple.
+ *
+ * long bpf_ct_release(void *ct)
+ *	Description
+ *		Release the reference held by *ct*. *ct* must be a non-**NULL**
+ *		pointer that was returned from **bpf_ct_lookup_xxx**\ ().
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5105,6 +5178,9 @@ union bpf_attr {
 	FN(task_pt_regs),		\
 	FN(get_branch_snapshot),	\
 	FN(trace_vprintk),		\
+	FN(ct_lookup_tcp),		\
+	FN(ct_lookup_udp),		\
+	FN(ct_release),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5288,6 +5364,11 @@ enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_ct_lookup_{tcp,udp} helpers. */
+enum {
+	BPF_F_CT_DIR_REPLY	= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
-- 
2.30.2



* [PATCH bpf-next 08/10] bpf: Add helpers to issue and check SYN cookies in XDP
  2021-10-19 14:46 [PATCH bpf-next 00/10] New BPF helpers to accelerate synproxy Maxim Mikityanskiy
                   ` (6 preceding siblings ...)
  2021-10-19 14:46 ` [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info Maxim Mikityanskiy
@ 2021-10-19 14:46 ` Maxim Mikityanskiy
  2021-10-19 14:46 ` [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp " Maxim Mikityanskiy
  2021-10-19 14:46 ` [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers Maxim Mikityanskiy
  9 siblings, 0 replies; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux,
	Maxim Mikityanskiy

The new helpers bpf_tcp_raw_{gen,check}_syncookie allow an XDP program
to generate SYN cookies in response to TCP SYN packets and to check
those cookies upon receiving the first ACK packet (the final packet of
the TCP handshake).

Unlike bpf_tcp_{gen,check}_syncookie, these new helpers don't require a
listening socket on the local machine, which makes it possible to use
them together with synproxy to accelerate SYN cookie generation.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/net/tcp.h              |   1 +
 include/uapi/linux/bpf.h       |  57 +++++++++++++++
 net/core/filter.c              | 122 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c           |   3 +-
 tools/include/uapi/linux/bpf.h |  57 +++++++++++++++
 5 files changed, 239 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4c2898ac6569..1cc96a225848 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -431,6 +431,7 @@ u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph,
 			 struct tcphdr *th, u32 *cookie);
 u16 tcp_v6_get_syncookie(struct sock *sk, struct ipv6hdr *iph,
 			 struct tcphdr *th, u32 *cookie);
+u16 tcp_parse_mss_option(const struct tcphdr *th, u16 user_mss);
 u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops,
 			  const struct tcp_request_sock_ops *af_ops,
 			  struct sock *sk, struct tcphdr *th);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 883de3f1bb8b..e32f72077250 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4998,6 +4998,61 @@ union bpf_attr {
  *		pointer that was returned from **bpf_ct_lookup_xxx**\ ().
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * s64 bpf_tcp_raw_gen_syncookie(void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Try to issue a SYN cookie for the packet with corresponding
+ *		IP/TCP headers, *iph* and *th*, without depending on a listening
+ *		socket.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ipv6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
+ *	Return
+ *		On success, the lower 32 bits hold the generated SYN cookie,
+ *		followed by 16 bits which hold the MSS value for that cookie,
+ *		and the top 16 bits are unused.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** if the packet or input arguments are invalid.
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
+ *
+ *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
+ *		CONFIG_IPV6 is disabled).
+ *
+ * int bpf_tcp_raw_check_syncookie(void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Check whether *iph* and *th* contain a valid SYN cookie ACK
+ *		without depending on a listening socket.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ipv6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
+ *	Return
+ *		0 if *iph* and *th* are a valid SYN cookie ACK.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EACCES** if the SYN cookie is not valid.
+ *
+ *		**-EINVAL** if the packet or input arguments are invalid.
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
+ *
+ *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
+ *		CONFIG_IPV6 is disabled).
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5181,6 +5236,8 @@ union bpf_attr {
 	FN(ct_lookup_tcp),		\
 	FN(ct_lookup_udp),		\
 	FN(ct_release),			\
+	FN(tcp_raw_gen_syncookie),	\
+	FN(tcp_raw_check_syncookie),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index f913851c97f7..5f03d4a282a0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7285,6 +7285,124 @@ static const struct bpf_func_proto bpf_ct_release_proto = {
 };
 #endif
 
+BPF_CALL_4(bpf_tcp_raw_gen_syncookie, void *, iph, u32, iph_len,
+	   struct tcphdr *, th, u32, th_len)
+{
+#ifdef CONFIG_SYN_COOKIES
+	u32 cookie;
+	u16 mss;
+
+	if (unlikely(th_len < sizeof(*th) || th_len != th->doff * 4))
+		return -EINVAL;
+
+	if (!th->syn || th->ack || th->fin || th->rst)
+		return -EINVAL;
+
+	if (unlikely(iph_len < sizeof(struct iphdr)))
+		return -EINVAL;
+
+	/* Both struct iphdr and struct ipv6hdr have the version field at the
+	 * same offset so we can cast to the shorter header (struct iphdr).
+	 */
+	switch (((struct iphdr *)iph)->version) {
+	case 4:
+		mss = tcp_parse_mss_option(th, 0) ?: TCP_MSS_DEFAULT;
+		cookie = __cookie_v4_init_sequence(iph, th, &mss);
+		break;
+
+#if IS_BUILTIN(CONFIG_IPV6)
+	case 6: {
+		const u16 mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
+
+		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+
+		mss = tcp_parse_mss_option(th, 0) ?: mss_clamp;
+		cookie = __cookie_v6_init_sequence(iph, th, &mss);
+		break;
+		}
+#endif /* CONFIG_IPV6 */
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	return cookie | ((u64)mss << 32);
+#else
+	return -EOPNOTSUPP;
+#endif /* CONFIG_SYN_COOKIES */
+}
+
+static const struct bpf_func_proto bpf_tcp_raw_gen_syncookie_proto = {
+	.func		= bpf_tcp_raw_gen_syncookie,
+	.gpl_only	= true, /* __cookie_v*_init_sequence() is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
+BPF_CALL_4(bpf_tcp_raw_check_syncookie, void *, iph, u32, iph_len,
+	   struct tcphdr *, th, u32, th_len)
+{
+#ifdef CONFIG_SYN_COOKIES
+	u32 cookie;
+	int ret;
+
+	if (unlikely(th_len < sizeof(*th)))
+		return -EINVAL;
+
+	if (!th->ack || th->rst || th->syn)
+		return -EINVAL;
+
+	if (unlikely(iph_len < sizeof(struct iphdr)))
+		return -EINVAL;
+
+	cookie = ntohl(th->ack_seq) - 1;
+
+	/* Both struct iphdr and struct ipv6hdr have the version field at the
+	 * same offset so we can cast to the shorter header (struct iphdr).
+	 */
+	switch (((struct iphdr *)iph)->version) {
+	case 4:
+		ret = __cookie_v4_check((struct iphdr *)iph, th, cookie);
+		break;
+
+#if IS_BUILTIN(CONFIG_IPV6)
+	case 6:
+		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+
+		ret = __cookie_v6_check((struct ipv6hdr *)iph, th, cookie);
+		break;
+#endif /* CONFIG_IPV6 */
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	if (ret > 0)
+		return 0;
+
+	return -EACCES;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_tcp_raw_check_syncookie_proto = {
+	.func		= bpf_tcp_raw_check_syncookie,
+	.gpl_only	= true, /* __cookie_v*_check is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -7703,6 +7821,10 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_ct_release:
 		return &bpf_ct_release_proto;
 #endif
+	case BPF_FUNC_tcp_raw_gen_syncookie:
+		return &bpf_tcp_raw_gen_syncookie_proto;
+	case BPF_FUNC_tcp_raw_check_syncookie:
+		return &bpf_tcp_raw_check_syncookie_proto;
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 246ab7b5e857..659af6cc7d8c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3961,7 +3961,7 @@ static bool smc_parse_options(const struct tcphdr *th,
 /* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped
  * value on success.
  */
-static u16 tcp_parse_mss_option(const struct tcphdr *th, u16 user_mss)
+u16 tcp_parse_mss_option(const struct tcphdr *th, u16 user_mss)
 {
 	const unsigned char *ptr = (const unsigned char *)(th + 1);
 	int length = (th->doff * 4) - sizeof(struct tcphdr);
@@ -4000,6 +4000,7 @@ static u16 tcp_parse_mss_option(const struct tcphdr *th, u16 user_mss)
 	}
 	return mss;
 }
+EXPORT_SYMBOL_GPL(tcp_parse_mss_option);
 
 /* Look for tcp options. Normally only called on SYN and SYNACK packets.
  * But, this can also be called on packets in the established flow when
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 883de3f1bb8b..e32f72077250 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4998,6 +4998,61 @@ union bpf_attr {
  *		pointer that was returned from **bpf_ct_lookup_xxx**\ ().
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * s64 bpf_tcp_raw_gen_syncookie(void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Try to issue a SYN cookie for the packet with corresponding
+ *		IP/TCP headers, *iph* and *th*, without depending on a listening
+ *		socket.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ipv6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
+ *	Return
+ *		On success, the lower 32 bits hold the generated SYN cookie,
+ *		followed by 16 bits which hold the MSS value for that cookie,
+ *		and the top 16 bits are unused.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** if the packet or input arguments are invalid.
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
+ *
+ *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
+ *		CONFIG_IPV6 is disabled).
+ *
+ * int bpf_tcp_raw_check_syncookie(void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Check whether *iph* and *th* contain a valid SYN cookie ACK
+ *		without depending on a listening socket.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ip6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains the length of the TCP header (at least
+ *		**sizeof**\ (**struct tcphdr**)).
+ *	Return
+ *		0 if *iph* and *th* are a valid SYN cookie ACK.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EACCES** if the SYN cookie is not valid.
+ *
+ *		**-EINVAL** if the packet or input arguments are invalid.
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
+ *
+ *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
+ *		CONFIG_IPV6 is disabled).
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5181,6 +5236,8 @@ union bpf_attr {
 	FN(ct_lookup_tcp),		\
 	FN(ct_lookup_udp),		\
 	FN(ct_release),			\
+	FN(tcp_raw_gen_syncookie),	\
+	FN(tcp_raw_check_syncookie),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2
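
For reference, the packed return value of bpf_tcp_raw_gen_syncookie
described above (cookie in the lower 32 bits, MSS in the next 16) can
be unpacked as in the userspace sketch below. The helper names here
are illustrative only and not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Unpack the s64 returned by bpf_tcp_raw_gen_syncookie on success:
 * bits 0..31 hold the SYN cookie, bits 32..47 hold the MSS value,
 * and the top 16 bits are unused.
 */
static uint32_t syncookie_from_ret(int64_t ret)
{
	return (uint32_t)ret;
}

static uint16_t mss_from_ret(int64_t ret)
{
	return (uint16_t)(ret >> 32);
}
```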



* [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-19 14:46 [PATCH bpf-next 00/10] New BPF helpers to accelerate synproxy Maxim Mikityanskiy
                   ` (7 preceding siblings ...)
  2021-10-19 14:46 ` [PATCH bpf-next 08/10] bpf: Add helpers to issue and check SYN cookies in XDP Maxim Mikityanskiy
@ 2021-10-19 14:46 ` Maxim Mikityanskiy
  2021-10-19 16:45   ` Eric Dumazet
  2021-10-20 15:56   ` Lorenz Bauer
  2021-10-19 14:46 ` [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers Maxim Mikityanskiy
  9 siblings, 2 replies; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux,
	Maxim Mikityanskiy

The new helper bpf_tcp_raw_gen_tscookie allows an XDP program to
generate timestamp cookies (to be used together with SYN cookies),
which encode options set by the client in the SYN packet: SACK
support, ECN support and window scale. These options are encoded in
the lower bits of the timestamp, which the client will echo back in a
subsequent ACK packet. The format is the same as the one used by
synproxy.
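
The low-bit layout can be sketched in plain userspace C as follows.
This is an illustration, not the kernel code itself: the constants
mirror the synproxy definitions used by net/ipv4/syncookies.c, and
pack_tscookie is a name chosen here for the example. Note the kernel
additionally clamps the window scale to TCP_MAX_WSCALE, which this
sketch omits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TS_OPT_WSCALE_MASK	0xf		/* bits 0..3: window scale */
#define TS_OPT_SACK		(1 << 4)	/* bit 4: SACK permitted */
#define TS_OPT_ECN		(1 << 5)	/* bit 5: ECN supported */
#define TSMASK			((1 << 6) - 1)	/* all option bits */

/* Pack the client's options into the low bits of a raw timestamp,
 * in the same layout cookie_init_timestamp_raw() produces for tsval.
 */
static uint32_t pack_tscookie(uint32_t ts_now, uint8_t wscale,
			      bool sack, bool ecn)
{
	uint32_t cookie = ts_now & ~TSMASK;

	cookie |= wscale & TS_OPT_WSCALE_MASK;
	if (sack)
		cookie |= TS_OPT_SACK;
	if (ecn)
		cookie |= TS_OPT_ECN;
	return cookie;
}
```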

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 include/net/tcp.h              |  1 +
 include/uapi/linux/bpf.h       | 27 +++++++++++++++
 net/core/filter.c              | 38 +++++++++++++++++++++
 net/ipv4/syncookies.c          | 60 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 27 +++++++++++++++
 5 files changed, 153 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1cc96a225848..651820bef6a2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -564,6 +564,7 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
 			      u16 *mssp);
 __u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mss);
 u64 cookie_init_timestamp(struct request_sock *req, u64 now);
+bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr);
 bool cookie_timestamp_decode(const struct net *net,
 			     struct tcp_options_received *opt);
 bool cookie_ecn_ok(const struct tcp_options_received *opt,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e32f72077250..791790b41874 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5053,6 +5053,32 @@ union bpf_attr {
  *
  *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
  *		CONFIG_IPV6 is disabled).
+ *
+ * int bpf_tcp_raw_gen_tscookie(struct tcphdr *th, u32 th_len, __be32 *tsopt, u32 tsopt_len)
+ *	Description
+ *		Try to generate a timestamp cookie which encodes some of the
+ *		flags sent by the client in the SYN packet: SACK support, ECN
+ *		support, window scale. To be used with SYN cookies.
+ *
+ *		*th* points to the start of the TCP header of the client's SYN
+ *		packet, while *th_len* contains the length of the TCP header (at
+ *		least **sizeof**\ (**struct tcphdr**)).
+ *
+ *		*tsopt* points to the output location where the resulting
+ *		timestamp values (tsval and tsecr, in the format of the TCP
+ *		timestamp option) are written.
+ *
+ *	Return
+ *		On success, 0.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** if the input arguments are invalid.
+ *
+ *		**-ENOENT** if the TCP header doesn't have the timestamp option.
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5238,6 +5264,7 @@ union bpf_attr {
 	FN(ct_release),			\
 	FN(tcp_raw_gen_syncookie),	\
 	FN(tcp_raw_check_syncookie),	\
+	FN(tcp_raw_gen_tscookie),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 5f03d4a282a0..73fe20ef7442 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -7403,6 +7403,42 @@ static const struct bpf_func_proto bpf_tcp_raw_check_syncookie_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_4(bpf_tcp_raw_gen_tscookie, struct tcphdr *, th, u32, th_len,
+	   __be32 *, tsopt, u32, tsopt_len)
+{
+	int err;
+
+#ifdef CONFIG_SYN_COOKIES
+	if (tsopt_len != sizeof(u64)) {
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	if (!cookie_init_timestamp_raw(th, &tsopt[0], &tsopt[1])) {
+		err = -ENOENT;
+		goto err_out;
+	}
+
+	return 0;
+err_out:
+#else
+	err = -EOPNOTSUPP;
+#endif
+	memset(tsopt, 0, tsopt_len);
+	return err;
+}
+
+static const struct bpf_func_proto bpf_tcp_raw_gen_tscookie_proto = {
+	.func		= bpf_tcp_raw_gen_tscookie,
+	.gpl_only	= false,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -7825,6 +7861,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_gen_syncookie_proto;
 	case BPF_FUNC_tcp_raw_check_syncookie:
 		return &bpf_tcp_raw_check_syncookie_proto;
+	case BPF_FUNC_tcp_raw_gen_tscookie:
+		return &bpf_tcp_raw_gen_tscookie_proto;
 #endif
 	default:
 		return bpf_sk_base_func_proto(func_id);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8696dc343ad2..4dd2c7a096eb 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -85,6 +85,66 @@ u64 cookie_init_timestamp(struct request_sock *req, u64 now)
 	return (u64)ts * (NSEC_PER_SEC / TCP_TS_HZ);
 }
 
+bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr)
+{
+	int length = (th->doff * 4) - sizeof(*th);
+	u8 wscale = TS_OPT_WSCALE_MASK;
+	bool option_timestamp = false;
+	bool option_sack = false;
+	u32 cookie;
+	u8 *ptr;
+
+	ptr = (u8 *)(th + 1);
+
+	while (length > 0) {
+		u8 opcode = *ptr++;
+		u8 opsize;
+
+		if (opcode == TCPOPT_EOL)
+			break;
+		if (opcode == TCPOPT_NOP) {
+			length--;
+			continue;
+		}
+
+		if (length < 2)
+			break;
+		opsize = *ptr++;
+		if (opsize < 2)
+			break;
+		if (opsize > length)
+			break;
+
+		switch (opcode) {
+		case TCPOPT_WINDOW:
+			wscale = min_t(u8, *ptr, TCP_MAX_WSCALE);
+			break;
+		case TCPOPT_TIMESTAMP:
+			option_timestamp = true;
+			/* Client's tsval becomes our tsecr. */
+			*tsecr = cpu_to_be32(get_unaligned_be32(ptr));
+			break;
+		case TCPOPT_SACK_PERM:
+			option_sack = true;
+			break;
+		}
+
+		ptr += opsize - 2;
+		length -= opsize;
+	}
+
+	if (!option_timestamp)
+		return false;
+
+	cookie = tcp_time_stamp_raw() & ~TSMASK;
+	cookie |= wscale & TS_OPT_WSCALE_MASK;
+	if (option_sack)
+		cookie |= TS_OPT_SACK;
+	if (th->ece && th->cwr)
+		cookie |= TS_OPT_ECN;
+	*tsval = cpu_to_be32(cookie);
+	return true;
+}
 
 static __u32 secure_tcp_syn_cookie(__be32 saddr, __be32 daddr, __be16 sport,
 				   __be16 dport, __u32 sseq, __u32 data)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e32f72077250..791790b41874 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5053,6 +5053,32 @@ union bpf_attr {
  *
  *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
  *		CONFIG_IPV6 is disabled).
+ *
+ * int bpf_tcp_raw_gen_tscookie(struct tcphdr *th, u32 th_len, __be32 *tsopt, u32 tsopt_len)
+ *	Description
+ *		Try to generate a timestamp cookie which encodes some of the
+ *		flags sent by the client in the SYN packet: SACK support, ECN
+ *		support, window scale. To be used with SYN cookies.
+ *
+ *		*th* points to the start of the TCP header of the client's SYN
+ *		packet, while *th_len* contains the length of the TCP header (at
+ *		least **sizeof**\ (**struct tcphdr**)).
+ *
+ *		*tsopt* points to the output location where the resulting
+ *		timestamp values (tsval and tsecr, in the format of the TCP
+ *		timestamp option) are written.
+ *
+ *	Return
+ *		On success, 0.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** if the input arguments are invalid.
+ *
+ *		**-ENOENT** if the TCP header doesn't have the timestamp option.
+ *
+ *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
+ *		cookies (CONFIG_SYN_COOKIES is off).
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5238,6 +5264,7 @@ union bpf_attr {
 	FN(ct_release),			\
 	FN(tcp_raw_gen_syncookie),	\
 	FN(tcp_raw_check_syncookie),	\
+	FN(tcp_raw_gen_tscookie),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2



* [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers
  2021-10-19 14:46 [PATCH bpf-next 00/10] New BPF helpers to accelerate synproxy Maxim Mikityanskiy
                   ` (8 preceding siblings ...)
  2021-10-19 14:46 ` [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp " Maxim Mikityanskiy
@ 2021-10-19 14:46 ` Maxim Mikityanskiy
  2021-10-20 18:01   ` Joe Stringer
  2021-10-21  1:06   ` Alexei Starovoitov
  9 siblings, 2 replies; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-19 14:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux,
	Maxim Mikityanskiy

This commit adds a sample for the new BPF helpers: bpf_ct_lookup_tcp,
bpf_tcp_raw_gen_syncookie and bpf_tcp_raw_check_syncookie.

samples/bpf/syncookie_kern.c is a BPF program that generates SYN cookies
on allowed TCP ports and sends SYNACKs to clients, accelerating the
synproxy iptables module.

samples/bpf/syncookie_user.c is a userspace control application that
allows configuring the following options at runtime: the list of
allowed ports, MSS, window scale and TTL.

samples/bpf/syncookie_test.sh is a script that demonstrates the setup of
synproxy with XDP acceleration.
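
The BPF program unpacks these tunables from a single __u64 stored in
the "values" map (see values_get_tcpipopts in syncookie_kern.c). The
matching userspace packing can be sketched as below; pack_tcpipopts
is an illustrative name, not a function from the sample itself:

```c
#include <assert.h>
#include <stdint.h>

/* Pack mss4/mss6/wscale/ttl the way values_get_tcpipopts() in
 * syncookie_kern.c unpacks them:
 *   bits  0..15: IPv4 MSS
 *   bits 16..19: window scale
 *   bits 24..31: TTL
 *   bits 32..47: IPv6 MSS
 */
static uint64_t pack_tcpipopts(uint16_t mss4, uint16_t mss6,
			       uint8_t wscale, uint8_t ttl)
{
	return (uint64_t)mss4 |
	       ((uint64_t)(wscale & 0xf) << 16) |
	       ((uint64_t)ttl << 24) |
	       ((uint64_t)mss6 << 32);
}
```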

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 samples/bpf/.gitignore        |   1 +
 samples/bpf/Makefile          |   3 +
 samples/bpf/syncookie_kern.c  | 591 ++++++++++++++++++++++++++++++++++
 samples/bpf/syncookie_test.sh |  55 ++++
 samples/bpf/syncookie_user.c  | 388 ++++++++++++++++++++++
 5 files changed, 1038 insertions(+)
 create mode 100644 samples/bpf/syncookie_kern.c
 create mode 100755 samples/bpf/syncookie_test.sh
 create mode 100644 samples/bpf/syncookie_user.c

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0e7bfdbff80a..6b74e835d323 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -61,3 +61,4 @@ iperf.*
 /vmlinux.h
 /bpftool/
 /libbpf/
+syncookie
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4c5ad15f8d28..59d90c76bfea 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ tprogs-y += task_fd_query
 tprogs-y += xdp_sample_pkts
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += syncookie
 
 tprogs-y += xdp_redirect_cpu
 tprogs-y += xdp_redirect_map_multi
@@ -118,6 +119,7 @@ task_fd_query-objs := task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o
 ibumad-objs := ibumad_user.o
 hbm-objs := hbm.o $(CGROUP_HELPERS)
+syncookie-objs := syncookie_user.o
 
 xdp_redirect_map_multi-objs := xdp_redirect_map_multi_user.o $(XDP_SAMPLE)
 xdp_redirect_cpu-objs := xdp_redirect_cpu_user.o $(XDP_SAMPLE)
@@ -181,6 +183,7 @@ always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
 always-y += xdpsock_kern.o
+always-y += syncookie_kern.o
 
 ifeq ($(ARCH), arm)
 # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux
diff --git a/samples/bpf/syncookie_kern.c b/samples/bpf/syncookie_kern.c
new file mode 100644
index 000000000000..b581ae30b650
--- /dev/null
+++ b/samples/bpf/syncookie_kern.c
@@ -0,0 +1,591 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <stdbool.h>
+#include <stddef.h>
+
+#include <uapi/linux/errno.h>
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/pkt_cls.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/in.h>
+#include <uapi/linux/ip.h>
+#include <uapi/linux/ipv6.h>
+#include <uapi/linux/tcp.h>
+#include <uapi/linux/netfilter/nf_conntrack_common.h>
+#include <linux/minmax.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define DEFAULT_MSS4 1460
+#define DEFAULT_MSS6 1440
+#define DEFAULT_WSCALE 7
+#define DEFAULT_TTL 64
+#define MAX_ALLOWED_PORTS 8
+
+struct bpf_map_def SEC("maps") values = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u64),
+	.max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") allowed_ports = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u16),
+	.max_entries = MAX_ALLOWED_PORTS,
+};
+
+#define IP_DF 0x4000
+#define IP_MF 0x2000
+#define IP_OFFSET 0x1fff
+
+#define NEXTHDR_TCP 6
+
+#define TCPOPT_NOP 1
+#define TCPOPT_EOL 0
+#define TCPOPT_MSS 2
+#define TCPOPT_WINDOW 3
+#define TCPOPT_SACK_PERM 4
+#define TCPOPT_TIMESTAMP 8
+
+#define TCPOLEN_MSS 4
+#define TCPOLEN_WINDOW 3
+#define TCPOLEN_SACK_PERM 2
+#define TCPOLEN_TIMESTAMP 10
+
+#define IPV4_MAXLEN 60
+#define TCP_MAXLEN 60
+
+static __always_inline void swap_eth_addr(__u8 *a, __u8 *b)
+{
+	__u8 tmp[ETH_ALEN];
+
+	__builtin_memcpy(tmp, a, ETH_ALEN);
+	__builtin_memcpy(a, b, ETH_ALEN);
+	__builtin_memcpy(b, tmp, ETH_ALEN);
+}
+
+static __always_inline __u16 csum_fold(__u32 csum)
+{
+	csum = (csum & 0xffff) + (csum >> 16);
+	csum = (csum & 0xffff) + (csum >> 16);
+	return (__u16)~csum;
+}
+
+static __always_inline __u16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
+					       __u32 len, __u8 proto,
+					       __u32 csum)
+{
+	__u64 s = csum;
+
+	s += (__u32)saddr;
+	s += (__u32)daddr;
+#if defined(__BIG_ENDIAN__)
+	s += proto + len;
+#elif defined(__LITTLE_ENDIAN__)
+	s += (proto + len) << 8;
+#else
+#error Unknown endian
+#endif
+	s = (s & 0xffffffff) + (s >> 32);
+	s = (s & 0xffffffff) + (s >> 32);
+
+	return csum_fold((__u32)s);
+}
+
+static __always_inline __u16 csum_ipv6_magic(const struct in6_addr *saddr,
+					     const struct in6_addr *daddr,
+					     __u32 len, __u8 proto, __u32 csum)
+{
+	__u64 sum = csum;
+	int i;
+
+#pragma unroll
+	for (i = 0; i < 4; i++)
+		sum += (__u32)saddr->s6_addr32[i];
+
+#pragma unroll
+	for (i = 0; i < 4; i++)
+		sum += (__u32)daddr->s6_addr32[i];
+
+	// Don't combine additions to avoid 32-bit overflow.
+	sum += bpf_htonl(len);
+	sum += bpf_htonl(proto);
+
+	sum = (sum & 0xffffffff) + (sum >> 32);
+	sum = (sum & 0xffffffff) + (sum >> 32);
+
+	return csum_fold((__u32)sum);
+}
+
+static __always_inline void values_get_tcpipopts(__u16 *mss, __u8 *wscale,
+						 __u8 *ttl, bool ipv6)
+{
+	__u32 key = 0;
+	__u64 *value;
+
+	value = bpf_map_lookup_elem(&values, &key);
+	if (value && *value != 0) {
+		if (ipv6)
+			*mss = (*value >> 32) & 0xffff;
+		else
+			*mss = *value & 0xffff;
+		*wscale = (*value >> 16) & 0xf;
+		*ttl = (*value >> 24) & 0xff;
+		return;
+	}
+
+	*mss = ipv6 ? DEFAULT_MSS6 : DEFAULT_MSS4;
+	*wscale = DEFAULT_WSCALE;
+	*ttl = DEFAULT_TTL;
+}
+
+static __always_inline void values_inc_synacks(void)
+{
+	__u32 key = 1;
+	__u32 *value;
+
+	value = bpf_map_lookup_elem(&values, &key);
+	if (value)
+		__sync_fetch_and_add(value, 1);
+}
+
+static __always_inline bool check_port_allowed(__u16 port)
+{
+	__u32 i;
+
+	for (i = 0; i < MAX_ALLOWED_PORTS; i++) {
+		__u32 key = i;
+		__u16 *value;
+
+		value = bpf_map_lookup_elem(&allowed_ports, &key);
+
+		if (!value)
+			break;
+		// 0 is a terminator value. Check it first to avoid matching on
+		// a forbidden port == 0 and returning true.
+		if (*value == 0)
+			break;
+
+		if (*value == port)
+			return true;
+	}
+
+	return false;
+}
+
+struct header_pointers {
+	struct ethhdr *eth;
+	struct iphdr *ipv4;
+	struct ipv6hdr *ipv6;
+	struct tcphdr *tcp;
+	__u16 tcp_len;
+};
+
+static __always_inline int tcp_dissect(void *data, void *data_end,
+				       struct header_pointers *hdr)
+{
+	hdr->eth = data;
+	if (hdr->eth + 1 > data_end)
+		return XDP_DROP;
+
+	switch (bpf_ntohs(hdr->eth->h_proto)) {
+	case ETH_P_IP:
+		hdr->ipv6 = NULL;
+
+		hdr->ipv4 = (void *)hdr->eth + sizeof(*hdr->eth);
+		if (hdr->ipv4 + 1 > data_end)
+			return XDP_DROP;
+		if (hdr->ipv4->ihl * 4 < sizeof(*hdr->ipv4))
+			return XDP_DROP;
+		if (hdr->ipv4->version != 4)
+			return XDP_DROP;
+
+		if (hdr->ipv4->protocol != IPPROTO_TCP)
+			return XDP_PASS;
+
+		hdr->tcp = (void *)hdr->ipv4 + hdr->ipv4->ihl * 4;
+		break;
+	case ETH_P_IPV6:
+		hdr->ipv4 = NULL;
+
+		hdr->ipv6 = (void *)hdr->eth + sizeof(*hdr->eth);
+		if (hdr->ipv6 + 1 > data_end)
+			return XDP_DROP;
+		if (hdr->ipv6->version != 6)
+			return XDP_DROP;
+
+		// XXX: Extension headers are not supported and could circumvent
+		// XDP SYN flood protection.
+		if (hdr->ipv6->nexthdr != NEXTHDR_TCP)
+			return XDP_PASS;
+
+		hdr->tcp = (void *)hdr->ipv6 + sizeof(*hdr->ipv6);
+		break;
+	default:
+		// XXX: VLANs will circumvent XDP SYN flood protection.
+		return XDP_PASS;
+	}
+
+	if (hdr->tcp + 1 > data_end)
+		return XDP_DROP;
+	hdr->tcp_len = hdr->tcp->doff * 4;
+	if (hdr->tcp_len < sizeof(*hdr->tcp))
+		return XDP_DROP;
+
+	return XDP_TX;
+}
+
+static __always_inline __u8 tcp_mkoptions(__be32 *buf, __be32 *tsopt, __u16 mss,
+					  __u8 wscale)
+{
+	__be32 *start = buf;
+
+	*buf++ = bpf_htonl((TCPOPT_MSS << 24) | (TCPOLEN_MSS << 16) | mss);
+
+	if (!tsopt)
+		return buf - start;
+
+	if (tsopt[0] & bpf_htonl(1 << 4))
+		*buf++ = bpf_htonl((TCPOPT_SACK_PERM << 24) |
+				   (TCPOLEN_SACK_PERM << 16) |
+				   (TCPOPT_TIMESTAMP << 8) |
+				   TCPOLEN_TIMESTAMP);
+	else
+		*buf++ = bpf_htonl((TCPOPT_NOP << 24) |
+				   (TCPOPT_NOP << 16) |
+				   (TCPOPT_TIMESTAMP << 8) |
+				   TCPOLEN_TIMESTAMP);
+	*buf++ = tsopt[0];
+	*buf++ = tsopt[1];
+
+	if ((tsopt[0] & bpf_htonl(0xf)) != bpf_htonl(0xf))
+		*buf++ = bpf_htonl((TCPOPT_NOP << 24) |
+				   (TCPOPT_WINDOW << 16) |
+				   (TCPOLEN_WINDOW << 8) |
+				   wscale);
+
+	return buf - start;
+}
+
+static __always_inline void tcp_gen_synack(struct tcphdr *tcp_header,
+					   __u32 cookie, __be32 *tsopt,
+					   __u16 mss, __u8 wscale)
+{
+	void *tcp_options;
+
+	tcp_flag_word(tcp_header) = TCP_FLAG_SYN | TCP_FLAG_ACK;
+	if (tsopt && (tsopt[0] & bpf_htonl(1 << 5)))
+		tcp_flag_word(tcp_header) |= TCP_FLAG_ECE;
+	tcp_header->doff = 5; // doff is part of tcp_flag_word.
+	swap(tcp_header->source, tcp_header->dest);
+	tcp_header->ack_seq = bpf_htonl(bpf_ntohl(tcp_header->seq) + 1);
+	tcp_header->seq = bpf_htonl(cookie);
+	tcp_header->window = 0;
+	tcp_header->urg_ptr = 0;
+	tcp_header->check = 0; // Rely on hardware checksum offload.
+
+	tcp_options = (void *)(tcp_header + 1);
+	tcp_header->doff += tcp_mkoptions(tcp_options, tsopt, mss, wscale);
+}
+
+static __always_inline void tcpv4_gen_synack(struct header_pointers *hdr,
+					     __u32 cookie, __be32 *tsopt)
+{
+	__u8 wscale;
+	__u16 mss;
+	__u8 ttl;
+
+	values_get_tcpipopts(&mss, &wscale, &ttl, false);
+
+	swap_eth_addr(hdr->eth->h_source, hdr->eth->h_dest);
+
+	swap(hdr->ipv4->saddr, hdr->ipv4->daddr);
+	hdr->ipv4->check = 0; // Rely on hardware checksum offload.
+	hdr->ipv4->tos = 0;
+	hdr->ipv4->id = 0;
+	hdr->ipv4->ttl = ttl;
+
+	tcp_gen_synack(hdr->tcp, cookie, tsopt, mss, wscale);
+
+	hdr->tcp_len = hdr->tcp->doff * 4;
+	hdr->ipv4->tot_len = bpf_htons(sizeof(*hdr->ipv4) + hdr->tcp_len);
+}
+
+static __always_inline void tcpv6_gen_synack(struct header_pointers *hdr,
+					     __u32 cookie, __be32 *tsopt)
+{
+	__u8 wscale;
+	__u16 mss;
+	__u8 ttl;
+
+	values_get_tcpipopts(&mss, &wscale, &ttl, true);
+
+	swap_eth_addr(hdr->eth->h_source, hdr->eth->h_dest);
+
+	swap(hdr->ipv6->saddr, hdr->ipv6->daddr);
+	*(__be32 *)hdr->ipv6 = bpf_htonl(0x60000000);
+	hdr->ipv6->hop_limit = ttl;
+
+	tcp_gen_synack(hdr->tcp, cookie, tsopt, mss, wscale);
+
+	hdr->tcp_len = hdr->tcp->doff * 4;
+	hdr->ipv6->payload_len = bpf_htons(hdr->tcp_len);
+}
+
+static __always_inline int syncookie_handle_syn(struct header_pointers *hdr,
+						struct xdp_md *ctx,
+						void *data, void *data_end)
+{
+	__u32 old_pkt_size, new_pkt_size;
+	// Unlike clang 10, clang 11 and 12 generate code that doesn't pass the
+	// BPF verifier if tsopt is not volatile. Volatile forces it to store
+	// the pointer value and use it directly, otherwise tcp_mkoptions is
+	// (mis)compiled like this:
+	//   if (!tsopt)
+	//       return buf - start;
+	//   reg = stored_return_value_of_bpf_tcp_raw_gen_tscookie;
+	//   if (reg == 0)
+	//       tsopt = tsopt_buf;
+	//   else
+	//       tsopt = NULL;
+	//   ...
+	//   *buf++ = tsopt[1];
+	// It creates a dead branch where tsopt is assigned NULL, but the
+	// verifier can't prove it's dead and blocks the program.
+	__be32 * volatile tsopt = NULL;
+	__be32 tsopt_buf[2];
+	void *ip_header;
+	__u16 ip_len;
+	__u32 cookie;
+	__s64 value;
+
+	if (hdr->ipv4) {
+		// Check the IPv4 and TCP checksums before creating a SYNACK.
+		value = bpf_csum_diff(0, 0, (void *)hdr->ipv4, hdr->ipv4->ihl * 4, 0);
+		if (value < 0)
+			return XDP_ABORTED;
+		if (csum_fold(value) != 0)
+			return XDP_DROP; // Bad IPv4 checksum.
+
+		value = bpf_csum_diff(0, 0, (void *)hdr->tcp, hdr->tcp_len, 0);
+		if (value < 0)
+			return XDP_ABORTED;
+		if (csum_tcpudp_magic(hdr->ipv4->saddr, hdr->ipv4->daddr,
+				      hdr->tcp_len, IPPROTO_TCP, value) != 0)
+			return XDP_DROP; // Bad TCP checksum.
+
+		ip_header = hdr->ipv4;
+		ip_len = sizeof(*hdr->ipv4);
+	} else if (hdr->ipv6) {
+		// Check the TCP checksum before creating a SYNACK.
+		value = bpf_csum_diff(0, 0, (void *)hdr->tcp, hdr->tcp_len, 0);
+		if (value < 0)
+			return XDP_ABORTED;
+		if (csum_ipv6_magic(&hdr->ipv6->saddr, &hdr->ipv6->daddr,
+				    hdr->tcp_len, IPPROTO_TCP, value) != 0)
+			return XDP_DROP; // Bad TCP checksum.
+
+		ip_header = hdr->ipv6;
+		ip_len = sizeof(*hdr->ipv6);
+	} else {
+		return XDP_ABORTED;
+	}
+
+	// Issue SYN cookies on allowed ports, drop SYN packets on
+	// blocked ports.
+	if (!check_port_allowed(bpf_ntohs(hdr->tcp->dest)))
+		return XDP_DROP;
+
+	value = bpf_tcp_raw_gen_syncookie(ip_header, ip_len,
+					  (void *)hdr->tcp, hdr->tcp_len);
+	if (value < 0)
+		return XDP_ABORTED;
+	cookie = (__u32)value;
+
+	if (bpf_tcp_raw_gen_tscookie((void *)hdr->tcp, hdr->tcp_len,
+				     tsopt_buf, sizeof(tsopt_buf)) == 0)
+		tsopt = tsopt_buf;
+
+	// Check that there is enough space for a SYNACK. It also covers
+	// the check that the destination of the __builtin_memmove below
+	// doesn't overflow.
+	if (data + sizeof(*hdr->eth) + ip_len + TCP_MAXLEN > data_end)
+		return XDP_ABORTED;
+
+	if (hdr->ipv4) {
+		if (hdr->ipv4->ihl * 4 > sizeof(*hdr->ipv4)) {
+			struct tcphdr *new_tcp_header;
+
+			new_tcp_header = data + sizeof(*hdr->eth) + sizeof(*hdr->ipv4);
+			__builtin_memmove(new_tcp_header, hdr->tcp, sizeof(*hdr->tcp));
+			hdr->tcp = new_tcp_header;
+
+			hdr->ipv4->ihl = sizeof(*hdr->ipv4) / 4;
+		}
+
+		tcpv4_gen_synack(hdr, cookie, tsopt);
+	} else if (hdr->ipv6) {
+		tcpv6_gen_synack(hdr, cookie, tsopt);
+	} else {
+		return XDP_ABORTED;
+	}
+
+	// Recalculate checksums.
+	hdr->tcp->check = 0;
+	value = bpf_csum_diff(0, 0, (void *)hdr->tcp, hdr->tcp_len, 0);
+	if (value < 0)
+		return XDP_ABORTED;
+	if (hdr->ipv4) {
+		hdr->tcp->check = csum_tcpudp_magic(hdr->ipv4->saddr,
+						    hdr->ipv4->daddr,
+						    hdr->tcp_len,
+						    IPPROTO_TCP,
+						    value);
+
+		hdr->ipv4->check = 0;
+		value = bpf_csum_diff(0, 0, (void *)hdr->ipv4, sizeof(*hdr->ipv4), 0);
+		if (value < 0)
+			return XDP_ABORTED;
+		hdr->ipv4->check = csum_fold(value);
+	} else if (hdr->ipv6) {
+		hdr->tcp->check = csum_ipv6_magic(&hdr->ipv6->saddr,
+						  &hdr->ipv6->daddr,
+						  hdr->tcp_len,
+						  IPPROTO_TCP,
+						  value);
+	} else {
+		return XDP_ABORTED;
+	}
+
+	// Set the new packet size.
+	old_pkt_size = data_end - data;
+	new_pkt_size = sizeof(*hdr->eth) + ip_len + hdr->tcp->doff * 4;
+	if (bpf_xdp_adjust_tail(ctx, new_pkt_size - old_pkt_size))
+		return XDP_ABORTED;
+
+	values_inc_synacks();
+
+	return XDP_TX;
+}
+
+static __always_inline int syncookie_handle_ack(struct header_pointers *hdr)
+{
+	int err;
+
+	if (hdr->ipv4)
+		err = bpf_tcp_raw_check_syncookie(hdr->ipv4, sizeof(*hdr->ipv4),
+						  (void *)hdr->tcp, hdr->tcp_len);
+	else if (hdr->ipv6)
+		err = bpf_tcp_raw_check_syncookie(hdr->ipv6, sizeof(*hdr->ipv6),
+						  (void *)hdr->tcp, hdr->tcp_len);
+	else
+		return XDP_ABORTED;
+	if (err)
+		return XDP_DROP;
+
+	return XDP_PASS;
+}
+
+SEC("xdp/syncookie")
+int syncookie_xdp(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct header_pointers hdr;
+	struct bpf_sock_tuple tup;
+	struct bpf_nf_conn *ct;
+	__u32 tup_size;
+	__s64 value;
+	int ret;
+
+	ret = tcp_dissect(data, data_end, &hdr);
+	if (ret != XDP_TX)
+		return ret;
+
+	if (hdr.ipv4) {
+		// TCP doesn't normally use fragments, and XDP can't reassemble them.
+		if ((hdr.ipv4->frag_off & bpf_htons(IP_DF | IP_MF | IP_OFFSET)) != bpf_htons(IP_DF))
+			return XDP_DROP;
+
+		tup.ipv4.saddr = hdr.ipv4->saddr;
+		tup.ipv4.daddr = hdr.ipv4->daddr;
+		tup.ipv4.sport = hdr.tcp->source;
+		tup.ipv4.dport = hdr.tcp->dest;
+		tup_size = sizeof(tup.ipv4);
+	} else if (hdr.ipv6) {
+		__builtin_memcpy(tup.ipv6.saddr, &hdr.ipv6->saddr, sizeof(tup.ipv6.saddr));
+		__builtin_memcpy(tup.ipv6.daddr, &hdr.ipv6->daddr, sizeof(tup.ipv6.daddr));
+		tup.ipv6.sport = hdr.tcp->source;
+		tup.ipv6.dport = hdr.tcp->dest;
+		tup_size = sizeof(tup.ipv6);
+	} else {
+		// The verifier can't track that either ipv4 or ipv6 is not NULL.
+		return XDP_ABORTED;
+	}
+
+	value = 0; // Flags.
+	ct = bpf_ct_lookup_tcp(ctx, &tup, tup_size, BPF_F_CURRENT_NETNS, &value);
+	if (ct) {
+		unsigned long status = ct->status;
+
+		bpf_ct_release(ct);
+		if (status & IPS_CONFIRMED_BIT)
+			return XDP_PASS;
+	} else if (value != -ENOENT) {
+		return XDP_ABORTED;
+	}
+
+	// value == -ENOENT || !(status & IPS_CONFIRMED_BIT)
+
+	if ((hdr.tcp->syn ^ hdr.tcp->ack) != 1)
+		return XDP_DROP;
+
+	// Grow the TCP header to TCP_MAXLEN to be able to pass any hdr.tcp_len
+	// to bpf_tcp_raw_gen_syncookie and pass the verifier.
+	if (bpf_xdp_adjust_tail(ctx, TCP_MAXLEN - hdr.tcp_len))
+		return XDP_ABORTED;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = (void *)(long)ctx->data;
+
+	if (hdr.ipv4) {
+		hdr.eth = data;
+		hdr.ipv4 = (void *)hdr.eth + sizeof(*hdr.eth);
+		// IPV4_MAXLEN is needed when calculating checksum.
+		// At least sizeof(struct iphdr) is needed here to access ihl.
+		if ((void *)hdr.ipv4 + IPV4_MAXLEN > data_end)
+			return XDP_ABORTED;
+		hdr.tcp = (void *)hdr.ipv4 + hdr.ipv4->ihl * 4;
+	} else if (hdr.ipv6) {
+		hdr.eth = data;
+		hdr.ipv6 = (void *)hdr.eth + sizeof(*hdr.eth);
+		hdr.tcp = (void *)hdr.ipv6 + sizeof(*hdr.ipv6);
+	} else {
+		return XDP_ABORTED;
+	}
+
+	if ((void *)hdr.tcp + TCP_MAXLEN > data_end)
+		return XDP_ABORTED;
+
+	// We run out of registers, tcp_len gets spilled to the stack, and the
+	// verifier forgets its min and max values checked above in tcp_dissect.
+	hdr.tcp_len = hdr.tcp->doff * 4;
+	if (hdr.tcp_len < sizeof(*hdr.tcp))
+		return XDP_ABORTED;
+
+	return hdr.tcp->syn ? syncookie_handle_syn(&hdr, ctx, data, data_end) :
+			      syncookie_handle_ack(&hdr);
+}
+
+SEC("xdp/dummy")
+int dummy_xdp(struct xdp_md *ctx)
+{
+	// veth requires XDP programs to be set on both sides.
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/syncookie_test.sh b/samples/bpf/syncookie_test.sh
new file mode 100755
index 000000000000..923f94a7d6f6
--- /dev/null
+++ b/samples/bpf/syncookie_test.sh
@@ -0,0 +1,55 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+set -e
+
+PORT=8080
+
+DIR="$(dirname "$0")"
+SERVER_PID=
+MONITOR_PID=
+
+cleanup() {
+	set +e
+	[ -n "$SERVER_PID" ] && kill "$SERVER_PID"
+	[ -n "$MONITOR_PID" ] && kill "$MONITOR_PID"
+	ip link del tmp0
+	ip netns del synproxy
+}
+
+trap cleanup EXIT
+
+ip netns add synproxy
+ip netns exec synproxy ip link set lo up
+ip link add tmp0 type veth peer name tmp1
+sleep 1 # Wait, otherwise the IP address is not applied to tmp0.
+ip link set tmp1 netns synproxy
+ip link set tmp0 up
+ip addr replace 198.18.0.1/24 dev tmp0
+ip netns exec synproxy ip link set tmp1 up
+ip netns exec synproxy ip addr replace 198.18.0.2/24 dev tmp1
+ip netns exec synproxy sysctl -w net.ipv4.tcp_syncookies=2
+ip netns exec synproxy sysctl -w net.ipv4.tcp_timestamps=1
+ip netns exec synproxy sysctl -w net.netfilter.nf_conntrack_tcp_loose=0
+ip netns exec synproxy iptables -t raw -I PREROUTING \
+	-i tmp1 -p tcp -m tcp --syn --dport "$PORT" -j CT --notrack
+ip netns exec synproxy iptables -t filter -A INPUT \
+	-i tmp1 -p tcp -m tcp --dport "$PORT" -m state --state INVALID,UNTRACKED \
+	-j SYNPROXY --sack-perm --timestamp --wscale 7 --mss 1460
+ip netns exec synproxy iptables -t filter -A INPUT \
+	-i tmp1 -m state --state INVALID -j DROP
+# When checksum offload is enabled, the XDP program sees wrong checksums and
+# drops packets.
+ethtool -K tmp0 tx off
+# Workaround required for veth.
+ip link set tmp0 xdp object "$DIR/syncookie_kern.o" section xdp/dummy
+ip netns exec synproxy "$DIR/syncookie" --iface tmp1 --ports "$PORT" \
+	--mss4 1460 --mss6 1440 --wscale 7 --ttl 64 &
+MONITOR_PID="$!"
+ip netns exec synproxy python3 -m http.server "$PORT" &
+SERVER_PID="$!"
+echo "Waiting a few seconds for the server to start..."
+sleep 5
+wget "http://198.18.0.2:$PORT/" -O /dev/null -o /dev/null
+sleep 1 # Wait for stats to appear.
diff --git a/samples/bpf/syncookie_user.c b/samples/bpf/syncookie_user.c
new file mode 100644
index 000000000000..dcb074405691
--- /dev/null
+++ b/samples/bpf/syncookie_user.c
@@ -0,0 +1,388 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <stdnoreturn.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <signal.h>
+#include <sys/types.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include <net/if.h>
+#include <linux/if_link.h>
+#include <linux/limits.h>
+
+static unsigned int ifindex;
+static __u32 attached_prog_id;
+
+static void noreturn cleanup(int sig)
+{
+	DECLARE_LIBBPF_OPTS(bpf_xdp_set_link_opts, opts);
+	int prog_fd;
+	int err;
+
+	if (attached_prog_id == 0)
+		exit(0);
+
+	prog_fd = bpf_prog_get_fd_by_id(attached_prog_id);
+	if (prog_fd < 0) {
+		fprintf(stderr, "Error: bpf_prog_get_fd_by_id: %s\n", strerror(-prog_fd));
+		err = bpf_set_link_xdp_fd(ifindex, -1, 0);
+		if (err < 0) {
+			fprintf(stderr, "Error: bpf_set_link_xdp_fd: %s\n", strerror(-err));
+			fprintf(stderr, "Failed to detach XDP program\n");
+			exit(1);
+		}
+	} else {
+		opts.old_fd = prog_fd;
+		err = bpf_set_link_xdp_fd_opts(ifindex, -1, XDP_FLAGS_REPLACE, &opts);
+		close(prog_fd);
+		if (err < 0) {
+			fprintf(stderr, "Error: bpf_set_link_xdp_fd_opts: %s\n", strerror(-err));
+			// Not an error if already replaced by someone else.
+			if (err != -EEXIST) {
+				fprintf(stderr, "Failed to detach XDP program\n");
+				exit(1);
+			}
+		}
+	}
+	exit(0);
+}
+
+static noreturn void usage(const char *progname)
+{
+	fprintf(stderr, "Usage: %s [--iface <iface>|--prog <prog_id>] [--mss4 <mss ipv4> --mss6 <mss ipv6> --wscale <wscale> --ttl <ttl>] [--ports <port1>,<port2>,...]\n",
+		progname);
+	exit(1);
+}
+
+static unsigned long parse_arg_ul(const char *progname, const char *arg, unsigned long limit)
+{
+	unsigned long res;
+	char *endptr;
+
+	errno = 0;
+	res = strtoul(arg, &endptr, 10);
+	if (errno != 0 || *endptr != '\0' || arg[0] == '\0' || res > limit)
+		usage(progname);
+
+	return res;
+}
+
+static void parse_options(int argc, char *argv[], unsigned int *ifindex, __u32 *prog_id,
+			  __u64 *tcpipopts, char **ports)
+{
+	static struct option long_options[] = {
+		{ "help", no_argument, NULL, 'h' },
+		{ "iface", required_argument, NULL, 'i' },
+		{ "prog", required_argument, NULL, 'x' },
+		{ "mss4", required_argument, NULL, 4 },
+		{ "mss6", required_argument, NULL, 6 },
+		{ "wscale", required_argument, NULL, 'w' },
+		{ "ttl", required_argument, NULL, 't' },
+		{ "ports", required_argument, NULL, 'p' },
+		{ NULL, 0, NULL, 0 },
+	};
+	unsigned long mss4, mss6, wscale, ttl;
+	unsigned int tcpipopts_mask = 0;
+
+	if (argc < 2)
+		usage(argv[0]);
+
+	*ifindex = 0;
+	*prog_id = 0;
+	*tcpipopts = 0;
+	*ports = 0;
+
+	while (true) {
+		int opt;
+
+		opt = getopt_long(argc, argv, "", long_options, NULL);
+		if (opt == -1)
+			break;
+
+		switch (opt) {
+		case 'h':
+			usage(argv[0]);
+			break;
+		case 'i':
+			*ifindex = if_nametoindex(optarg);
+			if (*ifindex == 0)
+				usage(argv[0]);
+			break;
+		case 'x':
+			*prog_id = parse_arg_ul(argv[0], optarg, UINT32_MAX);
+			if (*prog_id == 0)
+				usage(argv[0]);
+			break;
+		case 4:
+			mss4 = parse_arg_ul(argv[0], optarg, UINT16_MAX);
+			tcpipopts_mask |= 1 << 0;
+			break;
+		case 6:
+			mss6 = parse_arg_ul(argv[0], optarg, UINT16_MAX);
+			tcpipopts_mask |= 1 << 1;
+			break;
+		case 'w':
+			wscale = parse_arg_ul(argv[0], optarg, 14);
+			tcpipopts_mask |= 1 << 2;
+			break;
+		case 't':
+			ttl = parse_arg_ul(argv[0], optarg, UINT8_MAX);
+			tcpipopts_mask |= 1 << 3;
+			break;
+		case 'p':
+			*ports = optarg;
+			break;
+		default:
+			usage(argv[0]);
+		}
+	}
+	if (optind < argc)
+		usage(argv[0]);
+
+	if (tcpipopts_mask == 0xf) {
+		if (mss4 == 0 || mss6 == 0 || wscale == 0 || ttl == 0)
+			usage(argv[0]);
+		*tcpipopts = (mss6 << 32) | (ttl << 24) | (wscale << 16) | mss4;
+	} else if (tcpipopts_mask != 0) {
+		usage(argv[0]);
+	}
+
+	if (*ifindex != 0 && *prog_id != 0)
+		usage(argv[0]);
+	if (*ifindex == 0 && *prog_id == 0)
+		usage(argv[0]);
+}
+
+static int syncookie_attach(const char *argv0, unsigned int ifindex)
+{
+	struct bpf_prog_info info = {};
+	__u32 info_len = sizeof(info);
+	char xdp_filename[PATH_MAX];
+	struct bpf_object *obj;
+	int prog_fd;
+	int err;
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv0);
+	err = bpf_prog_load(xdp_filename, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
+	if (err < 0) {
+		fprintf(stderr, "Error: bpf_prog_load: %s\n", strerror(-err));
+		return err;
+	}
+	err = bpf_obj_get_info_by_fd(prog_fd, &info, &info_len);
+	if (err < 0) {
+		fprintf(stderr, "Error: bpf_obj_get_info_by_fd: %s\n", strerror(-err));
+		goto out;
+	}
+	attached_prog_id = info.id;
+	signal(SIGINT, cleanup);
+	signal(SIGTERM, cleanup);
+	err = bpf_set_link_xdp_fd(ifindex, prog_fd, XDP_FLAGS_UPDATE_IF_NOEXIST);
+	if (err < 0) {
+		fprintf(stderr, "Error: bpf_set_link_xdp_fd: %s\n", strerror(-err));
+		signal(SIGINT, SIG_DFL);
+		signal(SIGTERM, SIG_DFL);
+		attached_prog_id = 0;
+		goto out;
+	}
+	err = 0;
+out:
+	bpf_object__close(obj);
+	return err;
+}
+
+static int syncookie_open_bpf_maps(__u32 prog_id, int *values_map_fd, int *ports_map_fd)
+{
+	struct bpf_prog_info prog_info;
+	__u32 map_ids[3];
+	__u32 info_len;
+	int prog_fd;
+	int err;
+	int i;
+
+	*values_map_fd = -1;
+	*ports_map_fd = -1;
+
+	prog_fd = bpf_prog_get_fd_by_id(prog_id);
+	if (prog_fd < 0) {
+		fprintf(stderr, "Error: bpf_prog_get_fd_by_id: %s\n", strerror(-prog_fd));
+		return prog_fd;
+	}
+
+	prog_info = (struct bpf_prog_info) {
+		.nr_map_ids = 3,
+		.map_ids = (__u64)map_ids,
+	};
+	info_len = sizeof(prog_info);
+
+	err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &info_len);
+	if (err != 0) {
+		fprintf(stderr, "Error: bpf_obj_get_info_by_fd: %s\n", strerror(-err));
+		goto out;
+	}
+
+	if (prog_info.type != BPF_PROG_TYPE_XDP) {
+		fprintf(stderr, "Error: BPF prog type is not BPF_PROG_TYPE_XDP\n");
+		err = -ENOENT;
+		goto out;
+	}
+	if (prog_info.nr_map_ids != 2) {
+		fprintf(stderr, "Error: Found %u BPF maps, expected 2\n",
+			prog_info.nr_map_ids);
+		err = -ENOENT;
+		goto out;
+	}
+
+	for (i = 0; i < prog_info.nr_map_ids; i++) {
+		struct bpf_map_info map_info = {};
+		int map_fd;
+
+		err = bpf_map_get_fd_by_id(map_ids[i]);
+		if (err < 0) {
+			fprintf(stderr, "Error: bpf_map_get_fd_by_id: %s\n", strerror(-err));
+			goto err_close_map_fds;
+		}
+		map_fd = err;
+
+		info_len = sizeof(map_info);
+		err = bpf_obj_get_info_by_fd(map_fd, &map_info, &info_len);
+		if (err != 0) {
+			fprintf(stderr, "Error: bpf_obj_get_info_by_fd: %s\n", strerror(-err));
+			close(map_fd);
+			goto err_close_map_fds;
+		}
+		if (strcmp(map_info.name, "values") == 0) {
+			*values_map_fd = map_fd;
+			continue;
+		}
+		if (strcmp(map_info.name, "allowed_ports") == 0) {
+			*ports_map_fd = map_fd;
+			continue;
+		}
+		close(map_fd);
+		goto err_close_map_fds;
+	}
+
+	err = 0;
+	goto out;
+
+err_close_map_fds:
+	if (*values_map_fd != -1)
+		close(*values_map_fd);
+	if (*ports_map_fd != -1)
+		close(*ports_map_fd);
+	*values_map_fd = -1;
+	*ports_map_fd = -1;
+
+out:
+	close(prog_fd);
+	return err;
+}
+
+int main(int argc, char *argv[])
+{
+	int values_map_fd, ports_map_fd;
+	__u64 tcpipopts;
+	bool firstiter;
+	__u64 prevcnt;
+	__u32 prog_id;
+	char *ports;
+	int err = 0;
+
+	parse_options(argc, argv, &ifindex, &prog_id, &tcpipopts, &ports);
+
+	if (prog_id == 0) {
+		err = bpf_get_link_xdp_id(ifindex, &prog_id, 0);
+		if (err < 0) {
+			fprintf(stderr, "Error: bpf_get_link_xdp_id: %s\n", strerror(-err));
+			goto out;
+		}
+		if (prog_id == 0) {
+			err = syncookie_attach(argv[0], ifindex);
+			if (err < 0)
+				goto out;
+			prog_id = attached_prog_id;
+		}
+	}
+
+	err = syncookie_open_bpf_maps(prog_id, &values_map_fd, &ports_map_fd);
+	if (err < 0)
+		goto out;
+
+	if (ports) {
+		__u16 port_last = 0;
+		__u32 port_idx = 0;
+		char *p = ports;
+
+		fprintf(stderr, "Replacing allowed ports\n");
+
+		while (p && *p != '\0') {
+			char *token = strsep(&p, ",");
+			__u16 port;
+
+			port = parse_arg_ul(argv[0], token, UINT16_MAX);
+			err = bpf_map_update_elem(ports_map_fd, &port_idx, &port, BPF_ANY);
+			if (err != 0) {
+				fprintf(stderr, "Error: bpf_map_update_elem: %s\n", strerror(-err));
+				fprintf(stderr, "Failed to add port %u (index %u)\n",
+					port, port_idx);
+				goto out_close_maps;
+			}
+			fprintf(stderr, "Added port %u\n", port);
+			port_idx++;
+		}
+		err = bpf_map_update_elem(ports_map_fd, &port_idx, &port_last, BPF_ANY);
+		if (err != 0) {
+			fprintf(stderr, "Error: bpf_map_update_elem: %s\n", strerror(-err));
+			fprintf(stderr, "Failed to add the terminator value 0 (index %u)\n",
+				port_idx);
+			goto out_close_maps;
+		}
+	}
+
+	if (tcpipopts) {
+		__u32 key = 0;
+
+		fprintf(stderr, "Replacing TCP/IP options\n");
+
+		err = bpf_map_update_elem(values_map_fd, &key, &tcpipopts, BPF_ANY);
+		if (err != 0) {
+			fprintf(stderr, "Error: bpf_map_update_elem: %s\n", strerror(-err));
+			goto out_close_maps;
+		}
+	}
+
+	if ((ports || tcpipopts) && attached_prog_id == 0)
+		goto out_close_maps;
+
+	prevcnt = 0;
+	firstiter = true;
+	while (true) {
+		__u32 key = 1;
+		__u64 value;
+
+		err = bpf_map_lookup_elem(values_map_fd, &key, &value);
+		if (err != 0) {
+			fprintf(stderr, "Error: bpf_map_lookup_elem: %s\n", strerror(-err));
+			goto out_close_maps;
+		}
+		if (firstiter) {
+			prevcnt = value;
+			firstiter = false;
+		}
+		printf("SYNACKs generated: %llu (total %llu)\n", value - prevcnt, value);
+		prevcnt = value;
+		sleep(1);
+	}
+
+out_close_maps:
+	close(values_map_fd);
+	close(ports_map_fd);
+out:
+	return err == 0 ? 0 : 1;
+}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-19 14:46 ` [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp " Maxim Mikityanskiy
@ 2021-10-19 16:45   ` Eric Dumazet
  2021-10-20 13:16     ` Maxim Mikityanskiy
  2021-10-20 15:56   ` Lorenz Bauer
  1 sibling, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2021-10-19 16:45 UTC (permalink / raw)
  To: Maxim Mikityanskiy, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux



On 10/19/21 7:46 AM, Maxim Mikityanskiy wrote:
> The new helper bpf_tcp_raw_gen_tscookie allows an XDP program to
> generate timestamp cookies (to be used together with SYN cookies) which
> encode different options set by the client in the SYN packet: SACK
> support, ECN support, window scale. These options are encoded in lower
> bits of the timestamp, which will be returned by the client in a
> subsequent ACK packet. The format is the same used by synproxy.
> 
> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  include/net/tcp.h              |  1 +
>  include/uapi/linux/bpf.h       | 27 +++++++++++++++
>  net/core/filter.c              | 38 +++++++++++++++++++++
>  net/ipv4/syncookies.c          | 60 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 27 +++++++++++++++
>  5 files changed, 153 insertions(+)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 1cc96a225848..651820bef6a2 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -564,6 +564,7 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
>  			      u16 *mssp);
>  __u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mss);
>  u64 cookie_init_timestamp(struct request_sock *req, u64 now);
> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr);
>  bool cookie_timestamp_decode(const struct net *net,
>  			     struct tcp_options_received *opt);
>  bool cookie_ecn_ok(const struct tcp_options_received *opt,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e32f72077250..791790b41874 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5053,6 +5053,32 @@ union bpf_attr {
>   *
>   *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
>   *		CONFIG_IPV6 is disabled).
> + *
> + * int bpf_tcp_raw_gen_tscookie(struct tcphdr *th, u32 th_len, __be32 *tsopt, u32 tsopt_len)
> + *	Description
> + *		Try to generate a timestamp cookie which encodes some of the
> + *		flags sent by the client in the SYN packet: SACK support, ECN
> + *		support, window scale. To be used with SYN cookies.
> + *
> + *		*th* points to the start of the TCP header of the client's SYN
> + *		packet, while *th_len* contains the length of the TCP header (at
> + *		least **sizeof**\ (**struct tcphdr**)).
> + *
> + *		*tsopt* points to the output location where to put the resulting
> + *		timestamp values: tsval and tsecr, in the format of the TCP
> + *		timestamp option.
> + *
> + *	Return
> + *		On success, 0.
> + *
> + *		On failure, the returned value is one of the following:
> + *
> + *		**-EINVAL** if the input arguments are invalid.
> + *
> + *		**-ENOENT** if the TCP header doesn't have the timestamp option.
> + *
> + *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
> + *		cookies (CONFIG_SYN_COOKIES is off).
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5238,6 +5264,7 @@ union bpf_attr {
>  	FN(ct_release),			\
>  	FN(tcp_raw_gen_syncookie),	\
>  	FN(tcp_raw_check_syncookie),	\
> +	FN(tcp_raw_gen_tscookie),	\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5f03d4a282a0..73fe20ef7442 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7403,6 +7403,42 @@ static const struct bpf_func_proto bpf_tcp_raw_check_syncookie_proto = {
>  	.arg4_type	= ARG_CONST_SIZE,
>  };
>  
> +BPF_CALL_4(bpf_tcp_raw_gen_tscookie, struct tcphdr *, th, u32, th_len,
> +	   __be32 *, tsopt, u32, tsopt_len)
> +{
> +	int err;
> +
> +#ifdef CONFIG_SYN_COOKIES
> +	if (tsopt_len != sizeof(u64)) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	if (!cookie_init_timestamp_raw(th, &tsopt[0], &tsopt[1])) {
> +		err = -ENOENT;
> +		goto err_out;
> +	}
> +
> +	return 0;
> +err_out:
> +#else
> +	err = -EOPNOTSUPP;
> +#endif
> +	memset(tsopt, 0, tsopt_len);
> +	return err;
> +}
> +
> +static const struct bpf_func_proto bpf_tcp_raw_gen_tscookie_proto = {
> +	.func		= bpf_tcp_raw_gen_tscookie,
> +	.gpl_only	= false,
> +	.pkt_access	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_MEM,
> +	.arg2_type	= ARG_CONST_SIZE,
> +	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg4_type	= ARG_CONST_SIZE,
> +};
> +
>  #endif /* CONFIG_INET */
>  
>  bool bpf_helper_changes_pkt_data(void *func)
> @@ -7825,6 +7861,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_tcp_raw_gen_syncookie_proto;
>  	case BPF_FUNC_tcp_raw_check_syncookie:
>  		return &bpf_tcp_raw_check_syncookie_proto;
> +	case BPF_FUNC_tcp_raw_gen_tscookie:
> +		return &bpf_tcp_raw_gen_tscookie_proto;
>  #endif
>  	default:
>  		return bpf_sk_base_func_proto(func_id);
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 8696dc343ad2..4dd2c7a096eb 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -85,6 +85,66 @@ u64 cookie_init_timestamp(struct request_sock *req, u64 now)
>  	return (u64)ts * (NSEC_PER_SEC / TCP_TS_HZ);
>  }
>  
> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr)
> +{
> +	int length = (th->doff * 4) - sizeof(*th);
> +	u8 wscale = TS_OPT_WSCALE_MASK;
> +	bool option_timestamp = false;
> +	bool option_sack = false;
> +	u32 cookie;
> +	u8 *ptr;
> +
> +	ptr = (u8 *)(th + 1);
> +
> +	while (length > 0) {
> +		u8 opcode = *ptr++;
> +		u8 opsize;
> +
> +		if (opcode == TCPOPT_EOL)
> +			break;
> +		if (opcode == TCPOPT_NOP) {
> +			length--;
> +			continue;
> +		}
> +
> +		if (length < 2)
> +			break;
> +		opsize = *ptr++;
> +		if (opsize < 2)
> +			break;
> +		if (opsize > length)
> +			break;
> +
> +		switch (opcode) {
> +		case TCPOPT_WINDOW:

You must check opsize.

> +			wscale = min_t(u8, *ptr, TCP_MAX_WSCALE);
> +			break;
> +		case TCPOPT_TIMESTAMP:

You must check opsize.

> +			option_timestamp = true;
> +			/* Client's tsval becomes our tsecr. */
> +			*tsecr = cpu_to_be32(get_unaligned_be32(ptr));

Please avoid useless ntohl/htonl dance (even if compiler probably optimizes this)
No need to obfuscate :)

			*tsecr = get_unaligned((__be32 *)ptr);

> +			break;
> +		case TCPOPT_SACK_PERM:
> +			option_sack = true;
> +			break;
> +		}
> +
> +		ptr += opsize - 2;
> +		length -= opsize;
> +	}
> +
> +	if (!option_timestamp)
> +		return false;
> +
> +	cookie = tcp_time_stamp_raw() & ~TSMASK;
> +	cookie |= wscale & TS_OPT_WSCALE_MASK;
> +	if (option_sack)
> +		cookie |= TS_OPT_SACK;
> +	if (th->ece && th->cwr)
> +		cookie |= TS_OPT_ECN;
> +	*tsval = cpu_to_be32(cookie);
> +	return true;
> +}
>  
>  static __u32 secure_tcp_syn_cookie(__be32 saddr, __be32 daddr, __be16 sport,
>  				   __be16 dport, __u32 sseq, __u32 data)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index e32f72077250..791790b41874 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5053,6 +5053,32 @@ union bpf_attr {
>   *
>   *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
>   *		CONFIG_IPV6 is disabled).
> + *
> + * int bpf_tcp_raw_gen_tscookie(struct tcphdr *th, u32 th_len, __be32 *tsopt, u32 tsopt_len)
> + *	Description
> + *		Try to generate a timestamp cookie which encodes some of the
> + *		flags sent by the client in the SYN packet: SACK support, ECN
> + *		support, window scale. To be used with SYN cookies.
> + *
> + *		*th* points to the start of the TCP header of the client's SYN
> + *		packet, while *th_len* contains the length of the TCP header (at
> + *		least **sizeof**\ (**struct tcphdr**)).
> + *
> + *		*tsopt* points to the output location where to put the resulting
> + *		timestamp values: tsval and tsecr, in the format of the TCP
> + *		timestamp option.
> + *
> + *	Return
> + *		On success, 0.
> + *
> + *		On failure, the returned value is one of the following:
> + *
> + *		**-EINVAL** if the input arguments are invalid.
> + *
> + *		**-ENOENT** if the TCP header doesn't have the timestamp option.
> + *
> + *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
> + *		cookies (CONFIG_SYN_COOKIES is off).
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5238,6 +5264,7 @@ union bpf_attr {
>  	FN(ct_release),			\
>  	FN(tcp_raw_gen_syncookie),	\
>  	FN(tcp_raw_check_syncookie),	\
> +	FN(tcp_raw_gen_tscookie),	\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> 


* RE: [PATCH bpf-next 04/10] bpf: Make errors of bpf_tcp_check_syncookie distinguishable
  2021-10-19 14:46 ` [PATCH bpf-next 04/10] bpf: Make errors of bpf_tcp_check_syncookie distinguishable Maxim Mikityanskiy
@ 2021-10-20  3:28   ` John Fastabend
  2021-10-20 13:16     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 48+ messages in thread
From: John Fastabend @ 2021-10-20  3:28 UTC (permalink / raw)
  To: Maxim Mikityanskiy, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux,
	Maxim Mikityanskiy

Maxim Mikityanskiy wrote:
> bpf_tcp_check_syncookie returns errors when SYN cookie generation is
> disabled (EINVAL) or when no cookies were recently generated (ENOENT).
> The same error codes are used for other kinds of errors: invalid
> parameters (EINVAL), invalid packet (EINVAL, ENOENT), bad cookie
> (ENOENT). Such an overlap makes it impossible for a BPF program to
> distinguish different cases that may require different handling.

I'm not sure we can change these errors now. They are embedded in
the helper API. I think a BPF program could uncover the meaning
of the error anyways with some error path handling?

Anyways even if we do change these most of us who run programs
on multiple kernel versions would not be able to rely on them
being one way or the other easily.

> 
> For a BPF program that accelerates generating and checking SYN cookies,
> typical logic looks like this (with current error codes annotated):
> 
> 1. Drop invalid packets (EINVAL, ENOENT).
> 
> 2. Drop packets with bad cookies (ENOENT).
> 
> 3. Pass packets with good cookies (0).
> 
> 4. Pass all packets when cookies are not in use (EINVAL, ENOENT).
> 
> The last point also matches the behavior of cookie_v4_check and
> cookie_v6_check that skip all checks if cookie generation is disabled or
> no cookies were recently generated. Overlapping error codes, however,
> make it impossible to distinguish case 4 from cases 1 and 2.
> 
> The original commit message of commit 399040847084 ("bpf: add helper to
> check for a valid SYN cookie") mentions another use case, though:
> traffic classification, where it's important to distinguish new
> connections from existing ones, and case 4 should be distinguishable
> from case 3.
> 
> To match the requirements of both use cases, this patch reassigns error
> codes of bpf_tcp_check_syncookie and adds missing documentation:
> 
> 1. EINVAL: Invalid packets.
> 
> 2. EACCES: Packets with bad cookies.
> 
> 3. 0: Packets with good cookies.
> 
> 4. ENOENT: Cookies are not in use.
> 
> This way all four cases are easily distinguishable.
> 
> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

At the very least this would need a Fixes tag and should be backported
as a bugfix. Then we at least have a chance that stable and LTS kernels
report the same thing.

[...]

> --- a/net/core/filter.c
> +++ b/net/core/filter.c
 
I'll take a stab at how a program can learn the error cause today.

BPF_CALL_5(bpf_tcp_check_syncookie, struct sock *, sk, void *, iph, u32, iph_len,
	   struct tcphdr *, th, u32, th_len)
{
#ifdef CONFIG_SYN_COOKIES
	u32 cookie;
	int ret;

// BPF program should know it passed bad values and can check
	if (unlikely(!sk || th_len < sizeof(*th)))
		return -EINVAL;

// sk_protocol and sk_state are exposed in sk and can be read directly 
	/* sk_listener() allows TCP_NEW_SYN_RECV, which makes no sense here. */
	if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state != TCP_LISTEN)
		return -EINVAL;

// This is a user space knob right? I think this is a misconfig user can
// check before loading a program with check_syncookie?
	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies)
		return -EINVAL;

// We have th pointer can't we just check?
	if (!th->ack || th->rst || th->syn)
		return -ENOENT;

	if (tcp_synq_no_recent_overflow(sk))
		return -ENOENT;

	cookie = ntohl(th->ack_seq) - 1;

	switch (sk->sk_family) {
	case AF_INET:
// misconfiguration but can be checked.
		if (unlikely(iph_len < sizeof(struct iphdr)))
			return -EINVAL;

		ret = __cookie_v4_check((struct iphdr *)iph, th, cookie);
		break;

#if IS_BUILTIN(CONFIG_IPV6)
	case AF_INET6:
// misconfiguration can check as well
		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
			return -EINVAL;

		ret = __cookie_v6_check((struct ipv6hdr *)iph, th, cookie);
		break;
#endif /* CONFIG_IPV6 */

	default:
		return -EPROTONOSUPPORT;
	}

	if (ret > 0)
		return 0;

	return -ENOENT;
#else
	return -ENOTSUPP;
#endif
}


So I guess my point is we have all the fields we could write a bit
of BPF to find the error cause if necessary. Might be better than
dealing with changing the error code and having to deal with the
differences in kernels. I do see how it would have been better
to get errors correct on the first patch though :/

By the way I haven't got to the next set of patches with the
actual features, but why not push everything above this patch
as fixes in its own series? Then the fixes can get going while
we review the feature.

Thanks,
John


* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-19 14:46 ` [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info Maxim Mikityanskiy
@ 2021-10-20  3:56   ` Kumar Kartikeya Dwivedi
  2021-10-20  9:28     ` Florian Westphal
  2021-10-20 13:18     ` Maxim Mikityanskiy
  2021-10-20  9:46   ` Toke Høiland-Jørgensen
  1 sibling, 2 replies; 48+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-10-20  3:56 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow to query
> connection tracking information of TCP and UDP connections based on
> source and destination IP address and port. The helper returns a pointer
> to struct nf_conn (if the conntrack entry was found), which needs to be
> released with bpf_ct_release.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

The last discussion on this [0] suggested that stable BPF helpers for conntrack
were not desired, hence the recent series [1] to extend kfunc support to modules
and base the conntrack work on top of it, which I'm working on now (supporting
both CT lookup and insert).

[0]: https://lore.kernel.org/bpf/CAADnVQJTJzxzig=1vvAUMXELUoOwm2vXq0ahP4mfhBWGsCm9QA@mail.gmail.com
[1]: https://lore.kernel.org/bpf/CAADnVQKDPG+U-NwoAeNSU5Ef9ZYhhGcgL4wBkFoP-E9h8-XZhw@mail.gmail.com

--
Kartikeya


* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20  3:56   ` Kumar Kartikeya Dwivedi
@ 2021-10-20  9:28     ` Florian Westphal
  2021-10-20  9:48       ` Toke Høiland-Jørgensen
  2021-10-20 13:18     ` Maxim Mikityanskiy
  1 sibling, 1 reply; 48+ messages in thread
From: Florian Westphal @ 2021-10-20  9:28 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Maxim Mikityanskiy, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
	Brendan Jackman, Florent Revest, Joe Stringer, Lorenz Bauer,
	Tariq Toukan, netdev, bpf, clang-built-linux

Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow to query
> > connection tracking information of TCP and UDP connections based on
> > source and destination IP address and port. The helper returns a pointer
> > to struct nf_conn (if the conntrack entry was found), which needs to be
> > released with bpf_ct_release.
> >
> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> 
> The last discussion on this [0] suggested that stable BPF helpers for conntrack
> were not desired, hence the recent series [1] to extend kfunc support to modules
> and base the conntrack work on top of it, which I'm working on now (supporting
> both CT lookup and insert).

This will sabotage the netfilter pipeline and the way things work more
and more 8-(

If you want to use netfilter with ebpf, please have a look at the RFC I
posted and lets work on adding a netfilter specific program type that
can run ebpf programs directly from any of the existing netfilter hook
points.

Thanks.


* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-19 14:46 ` [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info Maxim Mikityanskiy
  2021-10-20  3:56   ` Kumar Kartikeya Dwivedi
@ 2021-10-20  9:46   ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-10-20  9:46 UTC (permalink / raw)
  To: Maxim Mikityanskiy, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux,
	Maxim Mikityanskiy


> +#if IS_BUILTIN(CONFIG_NF_CONNTRACK)

This makes the helpers all but useless on distro kernels; I don't think
this is the right way to go about it. As Kumar mentioned, he's working
on an approach using kfuncs in modules; maybe you can collaborate on
that?

-Toke



* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20  9:28     ` Florian Westphal
@ 2021-10-20  9:48       ` Toke Høiland-Jørgensen
  2021-10-20  9:58         ` Florian Westphal
  0 siblings, 1 reply; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-10-20  9:48 UTC (permalink / raw)
  To: Florian Westphal, Kumar Kartikeya Dwivedi
  Cc: Maxim Mikityanskiy, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
	Brendan Jackman, Florent Revest, Joe Stringer, Lorenz Bauer,
	Tariq Toukan, netdev, bpf, clang-built-linux

Florian Westphal <fw@strlen.de> writes:

> Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
>> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow to query
>> > connection tracking information of TCP and UDP connections based on
>> > source and destination IP address and port. The helper returns a pointer
>> > to struct nf_conn (if the conntrack entry was found), which needs to be
>> > released with bpf_ct_release.
>> >
>> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
>> 
>> The last discussion on this [0] suggested that stable BPF helpers for conntrack
>> were not desired, hence the recent series [1] to extend kfunc support to modules
>> and base the conntrack work on top of it, which I'm working on now (supporting
>> both CT lookup and insert).
>
> This will sabotage netfilter pipeline and the way things work more and
> more 8-(

Why?

> If you want to use netfilter with ebpf, please have a look at the RFC
> I posted and lets work on adding a netfilter specific program type
> that can run ebpf programs directly from any of the existing netfilter
> hook points.

Accelerating netfilter using BPF is a worthy goal in itself, but I also
think having the ability to lookup into conntrack from XDP is useful for
cases where someone wants to bypass the stack entirely (for accelerating
packet forwarding, say). I don't think these goals are in conflict
either, what makes you say they are?

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20  9:48       ` Toke Høiland-Jørgensen
@ 2021-10-20  9:58         ` Florian Westphal
  2021-10-20 12:21           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 48+ messages in thread
From: Florian Westphal @ 2021-10-20  9:58 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, Kumar Kartikeya Dwivedi, Maxim Mikityanskiy,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Florian Westphal <fw@strlen.de> writes:
> 
> > Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> >> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow to query
> >> > connection tracking information of TCP and UDP connections based on
> >> > source and destination IP address and port. The helper returns a pointer
> >> > to struct nf_conn (if the conntrack entry was found), which needs to be
> >> > released with bpf_ct_release.
> >> >
> >> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> >> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> >> 
> >> The last discussion on this [0] suggested that stable BPF helpers for conntrack
> >> were not desired, hence the recent series [1] to extend kfunc support to modules
> >> and base the conntrack work on top of it, which I'm working on now (supporting
> >> both CT lookup and insert).
> >
> > This will sabotage the netfilter pipeline and the way things work
> > more and more 8-(
> 
> Why?

Lookups should be fine.  Insertions are the problem.

NAT hooks are expected to execute before the insertion into the
conntrack table.

If you insert before, NAT hooks won't execute, i.e.
rules that use dnat/redirect/masquerade have no effect.

> > If you want to use netfilter with ebpf, please have a look at the RFC
> > I posted and let's work on adding a netfilter specific program type
> > that can run ebpf programs directly from any of the existing netfilter
> > hook points.
> 
> Accelerating netfilter using BPF is a worthy goal in itself, but I also
> think having the ability to lookup into conntrack from XDP is useful for
> cases where someone wants to bypass the stack entirely (for accelerating
> packet forwarding, say). I don't think these goals are in conflict
> either, what makes you say they are?

Lookup is fine, I don't see fundamental issues with XDP-based bypass,
there are flowtables that also bypass the classic forward path via the
netfilter ingress hook (the first packet needs to go via the classic
path to pass through all filter + nat rules and is offloaded to HW or SW
via the 'flow add' statement in nftables).

I don't think there is anything that stands in the way of replicating
this via XDP.
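For reference, a minimal nftables flowtable setup of the kind described
above might look like this (a sketch; table, flowtable, and device names
are placeholders, and a counter or rate condition could gate the
'flow add' statement):

```
table inet filter {
	flowtable ft {
		hook ingress priority 0;
		devices = { eth0, eth1 };
	}
	chain forward {
		type filter hook forward priority 0; policy accept;
		ip protocol { tcp, udp } flow add @ft
	}
}
```

The first packet of each flow still traverses the full forward chain;
only once 'flow add' runs is the flow handled from the ingress hook.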

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20  9:58         ` Florian Westphal
@ 2021-10-20 12:21           ` Toke Høiland-Jørgensen
  2021-10-20 12:44             ` Florian Westphal
  0 siblings, 1 reply; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-10-20 12:21 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Florian Westphal, Kumar Kartikeya Dwivedi, Maxim Mikityanskiy,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> Florian Westphal <fw@strlen.de> writes:
>> 
>> > Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>> >> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
>> >> > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow to query
>> >> > connection tracking information of TCP and UDP connections based on
>> >> > source and destination IP address and port. The helper returns a pointer
>> >> > to struct nf_conn (if the conntrack entry was found), which needs to be
>> >> > released with bpf_ct_release.
>> >> >
>> >> > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> >> > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
>> >> 
>> >> The last discussion on this [0] suggested that stable BPF helpers for conntrack
>> >> were not desired, hence the recent series [1] to extend kfunc support to modules
>> >> and base the conntrack work on top of it, which I'm working on now (supporting
>> >> both CT lookup and insert).
>> >
>> > This will sabotage the netfilter pipeline and the way things work
>> > more and more 8-(
>> 
>> Why?
>
> Lookups should be fine.  Insertions are the problem.
>
> NAT hooks are expected to execute before the insertion into the
> conntrack table.
>
> If you insert before, NAT hooks won't execute, i.e.
> rules that use dnat/redirect/masquerade have no effect.

Well yes, if you insert the wrong state into the conntrack table, you're
going to get wrong behaviour. That's sorta expected, there are lots of
things XDP can do to disrupt the packet flow (like just dropping the
packets :)).

>> > If you want to use netfilter with ebpf, please have a look at the RFC
>> > I posted and let's work on adding a netfilter specific program type
>> > that can run ebpf programs directly from any of the existing netfilter
>> > hook points.
>> 
>> Accelerating netfilter using BPF is a worthy goal in itself, but I also
>> think having the ability to lookup into conntrack from XDP is useful for
>> cases where someone wants to bypass the stack entirely (for accelerating
>> packet forwarding, say). I don't think these goals are in conflict
>> either, what makes you say they are?
>
> Lookup is fine, I don't see fundamental issues with XDP-based bypass,
> there are flowtables that also bypass the classic forward path via the
> netfilter ingress hook (the first packet needs to go via the classic
> path to pass through all filter + nat rules and is offloaded to HW or SW
> via the 'flow add' statement in nftables).
>
> I don't think there is anything that stands in the way of replicating
> this via XDP.

What I want to be able to do is write an XDP program that does the following:

1. Parse the packet header and determine if it's a packet type we know
   how to handle. If not, just return XDP_PASS and let the stack deal
   with corner cases.

2. If we know how to handle the packet (say, it's TCP or UDP), do a
   lookup into conntrack to figure out if there's state for it and we
   need to do things like NAT.

3. If we need to NAT, rewrite the packet based on the information we got
   back from conntrack.

4. Update the conntrack state to be consistent with the packet, and then
   redirect it out the destination interface.

I.e., in the common case the packet doesn't go through the stack at all;
but we need to make conntrack aware that we processed the packet so the
entry doesn't expire (and any state related to the flow gets updated).
Ideally we should also be able to create new state for a flow we haven't
seen before.

This requires updating of state, but I see no reason why this shouldn't
be possible?
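The four steps above could be modelled, very roughly, as the following
userspace C sketch (the real thing would be an XDP program in BPF C
using the proposed conntrack helpers; the struct, field, and verdict
names here are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum verdict { VERDICT_PASS, VERDICT_REDIRECT };

/* Stand-in for the conntrack state a lookup helper would return;
 * the real program would get a struct nf_conn pointer back. */
struct ct_entry {
	bool needs_nat;
	unsigned int out_ifindex;
};

/* Steps 1-4 from the mail: parse, look up, optionally NAT, redirect. */
static enum verdict handle_packet(bool known_proto, struct ct_entry *ct)
{
	if (!known_proto)
		return VERDICT_PASS;   /* 1. let the stack handle corner cases */
	if (!ct)
		return VERDICT_PASS;   /* 2. no state yet: classic path sets it up */
	if (ct->needs_nat) {
		/* 3. rewrite the packet based on the ct info (elided) */
	}
	/* 4. refresh the ct entry (elided), then redirect out
	 * ct->out_ifindex */
	return VERDICT_REDIRECT;
}
```

Only step 4, updating conntrack state from XDP, has no helper today;
steps 1-3 map onto existing XDP parsing plus the proposed lookup.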

-Toke


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20 12:21           ` Toke Høiland-Jørgensen
@ 2021-10-20 12:44             ` Florian Westphal
  2021-10-20 20:54               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 48+ messages in thread
From: Florian Westphal @ 2021-10-20 12:44 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, Kumar Kartikeya Dwivedi, Maxim Mikityanskiy,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > Lookups should be fine.  Insertions are the problem.
> >
> > NAT hooks are expected to execute before the insertion into the
> > conntrack table.
> >
> > If you insert before, NAT hooks won't execute, i.e.
> > rules that use dnat/redirect/masquerade have no effect.
> 
> Well yes, if you insert the wrong state into the conntrack table, you're
> going to get wrong behaviour. That's sorta expected, there are lots of
> things XDP can do to disrupt the packet flow (like just dropping the
> packets :)).

Sure, but I'm not sure I understand the use case.

Insertion at the XDP layer turns off netfilter's NAT capability, so it's
incompatible with the classic forwarding path.

If that's fine, why do you need to insert into the conntrack table to
begin with?  The entire infrastructure it's designed for is disabled...

> > I don't think there is anything that stands in the way of replicating
> > this via XDP.
> 
> What I want to be able to do is write an XDP program that does the following:
> 
> 1. Parse the packet header and determine if it's a packet type we know
>    how to handle. If not, just return XDP_PASS and let the stack deal
>    with corner cases.
> 
> 2. If we know how to handle the packet (say, it's TCP or UDP), do a
>    lookup into conntrack to figure out if there's state for it and we
>    need to do things like NAT.
> 
> 3. If we need to NAT, rewrite the packet based on the information we got
>    back from conntrack.

You could already do that by storing that info in bpf maps.
The ctnetlink event generated on conntrack insertion contains the NAT
mapping information, so you could have a userspace daemon that
intercepts those to update the map.

> 4. Update the conntrack state to be consistent with the packet, and then
>    redirect it out the destination interface.
> 
> I.e., in the common case the packet doesn't go through the stack at all;
> but we need to make conntrack aware that we processed the packet so the
> entry doesn't expire (and any state related to the flow gets updated).

In the HW offload case, conntrack is bypassed completely. There is an
IPS_(HW)_OFFLOAD_BIT that prevents the flow from expiring.

> Ideally we should also be able to create new state for a flow we haven't
> seen before.

The way HW offload was intended to work is to allow users to express
what flows should be offloaded via 'flow add' expression in nftables, so
they can e.g. use byte counters or rate estimators etc. to make such
a decision.  So initial packet always passes via normal stack.

This is also needed to consider e.g. XFRM -- nft_flow_offload.c won't
offload if the packet has a secpath attached (i.e., will get encrypted
later).

I suspect we'd want a way to notify/call an ebpf program instead, so we
can avoid the ctnetlink -> userspace -> update dance and do the XDP
'flow bypass information update' from inside the kernel, plus an
ebpf/XDP reimplementation of the nf flow table (it uses the netfilter
ingress hook on the configured devices; everything it does should be
doable from XDP).

> This requires updating of state, but I see no reason why this shouldn't
> be possible?

Updating ct->status is problematic, there would have to be extra checks
that prevent non-atomic writes and toggling of special bits such as
CONFIRMED, TEMPLATE or DYING.  Adding a helper to toggle something
specific, e.g. the offload state bit, should be okay.
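The point about a narrow helper rather than raw status writes can be
illustrated with a userspace sketch (the bit positions are invented,
not the kernel's IPS_* values, and ct_set_offload is a hypothetical
helper name):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative bit positions only, not the real IPS_* constants. */
#define STATUS_OFFLOAD_BIT 3UL
#define STATUS_DYING_BIT   9UL

/* A hypothetical narrow helper: atomically sets only the offload bit,
 * so a program cannot (even by accident) flip special bits such as
 * CONFIRMED, TEMPLATE or DYING, and the read-modify-write is atomic
 * rather than a racy load/or/store sequence. */
static void ct_set_offload(atomic_ulong *status)
{
	atomic_fetch_or(status, 1UL << STATUS_OFFLOAD_BIT);
}
```

Exposing ct->status as a writable field would permit arbitrary,
possibly non-atomic updates; a helper like this constrains both which
bits change and how.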

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-19 16:45   ` Eric Dumazet
@ 2021-10-20 13:16     ` Maxim Mikityanskiy
  0 siblings, 0 replies; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-20 13:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

On 2021-10-19 19:45, Eric Dumazet wrote:
> 
> 
> On 10/19/21 7:46 AM, Maxim Mikityanskiy wrote:
>> The new helper bpf_tcp_raw_gen_tscookie allows an XDP program to
>> generate timestamp cookies (to be used together with SYN cookies) which
>> encode different options set by the client in the SYN packet: SACK
>> support, ECN support, window scale. These options are encoded in lower
>> bits of the timestamp, which will be returned by the client in a
>> subsequent ACK packet. The format is the same used by synproxy.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>>   include/net/tcp.h              |  1 +
>>   include/uapi/linux/bpf.h       | 27 +++++++++++++++
>>   net/core/filter.c              | 38 +++++++++++++++++++++
>>   net/ipv4/syncookies.c          | 60 ++++++++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h | 27 +++++++++++++++
>>   5 files changed, 153 insertions(+)
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 1cc96a225848..651820bef6a2 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -564,6 +564,7 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
>>   			      u16 *mssp);
>>   __u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mss);
>>   u64 cookie_init_timestamp(struct request_sock *req, u64 now);
>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr);
>>   bool cookie_timestamp_decode(const struct net *net,
>>   			     struct tcp_options_received *opt);
>>   bool cookie_ecn_ok(const struct tcp_options_received *opt,
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index e32f72077250..791790b41874 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -5053,6 +5053,32 @@ union bpf_attr {
>>    *
>>    *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
>>    *		CONFIG_IPV6 is disabled).
>> + *
>> + * int bpf_tcp_raw_gen_tscookie(struct tcphdr *th, u32 th_len, __be32 *tsopt, u32 tsopt_len)
>> + *	Description
>> + *		Try to generate a timestamp cookie which encodes some of the
>> + *		flags sent by the client in the SYN packet: SACK support, ECN
>> + *		support, window scale. To be used with SYN cookies.
>> + *
>> + *		*th* points to the start of the TCP header of the client's SYN
>> + *		packet, while *th_len* contains the length of the TCP header (at
>> + *		least **sizeof**\ (**struct tcphdr**)).
>> + *
>> + *		*tsopt* points to the output location where to put the resulting
>> + *		timestamp values: tsval and tsecr, in the format of the TCP
>> + *		timestamp option.
>> + *
>> + *	Return
>> + *		On success, 0.
>> + *
>> + *		On failure, the returned value is one of the following:
>> + *
>> + *		**-EINVAL** if the input arguments are invalid.
>> + *
>> + *		**-ENOENT** if the TCP header doesn't have the timestamp option.
>> + *
>> + *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
>> + *		cookies (CONFIG_SYN_COOKIES is off).
>>    */
>>   #define __BPF_FUNC_MAPPER(FN)		\
>>   	FN(unspec),			\
>> @@ -5238,6 +5264,7 @@ union bpf_attr {
>>   	FN(ct_release),			\
>>   	FN(tcp_raw_gen_syncookie),	\
>>   	FN(tcp_raw_check_syncookie),	\
>> +	FN(tcp_raw_gen_tscookie),	\
>>   	/* */
>>   
>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 5f03d4a282a0..73fe20ef7442 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -7403,6 +7403,42 @@ static const struct bpf_func_proto bpf_tcp_raw_check_syncookie_proto = {
>>   	.arg4_type	= ARG_CONST_SIZE,
>>   };
>>   
>> +BPF_CALL_4(bpf_tcp_raw_gen_tscookie, struct tcphdr *, th, u32, th_len,
>> +	   __be32 *, tsopt, u32, tsopt_len)
>> +{
>> +	int err;
>> +
>> +#ifdef CONFIG_SYN_COOKIES
>> +	if (tsopt_len != sizeof(u64)) {
>> +		err = -EINVAL;
>> +		goto err_out;
>> +	}
>> +
>> +	if (!cookie_init_timestamp_raw(th, &tsopt[0], &tsopt[1])) {
>> +		err = -ENOENT;
>> +		goto err_out;
>> +	}
>> +
>> +	return 0;
>> +err_out:
>> +#else
>> +	err = -EOPNOTSUPP;
>> +#endif
>> +	memset(tsopt, 0, tsopt_len);
>> +	return err;
>> +}
>> +
>> +static const struct bpf_func_proto bpf_tcp_raw_gen_tscookie_proto = {
>> +	.func		= bpf_tcp_raw_gen_tscookie,
>> +	.gpl_only	= false,
>> +	.pkt_access	= true,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_PTR_TO_MEM,
>> +	.arg2_type	= ARG_CONST_SIZE,
>> +	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
>> +	.arg4_type	= ARG_CONST_SIZE,
>> +};
>> +
>>   #endif /* CONFIG_INET */
>>   
>>   bool bpf_helper_changes_pkt_data(void *func)
>> @@ -7825,6 +7861,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>   		return &bpf_tcp_raw_gen_syncookie_proto;
>>   	case BPF_FUNC_tcp_raw_check_syncookie:
>>   		return &bpf_tcp_raw_check_syncookie_proto;
>> +	case BPF_FUNC_tcp_raw_gen_tscookie:
>> +		return &bpf_tcp_raw_gen_tscookie_proto;
>>   #endif
>>   	default:
>>   		return bpf_sk_base_func_proto(func_id);
>> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
>> index 8696dc343ad2..4dd2c7a096eb 100644
>> --- a/net/ipv4/syncookies.c
>> +++ b/net/ipv4/syncookies.c
>> @@ -85,6 +85,66 @@ u64 cookie_init_timestamp(struct request_sock *req, u64 now)
>>   	return (u64)ts * (NSEC_PER_SEC / TCP_TS_HZ);
>>   }
>>   
>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr)
>> +{
>> +	int length = (th->doff * 4) - sizeof(*th);
>> +	u8 wscale = TS_OPT_WSCALE_MASK;
>> +	bool option_timestamp = false;
>> +	bool option_sack = false;
>> +	u32 cookie;
>> +	u8 *ptr;
>> +
>> +	ptr = (u8 *)(th + 1);
>> +
>> +	while (length > 0) {
>> +		u8 opcode = *ptr++;
>> +		u8 opsize;
>> +
>> +		if (opcode == TCPOPT_EOL)
>> +			break;
>> +		if (opcode == TCPOPT_NOP) {
>> +			length--;
>> +			continue;
>> +		}
>> +
>> +		if (length < 2)
>> +			break;
>> +		opsize = *ptr++;
>> +		if (opsize < 2)
>> +			break;
>> +		if (opsize > length)
>> +			break;
>> +
>> +		switch (opcode) {
>> +		case TCPOPT_WINDOW:
> 
> You must check opsize.
> 
>> +			wscale = min_t(u8, *ptr, TCP_MAX_WSCALE);
>> +			break;
>> +		case TCPOPT_TIMESTAMP:
> 
> You must check opsize.

Ack.

>> +			option_timestamp = true;
>> +			/* Client's tsval becomes our tsecr. */
>> +			*tsecr = cpu_to_be32(get_unaligned_be32(ptr));
> 
> Please avoid useless ntohl/htonl dance (even if compiler probably optimizes this)
> No need to obfuscate :)

No obfuscation intended - I thought I was clearer this way. I can change it.
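For the record, the two forms are equivalent; a userspace model of the
kernel accessors shows the round trip yields the same __be32 value on
either endianness (function names here are illustrative):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models cpu_to_be32(get_unaligned_be32(ptr)): decode the big-endian
 * bytes to host order, then re-encode to big-endian. */
static uint32_t load_be32_roundtrip(const uint8_t *p)
{
	uint32_t host = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
			((uint32_t)p[2] << 8) | (uint32_t)p[3];
	return htonl(host);           /* back to big-endian */
}

/* Models get_unaligned((__be32 *)ptr): just read the bytes as-is,
 * using memcpy so an unaligned pointer is safe. */
static uint32_t load_be32_direct(const uint8_t *p)
{
	uint32_t be;
	memcpy(&be, p, sizeof(be));
	return be;
}
```

The byte-swap pair cancels out, so the direct unaligned load is the
clearer spelling.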

Thanks for reviewing!

> 
> 			*tsecr = get_unaligned((__be32 *)ptr);
> 
>> +			break;
>> +		case TCPOPT_SACK_PERM:
>> +			option_sack = true;
>> +			break;
>> +		}
>> +
>> +		ptr += opsize - 2;
>> +		length -= opsize;
>> +	}
>> +
>> +	if (!option_timestamp)
>> +		return false;
>> +
>> +	cookie = tcp_time_stamp_raw() & ~TSMASK;
>> +	cookie |= wscale & TS_OPT_WSCALE_MASK;
>> +	if (option_sack)
>> +		cookie |= TS_OPT_SACK;
>> +	if (th->ece && th->cwr)
>> +		cookie |= TS_OPT_ECN;
>> +	*tsval = cpu_to_be32(cookie);
>> +	return true;
>> +}
>>   
>>   static __u32 secure_tcp_syn_cookie(__be32 saddr, __be32 daddr, __be16 sport,
>>   				   __be16 dport, __u32 sseq, __u32 data)
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index e32f72077250..791790b41874 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -5053,6 +5053,32 @@ union bpf_attr {
>>    *
>>    *		**-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
>>    *		CONFIG_IPV6 is disabled).
>> + *
>> + * int bpf_tcp_raw_gen_tscookie(struct tcphdr *th, u32 th_len, __be32 *tsopt, u32 tsopt_len)
>> + *	Description
>> + *		Try to generate a timestamp cookie which encodes some of the
>> + *		flags sent by the client in the SYN packet: SACK support, ECN
>> + *		support, window scale. To be used with SYN cookies.
>> + *
>> + *		*th* points to the start of the TCP header of the client's SYN
>> + *		packet, while *th_len* contains the length of the TCP header (at
>> + *		least **sizeof**\ (**struct tcphdr**)).
>> + *
>> + *		*tsopt* points to the output location where to put the resulting
>> + *		timestamp values: tsval and tsecr, in the format of the TCP
>> + *		timestamp option.
>> + *
>> + *	Return
>> + *		On success, 0.
>> + *
>> + *		On failure, the returned value is one of the following:
>> + *
>> + *		**-EINVAL** if the input arguments are invalid.
>> + *
>> + *		**-ENOENT** if the TCP header doesn't have the timestamp option.
>> + *
>> + *		**-EOPNOTSUPP** if the kernel configuration does not enable SYN
>> + *		cookies (CONFIG_SYN_COOKIES is off).
>>    */
>>   #define __BPF_FUNC_MAPPER(FN)		\
>>   	FN(unspec),			\
>> @@ -5238,6 +5264,7 @@ union bpf_attr {
>>   	FN(ct_release),			\
>>   	FN(tcp_raw_gen_syncookie),	\
>>   	FN(tcp_raw_check_syncookie),	\
>> +	FN(tcp_raw_gen_tscookie),	\
>>   	/* */
>>   
>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 04/10] bpf: Make errors of bpf_tcp_check_syncookie distinguishable
  2021-10-20  3:28   ` John Fastabend
@ 2021-10-20 13:16     ` Maxim Mikityanskiy
  2021-10-20 15:26       ` Lorenz Bauer
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-20 13:16 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Lorenz Bauer, Tariq Toukan, netdev, bpf, clang-built-linux

On 2021-10-20 06:28, John Fastabend wrote:
> Maxim Mikityanskiy wrote:
>> bpf_tcp_check_syncookie returns errors when SYN cookie generation is
>> disabled (EINVAL) or when no cookies were recently generated (ENOENT).
>> The same error codes are used for other kinds of errors: invalid
>> parameters (EINVAL), invalid packet (EINVAL, ENOENT), bad cookie
>> (ENOENT). Such an overlap makes it impossible for a BPF program to
>> distinguish different cases that may require different handling.
> 
> I'm not sure we can change these errors now. They are embedded in
> the helper API. I think a BPF program could uncover the meaning
> of the error anyways with some error path handling?
> 
> Anyways even if we do change these most of us who run programs
> on multiple kernel versions would not be able to rely on them
> being one way or the other easily.

The thing is, the error codes aren't really documented:

  * 0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
  * error otherwise.

My patch doesn't break this assumption.

Practically speaking, there are two use cases of bpf_tcp_check_syncookie 
that I know about: traffic classification (find NEW ACK packets with the 
right cookie) and SYN flood protection.

For traffic classification, it's not important what error code we get. 
The logic for ACK packets is as follows:

1. Connection established => ESTABLISHED. Otherwise,

2. bpf_tcp_check_syncookie returns 0 => NEW. Otherwise,

3. INVALID (regardless of the specific error code).

My patch doesn't break this use case.
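As a compact C model of the classification logic above (names are
illustrative; in practice this would live in a TC/XDP program):

```c
#include <assert.h>
#include <stdbool.h>

enum tcp_class { CLASS_ESTABLISHED, CLASS_NEW, CLASS_INVALID };

/* check_ret models the return value of bpf_tcp_check_syncookie:
 * 0 on a valid SYN cookie ACK, a negative errno otherwise. */
static enum tcp_class classify_ack(bool established, int check_ret)
{
	if (established)
		return CLASS_ESTABLISHED;     /* 1. connection already exists */
	if (check_ret == 0)
		return CLASS_NEW;             /* 2. valid cookie: new connection */
	return CLASS_INVALID;                 /* 3. any error code at all */
}
```

Since step 3 collapses every error into INVALID, this use case is
indifferent to which errno the helper returns.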

>>
>> For a BPF program that accelerates generating and checking SYN cookies,
>> typical logic looks like this (with current error codes annotated):
>>
>> 1. Drop invalid packets (EINVAL, ENOENT).
>>
>> 2. Drop packets with bad cookies (ENOENT).
>>
>> 3. Pass packets with good cookies (0).
>>
>> 4. Pass all packets when cookies are not in use (EINVAL, ENOENT).

Now that I'm reflecting on it again, it would make more sense to drop 
packets in case 4: it's a new packet, it's an ACK, and we don't expect 
any cookies.

>> The last point also matches the behavior of cookie_v4_check and
>> cookie_v6_check that skip all checks if cookie generation is disabled or
>> no cookies were recently generated. Overlapping error codes, however,
>> make it impossible to distinguish case 4 from cases 1 and 2.

If so, we don't strictly need to distinguish case 4 from 1 and 2. The 
logic for ACK packets is similar:

1. Connection established => XDP_PASS. Otherwise,

2. bpf_tcp_check_syncookie returns 0 => XDP_PASS. Otherwise,

3. XDP_DROP.

So, on one hand, it looks like both use cases can be implemented without 
this patch. On the other hand, changing error codes to more meaningful 
shouldn't break existing programs and can have its benefits, for 
example, in debugging or in statistic counting.

>> The original commit message of commit 399040847084 ("bpf: add helper to
>> check for a valid SYN cookie") mentions another use case, though:
>> traffic classification, where it's important to distinguish new
>> connections from existing ones, and case 4 should be distinguishable
>> from case 3.
>>
>> To match the requirements of both use cases, this patch reassigns error
>> codes of bpf_tcp_check_syncookie and adds missing documentation:
>>
>> 1. EINVAL: Invalid packets.
>>
>> 2. EACCES: Packets with bad cookies.
>>
>> 3. 0: Packets with good cookies.
>>
>> 4. ENOENT: Cookies are not in use.
>>
>> This way all four cases are easily distinguishable.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> 
> At the very least this would need a Fixes tag and should be backported
> as a bug fix. Then we at least have a chance that stable and LTS kernels
> report the same thing.

That's a good idea.

> [...]
> 
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>   
> I'll take a stab at how a program can learn the error cause today.
> 
> BPF_CALL_5(bpf_tcp_check_syncookie, struct sock *, sk, void *, iph, u32, iph_len,
> 	   struct tcphdr *, th, u32, th_len)
> {
> #ifdef CONFIG_SYN_COOKIES
> 	u32 cookie;
> 	int ret;
> 
> // A BPF program should know it passed bad values and can check
> 	if (unlikely(!sk || th_len < sizeof(*th)))
> 		return -EINVAL;
> 
> // sk_protocol and sk_state are exposed in sk and can be read directly
> 	/* sk_listener() allows TCP_NEW_SYN_RECV, which makes no sense here. */
> 	if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state != TCP_LISTEN)
> 		return -EINVAL;
> 
> // This is a user space knob right? I think this is a misconfig the
> // user can check before loading a program with check_syncookie?

bpf_tcp_check_syncookie was initially introduced for the classification 
use case, to be able to classify new ACK packets with the right cookie 
as NEW. The XDP program classifies traffic regardless of whether SYN 
cookies are enabled. If we need to check the sysctl in userspace, it 
means we need two XDP programs (or additional trickery passing this 
value through a map).

> 	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies)
> 		return -EINVAL;
> 
> // We have th pointer can't we just check?

Yes, most of the checks can be repeated in BPF, but it's obvious it's 
slower to do all the checks twice.

> 	if (!th->ack || th->rst || th->syn)
> 		return -ENOENT;
> 
> 	if (tcp_synq_no_recent_overflow(sk))
> 		return -ENOENT;

This specific check can't be done in BPF.

> 
> 	cookie = ntohl(th->ack_seq) - 1;
> 
> 	switch (sk->sk_family) {
> 	case AF_INET:
> // misconfiguration but can be checked.
> 		if (unlikely(iph_len < sizeof(struct iphdr)))
> 			return -EINVAL;
> 
> 		ret = __cookie_v4_check((struct iphdr *)iph, th, cookie);
> 		break;
> 
> #if IS_BUILTIN(CONFIG_IPV6)
> 	case AF_INET6:
> // misconfiguration can check as well
> 		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
> 			return -EINVAL;
> 
> 		ret = __cookie_v6_check((struct ipv6hdr *)iph, th, cookie);
> 		break;
> #endif /* CONFIG_IPV6 */
> 
> 	default:
> 		return -EPROTONOSUPPORT;
> 	}
> 
> 	if (ret > 0)
> 		return 0;
> 
> 	return -ENOENT;
> #else
> 	return -ENOTSUPP;
> #endif
> }
> 
> 
> So I guess my point is we have all the fields we could write a bit
> of BPF to find the error cause if necessary. Might be better than
> dealing with changing the error code and having to deal with the
> differences in kernels. I do see how it would have been better
> to get errors correct on the first patch though :/
> 
> By the way I haven't got to the next set of patches with the
> actual features, but why not push everything above this patch
> as fixes in its own series. Then the fixes can get going while
> we review the feature.

OK, I'll respin the fixes separately, while the discussion on the 
approach to expose conntrack is going on.

Thanks for reviewing!

> 
> Thanks,
> John
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20  3:56   ` Kumar Kartikeya Dwivedi
  2021-10-20  9:28     ` Florian Westphal
@ 2021-10-20 13:18     ` Maxim Mikityanskiy
  2021-10-20 19:17       ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-20 13:18 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

On 2021-10-20 06:56, Kumar Kartikeya Dwivedi wrote:
> On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
>> The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow to query
>> connection tracking information of TCP and UDP connections based on
>> source and destination IP address and port. The helper returns a pointer
>> to struct nf_conn (if the conntrack entry was found), which needs to be
>> released with bpf_ct_release.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> 
> The last discussion on this [0] suggested that stable BPF helpers for conntrack
> were not desired, hence the recent series [1] to extend kfunc support to modules
> and base the conntrack work on top of it, which I'm working on now (supporting
> both CT lookup and insert).

If you have conntrack lookup, I can base my solution on top of yours. As 
it supports modules, it's even better. What is the current status of 
your work? When do you plan to submit a series? Please add me to Cc when 
you do.

Thanks for reviewing!

> [0]: https://lore.kernel.org/bpf/CAADnVQJTJzxzig=1vvAUMXELUoOwm2vXq0ahP4mfhBWGsCm9QA@mail.gmail.com
> [1]: https://lore.kernel.org/bpf/CAADnVQKDPG+U-NwoAeNSU5Ef9ZYhhGcgL4wBkFoP-E9h8-XZhw@mail.gmail.com
> 
> --
> Kartikeya
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 04/10] bpf: Make errors of bpf_tcp_check_syncookie distinguishable
  2021-10-20 13:16     ` Maxim Mikityanskiy
@ 2021-10-20 15:26       ` Lorenz Bauer
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenz Bauer @ 2021-10-20 15:26 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Tariq Toukan, Networking, bpf,
	clang-built-linux

On Wed, 20 Oct 2021 at 14:16, Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2021-10-20 06:28, John Fastabend wrote:
> > Maxim Mikityanskiy wrote:
> >> bpf_tcp_check_syncookie returns errors when SYN cookie generation is
> >> disabled (EINVAL) or when no cookies were recently generated (ENOENT).
> >> The same error codes are used for other kinds of errors: invalid
> >> parameters (EINVAL), invalid packet (EINVAL, ENOENT), bad cookie
> >> (ENOENT). Such an overlap makes it impossible for a BPF program to
> >> distinguish different cases that may require different handling.
> >
> > I'm not sure we can change these errors now. They are embedded in
> > the helper API. I think a BPF program could uncover the meaning
> > of the error anyways with some error path handling?
> >
> > Anyways even if we do change these most of us who run programs
> > on multiple kernel versions would not be able to rely on them
> > being one way or the other easily.
>
> The thing is, the error codes aren't really documented:
>
>   * 0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
>   * error otherwise.

Yes, I kept this vague so that there is some wiggle room. FWIW your
proposed change would not break our BPF. Same for the examples
included in the kernel source itself. That is no guarantee of course.

Personally, I'm a bit on the fence regarding a backport of this.
Either this is a legitimate extension of the API and we don't
backport, or it's a bug (how?) and then we should backport.
-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-19 14:46 ` [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp " Maxim Mikityanskiy
  2021-10-19 16:45   ` Eric Dumazet
@ 2021-10-20 15:56   ` Lorenz Bauer
  2021-10-20 16:16     ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 48+ messages in thread
From: Lorenz Bauer @ 2021-10-20 15:56 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Tariq Toukan, Networking, bpf,
	clang-built-linux

On Tue, 19 Oct 2021 at 15:49, Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> The new helper bpf_tcp_raw_gen_tscookie allows an XDP program to
> generate timestamp cookies (to be used together with SYN cookies) which
> encode different options set by the client in the SYN packet: SACK
> support, ECN support, window scale. These options are encoded in lower
> bits of the timestamp, which will be returned by the client in a
> subsequent ACK packet. The format is the same used by synproxy.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  include/net/tcp.h              |  1 +
>  include/uapi/linux/bpf.h       | 27 +++++++++++++++
>  net/core/filter.c              | 38 +++++++++++++++++++++
>  net/ipv4/syncookies.c          | 60 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 27 +++++++++++++++
>  5 files changed, 153 insertions(+)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 1cc96a225848..651820bef6a2 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -564,6 +564,7 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
>                               u16 *mssp);
>  __u32 cookie_v4_init_sequence(const struct sk_buff *skb, __u16 *mss);
>  u64 cookie_init_timestamp(struct request_sock *req, u64 now);
> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr);
>  bool cookie_timestamp_decode(const struct net *net,
>                              struct tcp_options_received *opt);
>  bool cookie_ecn_ok(const struct tcp_options_received *opt,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e32f72077250..791790b41874 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5053,6 +5053,32 @@ union bpf_attr {
>   *
>   *             **-EPROTONOSUPPORT** if the IP version is not 4 or 6 (or 6, but
>   *             CONFIG_IPV6 is disabled).
> + *
> + * int bpf_tcp_raw_gen_tscookie(struct tcphdr *th, u32 th_len, __be32 *tsopt, u32 tsopt_len)

flags which must be 0?

> + *     Description
> + *             Try to generate a timestamp cookie which encodes some of the
> + *             flags sent by the client in the SYN packet: SACK support, ECN
> + *             support, window scale. To be used with SYN cookies.
> + *
> + *             *th* points to the start of the TCP header of the client's SYN
> + *             packet, while *th_len* contains the length of the TCP header (at
> + *             least **sizeof**\ (**struct tcphdr**)).
> + *
> + *             *tsopt* points to the output location where to put the resulting
> + *             timestamp values: tsval and tsecr, in the format of the TCP
> + *             timestamp option.
> + *
> + *     Return
> + *             On success, 0.
> + *
> + *             On failure, the returned value is one of the following:
> + *
> + *             **-EINVAL** if the input arguments are invalid.
> + *
> + *             **-ENOENT** if the TCP header doesn't have the timestamp option.
> + *
> + *             **-EOPNOTSUPP** if the kernel configuration does not enable SYN
> + *             cookies (CONFIG_SYN_COOKIES is off).
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -5238,6 +5264,7 @@ union bpf_attr {
>         FN(ct_release),                 \
>         FN(tcp_raw_gen_syncookie),      \
>         FN(tcp_raw_check_syncookie),    \
> +       FN(tcp_raw_gen_tscookie),       \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5f03d4a282a0..73fe20ef7442 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -7403,6 +7403,42 @@ static const struct bpf_func_proto bpf_tcp_raw_check_syncookie_proto = {
>         .arg4_type      = ARG_CONST_SIZE,
>  };
>
> +BPF_CALL_4(bpf_tcp_raw_gen_tscookie, struct tcphdr *, th, u32, th_len,
> +          __be32 *, tsopt, u32, tsopt_len)
> +{
> +       int err;

Missing check for th_len?

> +
> +#ifdef CONFIG_SYN_COOKIES
> +       if (tsopt_len != sizeof(u64)) {

sizeof(u32) * 2? That u64 isn't really relevant here.

> +               err = -EINVAL;
> +               goto err_out;
> +       }
> +
> +       if (!cookie_init_timestamp_raw(th, &tsopt[0], &tsopt[1])) {
> +               err = -ENOENT;
> +               goto err_out;
> +       }
> +
> +       return 0;
> +err_out:
> +#else
> +       err = -EOPNOTSUPP;
> +#endif
> +       memset(tsopt, 0, tsopt_len);
> +       return err;
> +}
> +
> +static const struct bpf_func_proto bpf_tcp_raw_gen_tscookie_proto = {
> +       .func           = bpf_tcp_raw_gen_tscookie,
> +       .gpl_only       = false,
> +       .pkt_access     = true,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_MEM,
> +       .arg2_type      = ARG_CONST_SIZE,
> +       .arg3_type      = ARG_PTR_TO_UNINIT_MEM,
> +       .arg4_type      = ARG_CONST_SIZE,
> +};
> +
>  #endif /* CONFIG_INET */
>
>  bool bpf_helper_changes_pkt_data(void *func)
> @@ -7825,6 +7861,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_tcp_raw_gen_syncookie_proto;
>         case BPF_FUNC_tcp_raw_check_syncookie:
>                 return &bpf_tcp_raw_check_syncookie_proto;
> +       case BPF_FUNC_tcp_raw_gen_tscookie:
> +               return &bpf_tcp_raw_gen_tscookie_proto;
>  #endif
>         default:
>                 return bpf_sk_base_func_proto(func_id);
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 8696dc343ad2..4dd2c7a096eb 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -85,6 +85,66 @@ u64 cookie_init_timestamp(struct request_sock *req, u64 now)
>         return (u64)ts * (NSEC_PER_SEC / TCP_TS_HZ);
>  }
>
> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr)

I'm probably missing context, Is there something in this function that
means you can't implement it in BPF?

Lorenz

--
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-20 15:56   ` Lorenz Bauer
@ 2021-10-20 16:16     ` Toke Høiland-Jørgensen
  2021-10-22 16:56       ` Maxim Mikityanskiy
  2021-11-01 11:14       ` Maxim Mikityanskiy
  0 siblings, 2 replies; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-10-20 16:16 UTC (permalink / raw)
  To: Lorenz Bauer, Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Tariq Toukan, Networking, bpf,
	clang-built-linux

Lorenz Bauer <lmb@cloudflare.com> writes:

>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr)
>
> I'm probably missing context, Is there something in this function that
> means you can't implement it in BPF?

I was about to reply with some other comments but upon closer inspection
I ended up at the same conclusion: this helper doesn't seem to be needed
at all?

-Toke



* Re: [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers
  2021-10-19 14:46 ` [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers Maxim Mikityanskiy
@ 2021-10-20 18:01   ` Joe Stringer
  2021-10-21 17:19     ` Maxim Mikityanskiy
  2021-10-21  1:06   ` Alexei Starovoitov
  1 sibling, 1 reply; 48+ messages in thread
From: Joe Stringer @ 2021-10-20 18:01 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan,
	Networking, bpf, clang-built-linux

Hi, just one comment related to the discussion on patch 7.

On Tue, Oct 19, 2021 at 7:49 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

<snip>

> +
> +       value = 0; // Flags.
> +       ct = bpf_ct_lookup_tcp(ctx, &tup, tup_size, BPF_F_CURRENT_NETNS, &value);
> +       if (ct) {
> +               unsigned long status = ct->status;
> +
> +               bpf_ct_release(ct);
> +               if (status & IPS_CONFIRMED_BIT)
> +                       return XDP_PASS;
> +       } else if (value != -ENOENT) {
> +               return XDP_ABORTED;
> +       }

Is this the only reason that you wish to expose conntrack lookup
functions to the API?

You should be able to find out whether the TCP session is established
by doing a TCP socket lookup and checking sk->state.


* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20 13:18     ` Maxim Mikityanskiy
@ 2021-10-20 19:17       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 48+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-10-20 19:17 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

On Wed, Oct 20, 2021 at 06:48:25PM IST, Maxim Mikityanskiy wrote:
> On 2021-10-20 06:56, Kumar Kartikeya Dwivedi wrote:
> > On Tue, Oct 19, 2021 at 08:16:52PM IST, Maxim Mikityanskiy wrote:
> > > The new helpers (bpf_ct_lookup_tcp and bpf_ct_lookup_udp) allow querying
> > > connection tracking information of TCP and UDP connections based on
> > > source and destination IP address and port. The helper returns a pointer
> > > to struct nf_conn (if the conntrack entry was found), which needs to be
> > > released with bpf_ct_release.
> > >
> > > Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> > > Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> >
> > The last discussion on this [0] suggested that stable BPF helpers for conntrack
> > were not desired, hence the recent series [1] to extend kfunc support to modules
> > and base the conntrack work on top of it, which I'm working on now (supporting
> > both CT lookup and insert).
>
> If you have conntrack lookup, I can base my solution on top of yours. As it
> supports modules, it's even better. What is the current status of your work?
> When do you plan to submit a series? Please add me to Cc when you do.
>

Great, I'll post the lookup stuff separately next week, and Cc you.

Thanks!

> Thanks for reviewing!
>
> > [0]: https://lore.kernel.org/bpf/CAADnVQJTJzxzig=1vvAUMXELUoOwm2vXq0ahP4mfhBWGsCm9QA@mail.gmail.com
> > [1]: https://lore.kernel.org/bpf/CAADnVQKDPG+U-NwoAeNSU5Ef9ZYhhGcgL4wBkFoP-E9h8-XZhw@mail.gmail.com
> >
> > --
> > Kartikeya
> >
>

--
Kartikeya


* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20 12:44             ` Florian Westphal
@ 2021-10-20 20:54               ` Toke Høiland-Jørgensen
  2021-10-20 22:55                 ` David Ahern
  2021-10-21  7:36                 ` Florian Westphal
  0 siblings, 2 replies; 48+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-10-20 20:54 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Florian Westphal, Kumar Kartikeya Dwivedi, Maxim Mikityanskiy,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

Florian Westphal <fw@strlen.de> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> > Lookups should be fine.  Insertions are the problem.
>> >
>> > NAT hooks are expected to execute before the insertion into the
>> > conntrack table.
>> >
>> > If you insert before, NAT hooks won't execute, i.e.
>> > rules that use dnat/redirect/masquerade have no effect.
>> 
>> Well yes, if you insert the wrong state into the conntrack table, you're
>> going to get wrong behaviour. That's sorta expected, there are lots of
>> things XDP can do to disrupt the packet flow (like just dropping the
>> packets :)).
>
> Sure, but I'm not sure I understand the use case.
>
> Insertion at XDP layer turns off netfilter's NAT capability, so it's
> incompatible with the classic forwarding path.
>
> If that's fine, why do you need to insert into the conntrack table to
> begin with?  The entire infrastructure it's designed for is disabled...

One of the major selling points of XDP is that you can reuse the
existing kernel infrastructure instead of having to roll your own. So
sure, one could implement their own conntrack using BPF maps (as indeed,
e.g., Cilium has done), but why do that when you can take advantage of
the existing one in the kernel? Same reason we have the bpf_fib_lookup()
helper...

>> > I don't think there is anything that stands in the way of replicating
>> > this via XDP.
>> 
>> What I want to be able to do is write an XDP program that does the following:
>> 
>> 1. Parse the packet header and determine if it's a packet type we know
>>    how to handle. If not, just return XDP_PASS and let the stack deal
>>    with corner cases.
>> 
>> 2. If we know how to handle the packet (say, it's TCP or UDP), do a
>>    lookup into conntrack to figure out if there's state for it and we
>>    need to do things like NAT.
>> 
>> 3. If we need to NAT, rewrite the packet based on the information we got
>>    back from conntrack.
>
> You could already do that by storing that info in bpf maps. The
> ctnetlink event generated on conntrack insertion contains the NAT
> mapping information, so you could have a userspace daemon that
> intercepts those to update the map.

Sure, but see above.

>> 4. Update the conntrack state to be consistent with the packet, and then
>>    redirect it out the destination interface.
>> 
>> I.e., in the common case the packet doesn't go through the stack at all;
>> but we need to make conntrack aware that we processed the packet so the
>> entry doesn't expire (and any state related to the flow gets updated).
>
> In the HW offload case, conntrack is bypassed completely. There is an
> IPS_(HW)_OFFLOAD_BIT that prevents the flow from expiring.

That's comparable in execution semantics (stack is bypassed entirely),
but not in control plane semantics (we lookup from XDP instead of
pushing flows down to an offload).

>> Ideally we should also be able to create new state for a flow we haven't
>> seen before.
>
> The way HW offload was intended to work is to allow users to express
> what flows should be offloaded via 'flow add' expression in nftables, so
> they can e.g. use byte counters or rate estimators etc. to make such
> a decision.  So initial packet always passes via normal stack.
>
> This is also needed to consider e.g. XFRM -- nft_flow_offload.c won't
> offload if the packet has a secpath attached (i.e., will get encrypted
> later).
>
> I suspect we'd want a way to notify/call an ebpf program instead so we
> can avoid the ctnetlink -> userspace -> update dance and do the XDP
> 'flow bypass information update' from inside the kernel and ebpf/XDP
> reimplementation of the nf flow table (it uses the netfilter ingress
> hook on the configured devices; everything it does should be doable
> from XDP).

But the point is exactly that we don't have to duplicate the state into
BPF, we can make XDP look it up directly.

>> This requires updating of state, but I see no reason why this shouldn't
>> be possible?
>
> Updating ct->status is problematic, there would have to be extra checks
> that prevent non-atomic writes and toggling of special bits such as
> CONFIRMED, TEMPLATE or DYING.  Adding a helper to toggle something
> specific, e.g. the offload state bit, should be okay.

We can certainly constrain the update so it's not possible to get into
an unsafe state. The primary use case is accelerating the common case,
punting to the stack is fine for corner cases.

-Toke



* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20 20:54               ` Toke Høiland-Jørgensen
@ 2021-10-20 22:55                 ` David Ahern
  2021-10-21  7:36                 ` Florian Westphal
  1 sibling, 0 replies; 48+ messages in thread
From: David Ahern @ 2021-10-20 22:55 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Florian Westphal
  Cc: Kumar Kartikeya Dwivedi, Maxim Mikityanskiy, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Eric Dumazet,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
	Brendan Jackman, Florent Revest, Joe Stringer, Lorenz Bauer,
	Tariq Toukan, netdev, bpf, clang-built-linux

On 10/20/21 2:54 PM, Toke Høiland-Jørgensen wrote:
>> Sure, but I'm not sure I understand the use case.
>>
>> Insertion at XDP layer turns off netfilter's NAT capability, so it's
>> incompatible with the classic forwarding path.
>>
>> If that's fine, why do you need to insert into the conntrack table to
>> begin with?  The entire infrastructure it's designed for is disabled...
> One of the major selling points of XDP is that you can reuse the
> existing kernel infrastructure instead of having to roll your own. So
> sure, one could implement their own conntrack using BPF maps (as indeed,
> e.g., Cilium has done), but why do that when you can take advantage of
> the existing one in the kernel? Same reason we have the bpf_fib_lookup()
> helper...
> 

Exactly, and a key point is that it allows consistency between the XDP
fast path and the full-stack slow path, e.g. when the BPF program is
removed or defers a flow to the full stack for some reason.


* Re: [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers
  2021-10-19 14:46 ` [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers Maxim Mikityanskiy
  2021-10-20 18:01   ` Joe Stringer
@ 2021-10-21  1:06   ` Alexei Starovoitov
  2021-10-21 17:31     ` Maxim Mikityanskiy
  1 sibling, 1 reply; 48+ messages in thread
From: Alexei Starovoitov @ 2021-10-21  1:06 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

On Tue, Oct 19, 2021 at 05:46:55PM +0300, Maxim Mikityanskiy wrote:
> This commit adds a sample for the new BPF helpers: bpf_ct_lookup_tcp,
> bpf_tcp_raw_gen_syncookie and bpf_tcp_raw_check_syncookie.
> 
> samples/bpf/syncookie_kern.c is a BPF program that generates SYN cookies
> on allowed TCP ports and sends SYNACKs to clients, accelerating synproxy
> iptables module.
> 
> samples/bpf/syncookie_user.c is a userspace control application that
> allows to configure the following options in runtime: list of allowed
> ports, MSS, window scale, TTL.
> 
> samples/bpf/syncookie_test.sh is a script that demonstrates the setup of
> synproxy with XDP acceleration.
> 
> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>  samples/bpf/.gitignore        |   1 +
>  samples/bpf/Makefile          |   3 +
>  samples/bpf/syncookie_kern.c  | 591 ++++++++++++++++++++++++++++++++++
>  samples/bpf/syncookie_test.sh |  55 ++++
>  samples/bpf/syncookie_user.c  | 388 ++++++++++++++++++++++
>  5 files changed, 1038 insertions(+)
>  create mode 100644 samples/bpf/syncookie_kern.c
>  create mode 100755 samples/bpf/syncookie_test.sh
>  create mode 100644 samples/bpf/syncookie_user.c

Tests should be in selftests/bpf.
Samples are for samples only.

> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB

Isn't it deprecated?
LICENSES/deprecated/Linux-OpenIB

> +	// Don't combine additions to avoid 32-bit overflow.

c++ style comment?
did you run checkpatch?


* Re: [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info
  2021-10-20 20:54               ` Toke Høiland-Jørgensen
  2021-10-20 22:55                 ` David Ahern
@ 2021-10-21  7:36                 ` Florian Westphal
  1 sibling, 0 replies; 48+ messages in thread
From: Florian Westphal @ 2021-10-21  7:36 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Florian Westphal, Kumar Kartikeya Dwivedi, Maxim Mikityanskiy,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> Florian Westphal <fw@strlen.de> writes:
> 
> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> > Lookups should be fine.  Insertions are the problem.
> >> >
> >> > NAT hooks are expected to execute before the insertion into the
> >> > conntrack table.
> >> >
> >> > If you insert before, NAT hooks won't execute, i.e.
> >> > rules that use dnat/redirect/masquerade have no effect.
> >> 
> >> Well yes, if you insert the wrong state into the conntrack table, you're
> >> going to get wrong behaviour. That's sorta expected, there are lots of
> >> things XDP can do to disrupt the packet flow (like just dropping the
> >> packets :)).
> >
> > Sure, but I'm not sure I understand the use case.
> >
> > Insertion at XDP layer turns off netfilter's NAT capability, so it's
> > incompatible with the classic forwarding path.
> >
> > If that's fine, why do you need to insert into the conntrack table to
> > begin with?  The entire infrastructure it's designed for is disabled...
> 
> One of the major selling points of XDP is that you can reuse the
> existing kernel infrastructure instead of having to roll your own. So
> sure, one could implement their own conntrack using BPF maps (as indeed,
> e.g., Cilium has done), but why do that when you can take advantage of
> the existing one in the kernel? Same reason we have the bpf_fib_lookup()
> helper...

Insertion into conntrack via ebpf seems bad to me precisely because it
bypasses the existing infra.

In the bypass scenario you're envisioning, who is responsible for
fastpath-or-not decision?

> > In the HW offload case, conntrack is bypassed completely. There is an
> > IPS_(HW)_OFFLOAD_BIT that prevents the flow from expiring.
> 
> That's comparable in execution semantics (stack is bypassed entirely),
> but not in control plane semantics (we lookup from XDP instead of
> pushing flows down to an offload).

I'm not following.  As soon as you do insertion via XDP, the existing
control plane (*tables rulesets, xfrm and so on) becomes irrelevant.

Say, e.g., a user has an iptables ruleset that disables conntrack for
udp dport 53 to avoid conntrack overhead for a local resolver cache.

No longer relevant: ebpf overrides it, or whatever generates the ebpf
prog needs to emulate the existing config.

> > I suspect we'd want a way to notify/call an ebpf program instead so we
> > can avoid the ctnetlink -> userspace -> update dance and do the XDP
> > 'flow bypass information update' from inside the kernel and ebpf/XDP
> > reimplementation of the nf flow table (it uses the netfilter ingress
> > hook on the configured devices; everyhing it does should be doable
> > from XDP).
> 
> But the point is exactly that we don't have to duplicate the state into
> BPF, we can make XDP look it up directly.

Normally for fast bypass I'd expect that the bypass infra would want to
access all info in one lookup, but conntrack only gives you the NAT
transformation, so you'll also need a sk lookup and possibly a FIB
lookup later to get the route.
Also maybe an xfrm lookup as well if your bypass infra needs to support
ipsec.

So I neither understand the need for conntrack lookup (*for fast bypass use
case*) nor the need for insert IFF the control plane we have is to be
respected.


* Re: [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers
  2021-10-20 18:01   ` Joe Stringer
@ 2021-10-21 17:19     ` Maxim Mikityanskiy
  0 siblings, 0 replies; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-21 17:19 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Lorenz Bauer, Tariq Toukan, Networking, bpf,
	clang-built-linux

On 2021-10-20 21:01, Joe Stringer wrote:
> Hi,  just one comment related to the discussion on patch 7.
> 
> On Tue, Oct 19, 2021 at 7:49 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
> 
> <snip>
> 
>> +
>> +       value = 0; // Flags.
>> +       ct = bpf_ct_lookup_tcp(ctx, &tup, tup_size, BPF_F_CURRENT_NETNS, &value);
>> +       if (ct) {
>> +               unsigned long status = ct->status;
>> +
>> +               bpf_ct_release(ct);
>> +               if (status & IPS_CONFIRMED_BIT)
>> +                       return XDP_PASS;
>> +       } else if (value != -ENOENT) {
>> +               return XDP_ABORTED;
>> +       }
> 
> Is this the only reason that you wish to expose conntrack lookup
> functions to the API?
> 
> You should be able to find out whether the TCP session is established
> by doing a TCP socket lookup and checking sk->state.

It's not possible to look up a socket, because there is no socket. The
traffic is forwarded through the firewall machine that runs synproxy and 
this XDP program.


* Re: [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers
  2021-10-21  1:06   ` Alexei Starovoitov
@ 2021-10-21 17:31     ` Maxim Mikityanskiy
  2021-10-21 18:50       ` Alexei Starovoitov
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-21 17:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan, netdev,
	bpf, clang-built-linux

On 2021-10-21 04:06, Alexei Starovoitov wrote:
> On Tue, Oct 19, 2021 at 05:46:55PM +0300, Maxim Mikityanskiy wrote:
>> This commit adds a sample for the new BPF helpers: bpf_ct_lookup_tcp,
>> bpf_tcp_raw_gen_syncookie and bpf_tcp_raw_check_syncookie.
>>
>> samples/bpf/syncookie_kern.c is a BPF program that generates SYN cookies
>> on allowed TCP ports and sends SYNACKs to clients, accelerating synproxy
>> iptables module.
>>
>> samples/bpf/syncookie_user.c is a userspace control application that
>> allows to configure the following options in runtime: list of allowed
>> ports, MSS, window scale, TTL.
>>
>> samples/bpf/syncookie_test.sh is a script that demonstrates the setup of
>> synproxy with XDP acceleration.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
>> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
>> ---
>>   samples/bpf/.gitignore        |   1 +
>>   samples/bpf/Makefile          |   3 +
>>   samples/bpf/syncookie_kern.c  | 591 ++++++++++++++++++++++++++++++++++
>>   samples/bpf/syncookie_test.sh |  55 ++++
>>   samples/bpf/syncookie_user.c  | 388 ++++++++++++++++++++++
>>   5 files changed, 1038 insertions(+)
>>   create mode 100644 samples/bpf/syncookie_kern.c
>>   create mode 100755 samples/bpf/syncookie_test.sh
>>   create mode 100644 samples/bpf/syncookie_user.c
> 
> Tests should be in selftests/bpf.
> Samples are for samples only.

It's not a test; please don't be confused by the name of
syncookie_test.sh - it's more like a demo script.

syncookie_user.c and syncookie_kern.c are 100% a sample: they show how 
to use the new helpers and are themselves a more or less 
feature-complete solution for protecting against SYN floods. 
syncookie_test.sh should probably be named syncookie_demo.sh; it 
demonstrates how to bring the pieces together.

These files aren't meant to be a unit test for the new helpers; their 
purpose is to show the usage.

> 
>> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> 
> Isn't it deprecated?
> LICENSES/deprecated/Linux-OpenIB

Honestly, I had no idea, I just used our template. I'll ask whoever is 
responsible for the license.

If it's deprecated, what should be used instead?

> 
>> +	// Don't combine additions to avoid 32-bit overflow.
> 
> c++ style comment?
> did you run checkpatch?

Sure I did, and it doesn't complain about such comments. If such 
comments are a problem, please tell me, but I also saw them in other BPF 
samples.

Thanks for reviewing!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers
  2021-10-21 17:31     ` Maxim Mikityanskiy
@ 2021-10-21 18:50       ` Alexei Starovoitov
  0 siblings, 0 replies; 48+ messages in thread
From: Alexei Starovoitov @ 2021-10-21 18:50 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Lorenz Bauer, Tariq Toukan,
	Network Development, bpf, Clang-Built-Linux ML

On Thu, Oct 21, 2021 at 10:31 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2021-10-21 04:06, Alexei Starovoitov wrote:
> > On Tue, Oct 19, 2021 at 05:46:55PM +0300, Maxim Mikityanskiy wrote:
> >> This commit adds a sample for the new BPF helpers: bpf_ct_lookup_tcp,
> >> bpf_tcp_raw_gen_syncookie and bpf_tcp_raw_check_syncookie.
> >>
> >> samples/bpf/syncookie_kern.c is a BPF program that generates SYN cookies
> >> on allowed TCP ports and sends SYNACKs to clients, accelerating synproxy
> >> iptables module.
> >>
> >> samples/bpf/syncookie_user.c is a userspace control application that
> >> allows configuring the following options at runtime: list of allowed
> >> ports, MSS, window scale, TTL.
> >>
> >> samples/bpf/syncookie_test.sh is a script that demonstrates the setup of
> >> synproxy with XDP acceleration.
> >>
> >> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
> >> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
> >> ---
> >>   samples/bpf/.gitignore        |   1 +
> >>   samples/bpf/Makefile          |   3 +
> >>   samples/bpf/syncookie_kern.c  | 591 ++++++++++++++++++++++++++++++++++
> >>   samples/bpf/syncookie_test.sh |  55 ++++
> >>   samples/bpf/syncookie_user.c  | 388 ++++++++++++++++++++++
> >>   5 files changed, 1038 insertions(+)
> >>   create mode 100644 samples/bpf/syncookie_kern.c
> >>   create mode 100755 samples/bpf/syncookie_test.sh
> >>   create mode 100644 samples/bpf/syncookie_user.c
> >
> > Tests should be in selftests/bpf.
> > Samples are for samples only.
>
> It's not a test, please don't be confused by the name of
> syncookie_test.sh - it's more like a demo script.
>
> syncookie_user.c and syncookie_kern.c are 100% a sample, they show how
> to use the new helpers and are themselves a more or less
> feature-complete solution to protect from SYN flood. syncookie_test.sh
> should probably be named syncookie_demo.sh, it demonstrates how to bring
> pieces together.
>
> These files aren't aimed to be a unit test for the new helpers, their
> purpose is to show the usage.

Please convert it to a selftest.
Sooner or later we will convert all samples/bpf into tests and delete that dir.


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-20 16:16     ` Toke Høiland-Jørgensen
@ 2021-10-22 16:56       ` Maxim Mikityanskiy
  2021-10-27  8:34         ` Lorenz Bauer
  2021-11-01 11:14       ` Maxim Mikityanskiy
  1 sibling, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-10-22 16:56 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Lorenz Bauer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Tariq Toukan, Networking, bpf,
	clang-built-linux

On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
> Lorenz Bauer <lmb@cloudflare.com> writes:
> 
>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr)
>>
>> I'm probably missing context. Is there something in this function that
>> means you can't implement it in BPF?
> 
> I was about to reply with some other comments but upon closer inspection
> I ended up at the same conclusion: this helper doesn't seem to be needed
> at all?

tcp_time_stamp_raw() uses ktime_get_ns(), while bpf_ktime_get_ns() uses 
ktime_get_mono_fast_ns(). Is it fine to use ktime_get_mono_fast_ns() 
instead of ktime_get_ns()? I'm a bit worried about this note in 
Documentation/core-api/timekeeping.rst:

 > most drivers should never call them,
 > since the time is allowed to jump under certain conditions.


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-22 16:56       ` Maxim Mikityanskiy
@ 2021-10-27  8:34         ` Lorenz Bauer
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenz Bauer @ 2021-10-27  8:34 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Eric Dumazet,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
	Brendan Jackman, Florent Revest, Joe Stringer, Tariq Toukan,
	Networking, bpf, clang-built-linux

On Fri, 22 Oct 2021 at 17:56, Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> tcp_time_stamp_raw() uses ktime_get_ns(), while bpf_ktime_get_ns() uses
> ktime_get_mono_fast_ns(). Is it fine to use ktime_get_mono_fast_ns()
> instead of ktime_get_ns()? I'm a bit worried about this note in
> Documentation/core-api/timekeeping.rst:
>
>  > most drivers should never call them,
>  > since the time is allowed to jump under certain conditions.

That depends on what happens when the timestamp is "off". Since you're
sending this value over the network, I doubt that the two methods will
show a difference.

Lorenz

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-10-20 16:16     ` Toke Høiland-Jørgensen
  2021-10-22 16:56       ` Maxim Mikityanskiy
@ 2021-11-01 11:14       ` Maxim Mikityanskiy
  2021-11-03  2:10         ` Yonghong Song
  1 sibling, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-11-01 11:14 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Lorenz Bauer, Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Eric Dumazet,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
	Brendan Jackman, Florent Revest, Joe Stringer, Tariq Toukan,
	Networking, bpf, clang-built-linux

On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
> Lorenz Bauer <lmb@cloudflare.com> writes:
> 
>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, __be32 *tsecr)
>>
>> I'm probably missing context, Is there something in this function that
>> means you can't implement it in BPF?
> 
> I was about to reply with some other comments but upon closer inspection
> I ended up at the same conclusion: this helper doesn't seem to be needed
> at all?

After trying to put this code into BPF (replacing the underlying 
ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues with 
passing the verifier.

In addition to comparing ptr to end, I had to add checks that compare 
ptr to data_end, because the verifier can't deduce that end <= data_end. 
More branches will add a certain slowdown (not measured).

A more serious issue is the overall program complexity. Even though the 
loop over the TCP options has an upper bound, and the pointer advances 
by at least one byte every iteration, I had to limit the total number of 
iterations artificially. The maximum number of iterations that makes the 
verifier happy is 10. With more iterations, I have the following error:

BPF program is too large. Processed 1000001 insn

processed 1000001 insns (limit 1000000) max_states_per_insn 29 
total_states 35489 peak_states 596 mark_read 45

I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the accumulated 
number of instructions that the verifier can process across all 
branches, is that right? It doesn't look realistic that my program can 
run 1 million instructions in a single run, but it might be that if you 
take all possible flows and add up the instructions from these flows, 
the total will exceed 1 million.

The limit of at most 10 TCP options might not be enough, given that 
valid packets are permitted to include more than 10 NOPs. The 
alternative of using bpf_load_hdr_opt and calling it three times doesn't 
look good either, because it would be about three times slower than 
going over the options once. So maybe having a helper for this is better 
than trying to fit it into BPF?

One more interesting fact is the time it takes for the verifier to 
check my program. If the loop is limited to 10 iterations, verification 
is pretty fast, but if I increase the number to 11 iterations, it takes 
several minutes for the verifier to reach 1 million instructions and 
print the error. I also tried grouping the NOPs in an inner loop to 
count only 10 real options, and the verifier has been running for a few 
hours without any response. Is that normal? Commit c04c0d2b968a ("bpf: 
increase complexity limit and maximum program size") says it shouldn't 
take more than one second in any case.

Thanks,
Max


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-01 11:14       ` Maxim Mikityanskiy
@ 2021-11-03  2:10         ` Yonghong Song
  2021-11-03 14:02           ` Maxim Mikityanskiy
  0 siblings, 1 reply; 48+ messages in thread
From: Yonghong Song @ 2021-11-03  2:10 UTC (permalink / raw)
  To: Maxim Mikityanskiy, Toke Høiland-Jørgensen,
	Lorenz Bauer, Alexei Starovoitov
  Cc: Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	John Fastabend, KP Singh, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Jesper Dangaard Brouer, Nathan Chancellor, Nick Desaulniers,
	Brendan Jackman, Florent Revest, Joe Stringer, Tariq Toukan,
	Networking, bpf, clang-built-linux



On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>
>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, 
>>>> __be32 *tsecr)
>>>
>>> I'm probably missing context, Is there something in this function that
>>> means you can't implement it in BPF?
>>
>> I was about to reply with some other comments but upon closer inspection
>> I ended up at the same conclusion: this helper doesn't seem to be needed
>> at all?
> 
> After trying to put this code into BPF (replacing the underlying 
> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues with 
> passing the verifier.
> 
> In addition to comparing ptr to end, I had to add checks that compare 
> ptr to data_end, because the verifier can't deduce that end <= data_end. 
> More branches will add a certain slowdown (not measured).
> 
> A more serious issue is the overall program complexity. Even though the 
> loop over the TCP options has an upper bound, and the pointer advances 
> by at least one byte every iteration, I had to limit the total number of 
> iterations artificially. The maximum number of iterations that makes the 
> verifier happy is 10. With more iterations, I have the following error:
> 
> BPF program is too large. Processed 1000001 insn
> 
> processed 1000001 insns (limit 1000000) max_states_per_insn 29 
> total_states 35489 peak_states 596 mark_read 45
> 
> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the accumulated 
> amount of instructions that the verifier can process in all branches, is 
> that right? It doesn't look realistic that my program can run 1 million 
> instructions in a single run, but it might be that if you take all 
> possible flows and add up the instructions from these flows, it will 
> exceed 1 million.
> 
> The limitation of maximum 10 TCP options might be not enough, given that 
> valid packets are permitted to include more than 10 NOPs. An alternative 
> of using bpf_load_hdr_opt and calling it three times doesn't look good 
> either, because it will be about three times slower than going over the 
> options once. So maybe having a helper for that is better than trying to 
> fit it into BPF?
> 
> One more interesting fact is the time that it takes for the verifier to 
> check my program. If it's limited to 10 iterations, it does it pretty 
> fast, but if I try to increase the number to 11 iterations, it takes 
> several minutes for the verifier to reach 1 million instructions and 
> print the error then. I also tried grouping the NOPs in an inner loop to 
> count only 10 real options, and the verifier has been running for a few 
> hours without any response. Is it normal? 

Maxim, this may expose a verifier bug. Do you have a reproducer I can 
access? I would like to debug this to see what is the root cause. Thanks!

> Commit c04c0d2b968a ("bpf: 
> increase complexity limit and maximum program size") says it shouldn't 
> take more than one second in any case.
> 
> Thanks,
> Max


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-03  2:10         ` Yonghong Song
@ 2021-11-03 14:02           ` Maxim Mikityanskiy
  2021-11-09  7:11             ` Yonghong Song
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-11-03 14:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Toke Høiland-Jørgensen, Lorenz Bauer,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux

On 2021-11-03 04:10, Yonghong Song wrote:
> 
> 
> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>
>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, 
>>>>> __be32 *tsecr)
>>>>
>>>> I'm probably missing context, Is there something in this function that
>>>> means you can't implement it in BPF?
>>>
>>> I was about to reply with some other comments but upon closer inspection
>>> I ended up at the same conclusion: this helper doesn't seem to be needed
>>> at all?
>>
>> After trying to put this code into BPF (replacing the underlying 
>> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues with 
>> passing the verifier.
>>
>> In addition to comparing ptr to end, I had to add checks that compare 
>> ptr to data_end, because the verifier can't deduce that end <= 
>> data_end. More branches will add a certain slowdown (not measured).
>>
>> A more serious issue is the overall program complexity. Even though 
>> the loop over the TCP options has an upper bound, and the pointer 
>> advances by at least one byte every iteration, I had to limit the 
>> total number of iterations artificially. The maximum number of 
>> iterations that makes the verifier happy is 10. With more iterations, 
>> I have the following error:
>>
>> BPF program is too large. Processed 1000001 insn
>>
>> processed 1000001 insns (limit 1000000) max_states_per_insn 29 
>> total_states 35489 peak_states 596 mark_read 45
>>
>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>> accumulated amount of instructions that the verifier can process in 
>> all branches, is that right? It doesn't look realistic that my program 
>> can run 1 million instructions in a single run, but it might be that 
>> if you take all possible flows and add up the instructions from these 
>> flows, it will exceed 1 million.
>>
>> The limitation of maximum 10 TCP options might be not enough, given 
>> that valid packets are permitted to include more than 10 NOPs. An 
>> alternative of using bpf_load_hdr_opt and calling it three times 
>> doesn't look good either, because it will be about three times slower 
>> than going over the options once. So maybe having a helper for that is 
>> better than trying to fit it into BPF?
>>
>> One more interesting fact is the time that it takes for the verifier 
>> to check my program. If it's limited to 10 iterations, it does it 
>> pretty fast, but if I try to increase the number to 11 iterations, it 
>> takes several minutes for the verifier to reach 1 million instructions 
>> and print the error then. I also tried grouping the NOPs in an inner 
>> loop to count only 10 real options, and the verifier has been running 
>> for a few hours without any response. Is it normal? 
> 
> Maxim, this may expose a verifier bug. Do you have a reproducer I can 
> access? I would like to debug this to see what is the root cause. Thanks!

Thanks, I appreciate your help in debugging it. The reproducer is based 
on the modified XDP program from patch 10 in this series. You'll need to 
apply at least patches 6, 7 and 8 from this series to get the new BPF 
helpers needed for the XDP program (tell me if that's a problem; I can 
try to remove the usage of the new helpers, but it will affect the 
program length and may produce different results in the verifier).

See the C code of the program that passes the verifier (compiled with 
clang version 12.0.0-1ubuntu1) at the bottom of this email. If you 
increase the loop bound from 10 to at least 11 in 
cookie_init_timestamp_raw(), it fails the verifier after a few minutes. 
If you apply this tiny change, it fails the verifier after about 3 hours:

--- a/samples/bpf/syncookie_kern.c
+++ b/samples/bpf/syncookie_kern.c
@@ -167,6 +167,7 @@ static __always_inline bool cookie_init_
  	for (i = 0; i < 10; i++) {
  		u8 opcode, opsize;

+skip_nop:
  		if (ptr >= end)
  			break;
  		if (ptr + 1 > data_end)
@@ -178,7 +179,7 @@ static __always_inline bool cookie_init_
  			break;
  		if (opcode == TCPOPT_NOP) {
  			++ptr;
-			continue;
+			goto skip_nop;
  		}

  		if (ptr + 1 >= end)

--cut--

// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
/* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */

#include <stdbool.h>
#include <stddef.h>

#include <uapi/linux/errno.h>
#include <uapi/linux/bpf.h>
#include <uapi/linux/pkt_cls.h>
#include <uapi/linux/if_ether.h>
#include <uapi/linux/in.h>
#include <uapi/linux/ip.h>
#include <uapi/linux/ipv6.h>
#include <uapi/linux/tcp.h>
#include <uapi/linux/netfilter/nf_conntrack_common.h>
#include <linux/minmax.h>
#include <vdso/time64.h>
#include <asm/unaligned.h>

#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define DEFAULT_MSS4 1460
#define DEFAULT_MSS6 1440
#define DEFAULT_WSCALE 7
#define DEFAULT_TTL 64
#define MAX_ALLOWED_PORTS 8

struct bpf_map_def SEC("maps") values = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 2,
};

struct bpf_map_def SEC("maps") allowed_ports = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u16),
	.max_entries = MAX_ALLOWED_PORTS,
};

#define IP_DF 0x4000
#define IP_MF 0x2000
#define IP_OFFSET 0x1fff

#define NEXTHDR_TCP 6

#define TCPOPT_NOP 1
#define TCPOPT_EOL 0
#define TCPOPT_MSS 2
#define TCPOPT_WINDOW 3
#define TCPOPT_SACK_PERM 4
#define TCPOPT_TIMESTAMP 8

#define TCPOLEN_MSS 4
#define TCPOLEN_WINDOW 3
#define TCPOLEN_SACK_PERM 2
#define TCPOLEN_TIMESTAMP 10

#define TCP_MAX_WSCALE 14U

#define TS_OPT_WSCALE_MASK 0xf
#define TS_OPT_SACK BIT(4)
#define TS_OPT_ECN BIT(5)
#define TSBITS 6
#define TSMASK (((__u32)1 << TSBITS) - 1)

#define TCP_TS_HZ 1000

#define IPV4_MAXLEN 60
#define TCP_MAXLEN 60

static __always_inline void swap_eth_addr(__u8 *a, __u8 *b)
{
	__u8 tmp[ETH_ALEN];

	__builtin_memcpy(tmp, a, ETH_ALEN);
	__builtin_memcpy(a, b, ETH_ALEN);
	__builtin_memcpy(b, tmp, ETH_ALEN);
}

static __always_inline __u16 csum_fold(__u32 csum)
{
	csum = (csum & 0xffff) + (csum >> 16);
	csum = (csum & 0xffff) + (csum >> 16);
	return (__u16)~csum;
}

static __always_inline __u16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
					       __u32 len, __u8 proto,
					       __u32 csum)
{
	__u64 s = csum;

	s += (__u32)saddr;
	s += (__u32)daddr;
#if defined(__BIG_ENDIAN__)
	s += proto + len;
#elif defined(__LITTLE_ENDIAN__)
	s += (proto + len) << 8;
#else
#error Unknown endian
#endif
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffffffff) + (s >> 32);

	return csum_fold((__u32)s);
}

static __always_inline __u16 csum_ipv6_magic(const struct in6_addr *saddr,
					     const struct in6_addr *daddr,
					     __u32 len, __u8 proto, __u32 csum)
{
	__u64 sum = csum;
	int i;

#pragma unroll
	for (i = 0; i < 4; i++)
		sum += (__u32)saddr->s6_addr32[i];

#pragma unroll
	for (i = 0; i < 4; i++)
		sum += (__u32)daddr->s6_addr32[i];

	// Don't combine additions to avoid 32-bit overflow.
	sum += bpf_htonl(len);
	sum += bpf_htonl(proto);

	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);

	return csum_fold((__u32)sum);
}

static __always_inline u64 tcp_clock_ns(void)
{
	return bpf_ktime_get_ns();
}

static __always_inline __u32 tcp_ns_to_ts(__u64 ns)
{
	return div_u64(ns, NSEC_PER_SEC / TCP_TS_HZ);
}

static __always_inline __u32 tcp_time_stamp_raw(void)
{
	return tcp_ns_to_ts(tcp_clock_ns());
}

static __always_inline bool cookie_init_timestamp_raw(struct tcphdr *tcp_header,
						      __u16 tcp_len,
						      __be32 *tsval,
						      __be32 *tsecr,
						      void *data_end)
{
	u8 wscale = TS_OPT_WSCALE_MASK;
	bool option_timestamp = false;
	bool option_sack = false;
	u8 *ptr, *end;
	u32 cookie;
	int i;

	ptr = (u8 *)(tcp_header + 1);
	end = (u8 *)tcp_header + tcp_len;

	for (i = 0; i < 10; i++) {
		u8 opcode, opsize;

		if (ptr >= end)
			break;
		if (ptr + 1 > data_end)
			return false;

		opcode = ptr[0];

		if (opcode == TCPOPT_EOL)
			break;
		if (opcode == TCPOPT_NOP) {
			++ptr;
			continue;
		}

		if (ptr + 1 >= end)
			break;
		if (ptr + 2 > data_end)
			return false;
		opsize = ptr[1];
		if (opsize < 2)
			break;

		if (ptr + opsize > end)
			break;

		switch (opcode) {
		case TCPOPT_WINDOW:
			if (opsize == TCPOLEN_WINDOW) {
				if (ptr + TCPOLEN_WINDOW > data_end)
					return false;
				wscale = min_t(u8, ptr[2], TCP_MAX_WSCALE);
			}
			break;
		case TCPOPT_TIMESTAMP:
			if (opsize == TCPOLEN_TIMESTAMP) {
				if (ptr + TCPOLEN_TIMESTAMP > data_end)
					return false;
				option_timestamp = true;
				/* Client's tsval becomes our tsecr. */
				*tsecr = get_unaligned((__be32 *)(ptr + 2));
			}
			break;
		case TCPOPT_SACK_PERM:
			if (opsize == TCPOLEN_SACK_PERM)
				option_sack = true;
			break;
		}

		ptr += opsize;
	}

	if (!option_timestamp)
		return false;

	cookie = tcp_time_stamp_raw() & ~TSMASK;
	cookie |= wscale & TS_OPT_WSCALE_MASK;
	if (option_sack)
		cookie |= TS_OPT_SACK;
	if (tcp_header->ece && tcp_header->cwr)
		cookie |= TS_OPT_ECN;
	*tsval = cpu_to_be32(cookie);

	return true;
}

static __always_inline void values_get_tcpipopts(__u16 *mss, __u8 *wscale,
						 __u8 *ttl, bool ipv6)
{
	__u32 key = 0;
	__u64 *value;

	value = bpf_map_lookup_elem(&values, &key);
	if (value && *value != 0) {
		if (ipv6)
			*mss = (*value >> 32) & 0xffff;
		else
			*mss = *value & 0xffff;
		*wscale = (*value >> 16) & 0xf;
		*ttl = (*value >> 24) & 0xff;
		return;
	}

	*mss = ipv6 ? DEFAULT_MSS6 : DEFAULT_MSS4;
	*wscale = DEFAULT_WSCALE;
	*ttl = DEFAULT_TTL;
}

static __always_inline void values_inc_synacks(void)
{
	__u32 key = 1;
	__u32 *value;

	value = bpf_map_lookup_elem(&values, &key);
	if (value)
		__sync_fetch_and_add(value, 1);
}

static __always_inline bool check_port_allowed(__u16 port)
{
	__u32 i;

	for (i = 0; i < MAX_ALLOWED_PORTS; i++) {
		__u32 key = i;
		__u16 *value;

		value = bpf_map_lookup_elem(&allowed_ports, &key);

		if (!value)
			break;
		// 0 is a terminator value. Check it first to avoid matching on
		// a forbidden port == 0 and returning true.
		if (*value == 0)
			break;

		if (*value == port)
			return true;
	}

	return false;
}

struct header_pointers {
	struct ethhdr *eth;
	struct iphdr *ipv4;
	struct ipv6hdr *ipv6;
	struct tcphdr *tcp;
	__u16 tcp_len;
};

static __always_inline int tcp_dissect(void *data, void *data_end,
				       struct header_pointers *hdr)
{
	hdr->eth = data;
	if (hdr->eth + 1 > data_end)
		return XDP_DROP;

	switch (bpf_ntohs(hdr->eth->h_proto)) {
	case ETH_P_IP:
		hdr->ipv6 = NULL;

		hdr->ipv4 = (void *)hdr->eth + sizeof(*hdr->eth);
		if (hdr->ipv4 + 1 > data_end)
			return XDP_DROP;
		if (hdr->ipv4->ihl * 4 < sizeof(*hdr->ipv4))
			return XDP_DROP;
		if (hdr->ipv4->version != 4)
			return XDP_DROP;

		if (hdr->ipv4->protocol != IPPROTO_TCP)
			return XDP_PASS;

		hdr->tcp = (void *)hdr->ipv4 + hdr->ipv4->ihl * 4;
		break;
	case ETH_P_IPV6:
		hdr->ipv4 = NULL;

		hdr->ipv6 = (void *)hdr->eth + sizeof(*hdr->eth);
		if (hdr->ipv6 + 1 > data_end)
			return XDP_DROP;
		if (hdr->ipv6->version != 6)
			return XDP_DROP;

		// XXX: Extension headers are not supported and could circumvent
		// XDP SYN flood protection.
		if (hdr->ipv6->nexthdr != NEXTHDR_TCP)
			return XDP_PASS;

		hdr->tcp = (void *)hdr->ipv6 + sizeof(*hdr->ipv6);
		break;
	default:
		// XXX: VLANs will circumvent XDP SYN flood protection.
		return XDP_PASS;
	}

	if (hdr->tcp + 1 > data_end)
		return XDP_DROP;
	hdr->tcp_len = hdr->tcp->doff * 4;
	if (hdr->tcp_len < sizeof(*hdr->tcp))
		return XDP_DROP;

	return XDP_TX;
}

static __always_inline __u8 tcp_mkoptions(__be32 *buf, __be32 *tsopt, __u16 mss,
					  __u8 wscale)
{
	__be32 *start = buf;

	*buf++ = bpf_htonl((TCPOPT_MSS << 24) | (TCPOLEN_MSS << 16) | mss);

	if (!tsopt)
		return buf - start;

	if (tsopt[0] & bpf_htonl(1 << 4))
		*buf++ = bpf_htonl((TCPOPT_SACK_PERM << 24) |
				   (TCPOLEN_SACK_PERM << 16) |
				   (TCPOPT_TIMESTAMP << 8) |
				   TCPOLEN_TIMESTAMP);
	else
		*buf++ = bpf_htonl((TCPOPT_NOP << 24) |
				   (TCPOPT_NOP << 16) |
				   (TCPOPT_TIMESTAMP << 8) |
				   TCPOLEN_TIMESTAMP);
	*buf++ = tsopt[0];
	*buf++ = tsopt[1];

	if ((tsopt[0] & bpf_htonl(0xf)) != bpf_htonl(0xf))
		*buf++ = bpf_htonl((TCPOPT_NOP << 24) |
				   (TCPOPT_WINDOW << 16) |
				   (TCPOLEN_WINDOW << 8) |
				   wscale);

	return buf - start;
}

static __always_inline void tcp_gen_synack(struct tcphdr *tcp_header,
					   __u32 cookie, __be32 *tsopt,
					   __u16 mss, __u8 wscale)
{
	void *tcp_options;

	tcp_flag_word(tcp_header) = TCP_FLAG_SYN | TCP_FLAG_ACK;
	if (tsopt && (tsopt[0] & bpf_htonl(1 << 5)))
		tcp_flag_word(tcp_header) |= TCP_FLAG_ECE;
	tcp_header->doff = 5; // doff is part of tcp_flag_word.
	swap(tcp_header->source, tcp_header->dest);
	tcp_header->ack_seq = bpf_htonl(bpf_ntohl(tcp_header->seq) + 1);
	tcp_header->seq = bpf_htonl(cookie);
	tcp_header->window = 0;
	tcp_header->urg_ptr = 0;
	tcp_header->check = 0; // Rely on hardware checksum offload.

	tcp_options = (void *)(tcp_header + 1);
	tcp_header->doff += tcp_mkoptions(tcp_options, tsopt, mss, wscale);
}

static __always_inline void tcpv4_gen_synack(struct header_pointers *hdr,
					     __u32 cookie, __be32 *tsopt)
{
	__u8 wscale;
	__u16 mss;
	__u8 ttl;

	values_get_tcpipopts(&mss, &wscale, &ttl, false);

	swap_eth_addr(hdr->eth->h_source, hdr->eth->h_dest);

	swap(hdr->ipv4->saddr, hdr->ipv4->daddr);
	hdr->ipv4->check = 0; // Rely on hardware checksum offload.
	hdr->ipv4->tos = 0;
	hdr->ipv4->id = 0;
	hdr->ipv4->ttl = ttl;

	tcp_gen_synack(hdr->tcp, cookie, tsopt, mss, wscale);

	hdr->tcp_len = hdr->tcp->doff * 4;
	hdr->ipv4->tot_len = bpf_htons(sizeof(*hdr->ipv4) + hdr->tcp_len);
}

static __always_inline void tcpv6_gen_synack(struct header_pointers *hdr,
					     __u32 cookie, __be32 *tsopt)
{
	__u8 wscale;
	__u16 mss;
	__u8 ttl;

	values_get_tcpipopts(&mss, &wscale, &ttl, true);

	swap_eth_addr(hdr->eth->h_source, hdr->eth->h_dest);

	swap(hdr->ipv6->saddr, hdr->ipv6->daddr);
	*(__be32 *)hdr->ipv6 = bpf_htonl(0x60000000);
	hdr->ipv6->hop_limit = ttl;

	tcp_gen_synack(hdr->tcp, cookie, tsopt, mss, wscale);

	hdr->tcp_len = hdr->tcp->doff * 4;
	hdr->ipv6->payload_len = bpf_htons(hdr->tcp_len);
}

static __always_inline int syncookie_handle_syn(struct header_pointers *hdr,
						struct xdp_md *ctx,
						void *data, void *data_end)
{
	__u32 old_pkt_size, new_pkt_size;
	// Unlike clang 10, clang 11 and 12 generate code that doesn't pass the
	// BPF verifier if tsopt is not volatile. Volatile forces it to store
	// the pointer value and use it directly, otherwise tcp_mkoptions is
	// (mis)compiled like this:
	//   if (!tsopt)
	//       return buf - start;
	//   reg = stored_return_value_of_bpf_tcp_raw_gen_tscookie;
	//   if (reg == 0)
	//       tsopt = tsopt_buf;
	//   else
	//       tsopt = NULL;
	//   ...
	//   *buf++ = tsopt[1];
	// It creates a dead branch where tsopt is assigned NULL, but the
	// verifier can't prove it's dead and blocks the program.
	__be32 * volatile tsopt = NULL;
	__be32 tsopt_buf[2];
	void *ip_header;
	__u16 ip_len;
	__u32 cookie;
	__s64 value;

	if (hdr->ipv4) {
		// Check the IPv4 and TCP checksums before creating a SYNACK.
		value = bpf_csum_diff(0, 0, (void *)hdr->ipv4, hdr->ipv4->ihl * 4, 0);
		if (value < 0)
			return XDP_ABORTED;
		if (csum_fold(value) != 0)
			return XDP_DROP; // Bad IPv4 checksum.

		value = bpf_csum_diff(0, 0, (void *)hdr->tcp, hdr->tcp_len, 0);
		if (value < 0)
			return XDP_ABORTED;
		if (csum_tcpudp_magic(hdr->ipv4->saddr, hdr->ipv4->daddr,
				      hdr->tcp_len, IPPROTO_TCP, value) != 0)
			return XDP_DROP; // Bad TCP checksum.

		ip_header = hdr->ipv4;
		ip_len = sizeof(*hdr->ipv4);
	} else if (hdr->ipv6) {
		// Check the TCP checksum before creating a SYNACK.
		value = bpf_csum_diff(0, 0, (void *)hdr->tcp, hdr->tcp_len, 0);
		if (value < 0)
			return XDP_ABORTED;
		if (csum_ipv6_magic(&hdr->ipv6->saddr, &hdr->ipv6->daddr,
				    hdr->tcp_len, IPPROTO_TCP, value) != 0)
			return XDP_DROP; // Bad TCP checksum.

		ip_header = hdr->ipv6;
		ip_len = sizeof(*hdr->ipv6);
	} else {
		return XDP_ABORTED;
	}

	// Issue SYN cookies on allowed ports, drop SYN packets on
	// blocked ports.
	if (!check_port_allowed(bpf_ntohs(hdr->tcp->dest)))
		return XDP_DROP;

	value = bpf_tcp_raw_gen_syncookie(ip_header, ip_len,
					  (void *)hdr->tcp, hdr->tcp_len);
	if (value < 0)
		return XDP_ABORTED;
	cookie = (__u32)value;

	if (cookie_init_timestamp_raw((void *)hdr->tcp, hdr->tcp_len,
				      &tsopt_buf[0], &tsopt_buf[1], data_end))
		tsopt = tsopt_buf;

	// Check that there is enough space for a SYNACK. It also covers
	// the check that the destination of the __builtin_memmove below
	// doesn't overflow.
	if (data + sizeof(*hdr->eth) + ip_len + TCP_MAXLEN > data_end)
		return XDP_ABORTED;

	if (hdr->ipv4) {
		if (hdr->ipv4->ihl * 4 > sizeof(*hdr->ipv4)) {
			struct tcphdr *new_tcp_header;

			new_tcp_header = data + sizeof(*hdr->eth) + sizeof(*hdr->ipv4);
			__builtin_memmove(new_tcp_header, hdr->tcp, sizeof(*hdr->tcp));
			hdr->tcp = new_tcp_header;

			hdr->ipv4->ihl = sizeof(*hdr->ipv4) / 4;
		}

		tcpv4_gen_synack(hdr, cookie, tsopt);
	} else if (hdr->ipv6) {
		tcpv6_gen_synack(hdr, cookie, tsopt);
	} else {
		return XDP_ABORTED;
	}

	// Recalculate checksums.
	hdr->tcp->check = 0;
	value = bpf_csum_diff(0, 0, (void *)hdr->tcp, hdr->tcp_len, 0);
	if (value < 0)
		return XDP_ABORTED;
	if (hdr->ipv4) {
		hdr->tcp->check = csum_tcpudp_magic(hdr->ipv4->saddr,
						    hdr->ipv4->daddr,
						    hdr->tcp_len,
						    IPPROTO_TCP,
						    value);

		hdr->ipv4->check = 0;
		value = bpf_csum_diff(0, 0, (void *)hdr->ipv4, sizeof(*hdr->ipv4), 0);
		if (value < 0)
			return XDP_ABORTED;
		hdr->ipv4->check = csum_fold(value);
	} else if (hdr->ipv6) {
		hdr->tcp->check = csum_ipv6_magic(&hdr->ipv6->saddr,
						  &hdr->ipv6->daddr,
						  hdr->tcp_len,
						  IPPROTO_TCP,
						  value);
	} else {
		return XDP_ABORTED;
	}

	// Set the new packet size.
	old_pkt_size = data_end - data;
	new_pkt_size = sizeof(*hdr->eth) + ip_len + hdr->tcp->doff * 4;
	if (bpf_xdp_adjust_tail(ctx, new_pkt_size - old_pkt_size))
		return XDP_ABORTED;

	values_inc_synacks();

	return XDP_TX;
}

static __always_inline int syncookie_handle_ack(struct header_pointers *hdr)
{
	int err;

	if (hdr->ipv4)
		err = bpf_tcp_raw_check_syncookie(hdr->ipv4, sizeof(*hdr->ipv4),
						  (void *)hdr->tcp, hdr->tcp_len);
	else if (hdr->ipv6)
		err = bpf_tcp_raw_check_syncookie(hdr->ipv6, sizeof(*hdr->ipv6),
						  (void *)hdr->tcp, hdr->tcp_len);
	else
		return XDP_ABORTED;
	if (err)
		return XDP_DROP;

	return XDP_PASS;
}

SEC("xdp/syncookie")
int syncookie_xdp(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct header_pointers hdr;
	struct bpf_sock_tuple tup;
	struct bpf_nf_conn *ct;
	__u32 tup_size;
	__s64 value;
	int ret;

	ret = tcp_dissect(data, data_end, &hdr);
	if (ret != XDP_TX)
		return ret;

	if (hdr.ipv4) {
		// TCP doesn't normally use fragments, and XDP can't reassemble them.
		if ((hdr.ipv4->frag_off & bpf_htons(IP_DF | IP_MF | IP_OFFSET)) !=
		    bpf_htons(IP_DF))
			return XDP_DROP;

		tup.ipv4.saddr = hdr.ipv4->saddr;
		tup.ipv4.daddr = hdr.ipv4->daddr;
		tup.ipv4.sport = hdr.tcp->source;
		tup.ipv4.dport = hdr.tcp->dest;
		tup_size = sizeof(tup.ipv4);
	} else if (hdr.ipv6) {
		__builtin_memcpy(tup.ipv6.saddr, &hdr.ipv6->saddr,
				 sizeof(tup.ipv6.saddr));
		__builtin_memcpy(tup.ipv6.daddr, &hdr.ipv6->daddr,
				 sizeof(tup.ipv6.daddr));
		tup.ipv6.sport = hdr.tcp->source;
		tup.ipv6.dport = hdr.tcp->dest;
		tup_size = sizeof(tup.ipv6);
	} else {
		// The verifier can't track that either ipv4 or ipv6 is not NULL.
		return XDP_ABORTED;
	}

	value = 0; // Flags.
	ct = bpf_ct_lookup_tcp(ctx, &tup, tup_size, BPF_F_CURRENT_NETNS, &value);
	if (ct) {
		unsigned long status = ct->status;

		bpf_ct_release(ct);
		// IPS_CONFIRMED_BIT is a bit number, not a mask.
		if (status & (1 << IPS_CONFIRMED_BIT))
			return XDP_PASS;
	} else if (value != -ENOENT) {
		return XDP_ABORTED;
	}

	// value == -ENOENT || !(status & (1 << IPS_CONFIRMED_BIT))

	// Exactly one of SYN and ACK must be set: a SYN starts the cookie
	// handshake, and an ACK may carry a cookie back. Drop anything else.
	if ((hdr.tcp->syn ^ hdr.tcp->ack) != 1)
		return XDP_DROP;

	// Grow the TCP header to TCP_MAXLEN to be able to pass any hdr.tcp_len
	// to bpf_tcp_raw_gen_syncookie and pass the verifier.
	if (bpf_xdp_adjust_tail(ctx, TCP_MAXLEN - hdr.tcp_len))
		return XDP_ABORTED;

	data_end = (void *)(long)ctx->data_end;
	data = (void *)(long)ctx->data;

	if (hdr.ipv4) {
		hdr.eth = data;
		hdr.ipv4 = (void *)hdr.eth + sizeof(*hdr.eth);
		// IPV4_MAXLEN is needed when calculating checksum.
		// At least sizeof(struct iphdr) is needed here to access ihl.
		if ((void *)hdr.ipv4 + IPV4_MAXLEN > data_end)
			return XDP_ABORTED;
		hdr.tcp = (void *)hdr.ipv4 + hdr.ipv4->ihl * 4;
	} else if (hdr.ipv6) {
		hdr.eth = data;
		hdr.ipv6 = (void *)hdr.eth + sizeof(*hdr.eth);
		hdr.tcp = (void *)hdr.ipv6 + sizeof(*hdr.ipv6);
	} else {
		return XDP_ABORTED;
	}

	if ((void *)hdr.tcp + TCP_MAXLEN > data_end)
		return XDP_ABORTED;

	// We run out of registers, tcp_len gets spilled to the stack, and the
	// verifier forgets its min and max values checked above in tcp_dissect.
	hdr.tcp_len = hdr.tcp->doff * 4;
	if (hdr.tcp_len < sizeof(*hdr.tcp))
		return XDP_ABORTED;

	return hdr.tcp->syn ? syncookie_handle_syn(&hdr, ctx, data, data_end) :
			      syncookie_handle_ack(&hdr);
}

SEC("xdp/dummy")
int dummy_xdp(struct xdp_md *ctx)
{
	// veth requires XDP programs to be set on both sides.
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-03 14:02           ` Maxim Mikityanskiy
@ 2021-11-09  7:11             ` Yonghong Song
  2021-11-25 14:34               ` Maxim Mikityanskiy
  0 siblings, 1 reply; 48+ messages in thread
From: Yonghong Song @ 2021-11-09  7:11 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Toke Høiland-Jørgensen, Lorenz Bauer,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux



On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
> On 2021-11-03 04:10, Yonghong Song wrote:
>>
>>
>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>>
>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, 
>>>>>> __be32 *tsecr)
>>>>>
>>>>> I'm probably missing context, Is there something in this function that
>>>>> means you can't implement it in BPF?
>>>>
>>>> I was about to reply with some other comments but upon closer 
>>>> inspection
>>>> I ended up at the same conclusion: this helper doesn't seem to be 
>>>> needed
>>>> at all?
>>>
>>> After trying to put this code into BPF (replacing the underlying 
>>> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues with 
>>> passing the verifier.
>>>
>>> In addition to comparing ptr to end, I had to add checks that compare 
>>> ptr to data_end, because the verifier can't deduce that end <= 
>>> data_end. More branches will add a certain slowdown (not measured).
>>>
>>> A more serious issue is the overall program complexity. Even though 
>>> the loop over the TCP options has an upper bound, and the pointer 
>>> advances by at least one byte every iteration, I had to limit the 
>>> total number of iterations artificially. The maximum number of 
>>> iterations that makes the verifier happy is 10. With more iterations, 
>>> I have the following error:
>>>
>>> BPF program is too large. Processed 1000001 insn
>>>
>>>                         processed 1000001 insns (limit 1000000) 
>>> max_states_per_insn 29 total_states 35489 peak_states 596 mark_read 45
>>>
>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>>> accumulated amount of instructions that the verifier can process in 
>>> all branches, is that right? It doesn't look realistic that my 
>>> program can run 1 million instructions in a single run, but it might 
>>> be that if you take all possible flows and add up the instructions 
>>> from these flows, it will exceed 1 million.
>>>
>>> The limitation of maximum 10 TCP options might be not enough, given 
>>> that valid packets are permitted to include more than 10 NOPs. An 
>>> alternative of using bpf_load_hdr_opt and calling it three times 
>>> doesn't look good either, because it will be about three times slower 
>>> than going over the options once. So maybe having a helper for that 
>>> is better than trying to fit it into BPF?
>>>
>>> One more interesting fact is the time that it takes for the verifier 
>>> to check my program. If it's limited to 10 iterations, it does it 
>>> pretty fast, but if I try to increase the number to 11 iterations, it 
>>> takes several minutes for the verifier to reach 1 million 
>>> instructions and print the error then. I also tried grouping the NOPs 
>>> in an inner loop to count only 10 real options, and the verifier has 
>>> been running for a few hours without any response. Is it normal? 
>>
>> Maxim, this may expose a verifier bug. Do you have a reproducer I can 
>> access? I would like to debug this to see what is the root cause. Thanks!
> 
> Thanks, I appreciate your help in debugging it. The reproducer is based 
> on the modified XDP program from patch 10 in this series. You'll need to 
> apply at least patches 6, 7, 8 from this series to get new BPF helpers 
> needed for the XDP program (tell me if that's a problem, I can try to 
> remove usage of new helpers, but it will affect the program length and 
> may produce different results in the verifier).
> 
> See the C code of the program that passes the verifier (compiled with 
> clang version 12.0.0-1ubuntu1) in the bottom of this email. If you 
> increase the loop boundary from 10 to at least 11 in 
> cookie_init_timestamp_raw(), it fails the verifier after a few minutes. 

I tried to reproduce with latest llvm (llvm-project repo),
loop boundary 10 is okay and 11 exceeds the 1M complexity limit. For 10,
the number of verified instructions is 563626 (more than 0.5M) so it is
totally possible that one more iteration just blows past the limit.


> If you apply this tiny change, it fails the verifier after about 3 hours:
> 
> --- a/samples/bpf/syncookie_kern.c
> +++ b/samples/bpf/syncookie_kern.c
> @@ -167,6 +167,7 @@ static __always_inline bool cookie_init_
>       for (i = 0; i < 10; i++) {
>           u8 opcode, opsize;
> 
> +skip_nop:
>           if (ptr >= end)
>               break;
>           if (ptr + 1 > data_end)
> @@ -178,7 +179,7 @@ static __always_inline bool cookie_init_
>               break;
>           if (opcode == TCPOPT_NOP) {
>               ++ptr;
> -            continue;
> +            goto skip_nop;
>           }
> 
>           if (ptr + 1 >= end)

I tried this as well, with latest llvm, and got the following errors
in ~30 seconds:

......
536: (79) r2 = *(u64 *)(r10 -96)
537: R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) 
R1=pkt(id=9,off=499,r=518,umax_value=60,var_off=(0x0; 0x3c)) 
R2=pkt_end(id=0,off=0,imm=0) 
R3=pkt(id=27,off=14,r=0,umin_value=20,umax_value=120,var_off=(0x0; 
0x7c),s32_min_value=0,s32_max_value=124,u32_max_value=124) R4=invP3 
R5=inv1 R6=ctx(id=0,off=0,imm=0) R7=pkt(id=9,off=519,r=518,umax_va^C
[yhs@devbig309.ftw3 ~/work/bpf-next/samples/bpf] tail -f log
550: (55) if r0 != 0x4 goto pc+4
The sequence of 8193 jumps is too complex.
verification time 30631375 usec
stack depth 160
processed 330595 insns (limit 1000000) max_states_per_insn 4 
total_states 20377 peak_states 100 mark_read 37

With llvm12, I got the similar verification error:

The sequence of 8193 jumps is too complex.
processed 330592 insns (limit 1000000) max_states_per_insn 4 
total_states 20378 peak_states 101 mark_read 37

Could you check again with your experiment which takes 3 hours to
fail? What is the verification failure log?

> 
> --cut--
> 
> // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> /* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights 
> reserved. */
> 
[...]


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-09  7:11             ` Yonghong Song
@ 2021-11-25 14:34               ` Maxim Mikityanskiy
  2021-11-26  5:43                 ` Yonghong Song
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-11-25 14:34 UTC (permalink / raw)
  To: Yonghong Song, Toke Høiland-Jørgensen, Lorenz Bauer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux

On 2021-11-09 09:11, Yonghong Song wrote:
> 
> 
> On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
>> On 2021-11-03 04:10, Yonghong Song wrote:
>>>
>>>
>>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>>>
>>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 *tsval, 
>>>>>>> __be32 *tsecr)
>>>>>>
>>>>>> I'm probably missing context, Is there something in this function 
>>>>>> that
>>>>>> means you can't implement it in BPF?
>>>>>
>>>>> I was about to reply with some other comments but upon closer 
>>>>> inspection
>>>>> I ended up at the same conclusion: this helper doesn't seem to be 
>>>>> needed
>>>>> at all?
>>>>
>>>> After trying to put this code into BPF (replacing the underlying 
>>>> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues with 
>>>> passing the verifier.
>>>>
>>>> In addition to comparing ptr to end, I had to add checks that 
>>>> compare ptr to data_end, because the verifier can't deduce that end 
>>>> <= data_end. More branches will add a certain slowdown (not measured).
>>>>
>>>> A more serious issue is the overall program complexity. Even though 
>>>> the loop over the TCP options has an upper bound, and the pointer 
>>>> advances by at least one byte every iteration, I had to limit the 
>>>> total number of iterations artificially. The maximum number of 
>>>> iterations that makes the verifier happy is 10. With more 
>>>> iterations, I have the following error:
>>>>
>>>> BPF program is too large. Processed 1000001 insn
>>>>
>>>>                         processed 1000001 insns (limit 1000000) 
>>>> max_states_per_insn 29 total_states 35489 peak_states 596 mark_read 45
>>>>
>>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>>>> accumulated amount of instructions that the verifier can process in 
>>>> all branches, is that right? It doesn't look realistic that my 
>>>> program can run 1 million instructions in a single run, but it might 
>>>> be that if you take all possible flows and add up the instructions 
>>>> from these flows, it will exceed 1 million.
>>>>
>>>> The limitation of maximum 10 TCP options might be not enough, given 
>>>> that valid packets are permitted to include more than 10 NOPs. An 
>>>> alternative of using bpf_load_hdr_opt and calling it three times 
>>>> doesn't look good either, because it will be about three times 
>>>> slower than going over the options once. So maybe having a helper 
>>>> for that is better than trying to fit it into BPF?
>>>>
>>>> One more interesting fact is the time that it takes for the verifier 
>>>> to check my program. If it's limited to 10 iterations, it does it 
>>>> pretty fast, but if I try to increase the number to 11 iterations, 
>>>> it takes several minutes for the verifier to reach 1 million 
>>>> instructions and print the error then. I also tried grouping the 
>>>> NOPs in an inner loop to count only 10 real options, and the 
>>>> verifier has been running for a few hours without any response. Is 
>>>> it normal? 
>>>
>>> Maxim, this may expose a verifier bug. Do you have a reproducer I can 
>>> access? I would like to debug this to see what is the root cause. Thanks!
>>
>> Thanks, I appreciate your help in debugging it. The reproducer is 
>> based on the modified XDP program from patch 10 in this series. You'll 
>> need to apply at least patches 6, 7, 8 from this series to get new BPF 
>> helpers needed for the XDP program (tell me if that's a problem, I can 
>> try to remove usage of new helpers, but it will affect the program 
>> length and may produce different results in the verifier).
>>
>> See the C code of the program that passes the verifier (compiled with 
>> clang version 12.0.0-1ubuntu1) in the bottom of this email. If you 
>> increase the loop boundary from 10 to at least 11 in 
>> cookie_init_timestamp_raw(), it fails the verifier after a few minutes. 
> 
> I tried to reproduce with latest llvm (llvm-project repo),
> loop boundary 10 is okay and 11 exceeds the 1M complexity limit. For 10,
> the number of verified instructions is 563626 (more than 0.5M) so it is
> totally possible that one more iteration just blows past the limit.

So, does that mean the verification complexity grows exponentially with 
the number of loop iterations (options parsed)?

Is it a good enough reason to keep this code as a BPF helper, rather 
than trying to fit it into the BPF program?

> 
>> If you apply this tiny change, it fails the verifier after about 3 hours:
>>
>> --- a/samples/bpf/syncookie_kern.c
>> +++ b/samples/bpf/syncookie_kern.c
>> @@ -167,6 +167,7 @@ static __always_inline bool cookie_init_
>>       for (i = 0; i < 10; i++) {
>>           u8 opcode, opsize;
>>
>> +skip_nop:
>>           if (ptr >= end)
>>               break;
>>           if (ptr + 1 > data_end)
>> @@ -178,7 +179,7 @@ static __always_inline bool cookie_init_
>>               break;
>>           if (opcode == TCPOPT_NOP) {
>>               ++ptr;
>> -            continue;
>> +            goto skip_nop;
>>           }
>>
>>           if (ptr + 1 >= end)
> 
> I tried this as well, with latest llvm, and got the following errors
> in ~30 seconds:
> 
> ......
> 536: (79) r2 = *(u64 *)(r10 -96)
> 537: R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) 
> R1=pkt(id=9,off=499,r=518,umax_value=60,var_off=(0x0; 0x3c)) 
> R2=pkt_end(id=0,off=0,imm=0) 
> R3=pkt(id=27,off=14,r=0,umin_value=20,umax_value=120,var_off=(0x0; 
> 0x7c),s32_min_value=0,s32_max_value=124,u32_max_value=124) R4=invP3 
> R5=inv1 R6=ctx(id=0,off=0,imm=0) R7=pkt(id=9,off=519,r=518,umax_va^C
> [yhs@devbig309.ftw3 ~/work/bpf-next/samples/bpf] tail -f log
> 550: (55) if r0 != 0x4 goto pc+4
> The sequence of 8193 jumps is too complex.
> verification time 30631375 usec
> stack depth 160
> processed 330595 insns (limit 1000000) max_states_per_insn 4 
> total_states 20377 peak_states 100 mark_read 37
> 
> With llvm12, I got the similar verification error:
> 
> The sequence of 8193 jumps is too complex.
> processed 330592 insns (limit 1000000) max_states_per_insn 4 
> total_states 20378 peak_states 101 mark_read 37
> 
> Could you check again with your experiment which takes 3 hours to
> fail? What is the verification failure log?

The log is similar:

...
; if (opsize == TCPOLEN_WINDOW) {
460: (55) if r8 != 0x3 goto pc+31
 
R0_w=pkt(id=28132,off=4037,r=0,umin_value=20,umax_value=2610,var_off=(0x0; 
0x3ffff),s32_min_value=0,s32_max_value=262143,u32_max_value=262143) 
R1=inv0 
R2=pkt(id=27,off=14,r=0,umin_value=20,umax_value=120,var_off=(0x0; 
0x7c),s32_min_value=0,s32_max_value=124,u32_max_value=124) R3_w=inv3 
R4_w=inv9 R5=inv0 R6=ctx(id=0,off=0,imm=0) 
R7_w=pkt(id=44,off=4047,r=4039,umin_value=18,umax_value=2355,var_off=(0x0; 
0x1ffff),s32_min_value=0,s32_max_value=131071,u32_max_value=131071) 
R8_w=invP3 R9=inv0 R10=fp0 fp-16=????mmmm fp-24=00000000 fp-64=????mmmm 
fp-72=mmmmmmmm fp-80=mmmmmmmm fp-88=pkt fp-96=pkt_end fp-104=pkt 
fp-112=pkt fp-120=inv20 fp-128=mmmmmmmm fp-136_w=inv14 fp-144=pkt
; if (ptr + TCPOLEN_WINDOW > data_end)
461: (bf) r3 = r7
462: (07) r3 += -7
; if (ptr + TCPOLEN_WINDOW > data_end)
463: (79) r8 = *(u64 *)(r10 -96)
464: (2d) if r3 > r8 goto pc+56
The sequence of 8193 jumps is too complex.
processed 414429 insns (limit 1000000) max_states_per_insn 4 
total_states 8097 peak_states 97 mark_read 49

libbpf: -- END LOG --
libbpf: failed to load program 'syncookie_xdp'
libbpf: failed to load object '.../samples/bpf/syncookie_kern.o'
Error: bpf_prog_load: Unknown error 4007

real    189m49.659s
user    0m0.012s
sys     189m26.322s

Ubuntu clang version 12.0.0-1ubuntu1

I wonder why it takes only 30 seconds for you. As I understand, the 
expectation is less than 1 second anyway, but the difference between 30 
seconds and 3 hours is huge. Maybe some kernel config options matter 
(KASAN?).

Is it expected that increasing the loop length linearly increases the 
verifying complexity exponentially? Is there any mitigation?

Thanks,
Max

>>
>> --cut--
>>
>> // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>> /* Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights 
>> reserved. */
>>
> [...]



* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-25 14:34               ` Maxim Mikityanskiy
@ 2021-11-26  5:43                 ` Yonghong Song
  2021-11-26 16:50                   ` Maxim Mikityanskiy
  0 siblings, 1 reply; 48+ messages in thread
From: Yonghong Song @ 2021-11-26  5:43 UTC (permalink / raw)
  To: Maxim Mikityanskiy, Toke Høiland-Jørgensen, Lorenz Bauer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux



On 11/25/21 6:34 AM, Maxim Mikityanskiy wrote:
> On 2021-11-09 09:11, Yonghong Song wrote:
>>
>>
>> On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
>>> On 2021-11-03 04:10, Yonghong Song wrote:
>>>>
>>>>
>>>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>>>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>>>>
>>>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 
>>>>>>>> *tsval, __be32 *tsecr)
>>>>>>>
>>>>>>> I'm probably missing context, Is there something in this function 
>>>>>>> that
>>>>>>> means you can't implement it in BPF?
>>>>>>
>>>>>> I was about to reply with some other comments but upon closer 
>>>>>> inspection
>>>>>> I ended up at the same conclusion: this helper doesn't seem to be 
>>>>>> needed
>>>>>> at all?
>>>>>
>>>>> After trying to put this code into BPF (replacing the underlying 
>>>>> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues 
>>>>> with passing the verifier.
>>>>>
>>>>> In addition to comparing ptr to end, I had to add checks that 
>>>>> compare ptr to data_end, because the verifier can't deduce that end 
>>>>> <= data_end. More branches will add a certain slowdown (not measured).
>>>>>
>>>>> A more serious issue is the overall program complexity. Even though 
>>>>> the loop over the TCP options has an upper bound, and the pointer 
>>>>> advances by at least one byte every iteration, I had to limit the 
>>>>> total number of iterations artificially. The maximum number of 
>>>>> iterations that makes the verifier happy is 10. With more 
>>>>> iterations, I have the following error:
>>>>>
>>>>> BPF program is too large. Processed 1000001 insn
>>>>>
>>>>>                         processed 1000001 insns (limit 1000000) 
>>>>> max_states_per_insn 29 total_states 35489 peak_states 596 mark_read 45
>>>>>
>>>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>>>>> accumulated amount of instructions that the verifier can process in 
>>>>> all branches, is that right? It doesn't look realistic that my 
>>>>> program can run 1 million instructions in a single run, but it 
>>>>> might be that if you take all possible flows and add up the 
>>>>> instructions from these flows, it will exceed 1 million.
>>>>>
>>>>> The limitation of maximum 10 TCP options might be not enough, given 
>>>>> that valid packets are permitted to include more than 10 NOPs. An 
>>>>> alternative of using bpf_load_hdr_opt and calling it three times 
>>>>> doesn't look good either, because it will be about three times 
>>>>> slower than going over the options once. So maybe having a helper 
>>>>> for that is better than trying to fit it into BPF?
>>>>>
>>>>> One more interesting fact is the time that it takes for the 
>>>>> verifier to check my program. If it's limited to 10 iterations, it 
>>>>> does it pretty fast, but if I try to increase the number to 11 
>>>>> iterations, it takes several minutes for the verifier to reach 1 
>>>>> million instructions and print the error then. I also tried 
>>>>> grouping the NOPs in an inner loop to count only 10 real options, 
>>>>> and the verifier has been running for a few hours without any 
>>>>> response. Is it normal? 
>>>>
>>>> Maxim, this may expose a verifier bug. Do you have a reproducer I 
>>>> can access? I would like to debug this to see what is the root 
>>>> cause. Thanks!
>>>
>>> Thanks, I appreciate your help in debugging it. The reproducer is 
>>> based on the modified XDP program from patch 10 in this series. 
>>> You'll need to apply at least patches 6, 7, 8 from this series to get 
>>> new BPF helpers needed for the XDP program (tell me if that's a 
>>> problem, I can try to remove usage of new helpers, but it will affect 
>>> the program length and may produce different results in the verifier).
>>>
>>> See the C code of the program that passes the verifier (compiled with 
>>> clang version 12.0.0-1ubuntu1) in the bottom of this email. If you 
>>> increase the loop boundary from 10 to at least 11 in 
>>> cookie_init_timestamp_raw(), it fails the verifier after a few minutes. 
>>
>> I tried to reproduce with latest llvm (llvm-project repo),
>> loop boundary 10 is okay and 11 exceeds the 1M complexity limit. For 10,
>> the number of verified instructions is 563626 (more than 0.5M) so it is
>> totally possible that one more iteration just blows past the limit.
> 
> So, does it mean that the verifying complexity grows exponentially with 
> increasing the number of loop iterations (options parsed)?

Depending on verification-time pruning results, it is possible that 
slightly increasing the number of branches results in a several-fold 
increase (2x, 4x, etc.) in the number of to-be-verified dynamic 
instructions.

> 
> Is it a good enough reason to keep this code as a BPF helper, rather 
> than trying to fit it into the BPF program?

Another option is to use a global function, which is verified separately
from the main bpf program.

> 
>>
>>> If you apply this tiny change, it fails the verifier after about 3 
>>> hours:
>>>
[...]


* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-26  5:43                 ` Yonghong Song
@ 2021-11-26 16:50                   ` Maxim Mikityanskiy
  2021-11-26 17:07                     ` Yonghong Song
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-11-26 16:50 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Toke Høiland-Jørgensen, Lorenz Bauer,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux

On 2021-11-26 07:43, Yonghong Song wrote:
> 
> 
> On 11/25/21 6:34 AM, Maxim Mikityanskiy wrote:
>> On 2021-11-09 09:11, Yonghong Song wrote:
>>>
>>>
>>> On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
>>>> On 2021-11-03 04:10, Yonghong Song wrote:
>>>>>
>>>>>
>>>>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>>>>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>>>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>>>>>
>>>>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 
>>>>>>>>> *tsval, __be32 *tsecr)
>>>>>>>>
>>>>>>>> I'm probably missing context, Is there something in this 
>>>>>>>> function that
>>>>>>>> means you can't implement it in BPF?
>>>>>>>
>>>>>>> I was about to reply with some other comments but upon closer 
>>>>>>> inspection
>>>>>>> I ended up at the same conclusion: this helper doesn't seem to be 
>>>>>>> needed
>>>>>>> at all?
>>>>>>
>>>>>> After trying to put this code into BPF (replacing the underlying 
>>>>>> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues 
>>>>>> with passing the verifier.
>>>>>>
>>>>>> In addition to comparing ptr to end, I had to add checks that 
>>>>>> compare ptr to data_end, because the verifier can't deduce that 
>>>>>> end <= data_end. More branches will add a certain slowdown (not 
>>>>>> measured).
>>>>>>
>>>>>> A more serious issue is the overall program complexity. Even 
>>>>>> though the loop over the TCP options has an upper bound, and the 
>>>>>> pointer advances by at least one byte every iteration, I had to 
>>>>>> limit the total number of iterations artificially. The maximum 
>>>>>> number of iterations that makes the verifier happy is 10. With 
>>>>>> more iterations, I have the following error:
>>>>>>
>>>>>> BPF program is too large. Processed 1000001 insn
>>>>>>
>>>>>>                         processed 1000001 insns (limit 1000000) 
>>>>>> max_states_per_insn 29 total_states 35489 peak_states 596 
>>>>>> mark_read 45
>>>>>>
>>>>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>>>>>> accumulated amount of instructions that the verifier can process 
>>>>>> in all branches, is that right? It doesn't look realistic that my 
>>>>>> program can run 1 million instructions in a single run, but it 
>>>>>> might be that if you take all possible flows and add up the 
>>>>>> instructions from these flows, it will exceed 1 million.
>>>>>>
>>>>>> The limitation of maximum 10 TCP options might be not enough, 
>>>>>> given that valid packets are permitted to include more than 10 
>>>>>> NOPs. An alternative of using bpf_load_hdr_opt and calling it 
>>>>>> three times doesn't look good either, because it will be about 
>>>>>> three times slower than going over the options once. So maybe 
>>>>>> having a helper for that is better than trying to fit it into BPF?
>>>>>>
>>>>>> One more interesting fact is the time that it takes for the 
>>>>>> verifier to check my program. If it's limited to 10 iterations, it 
>>>>>> does it pretty fast, but if I try to increase the number to 11 
>>>>>> iterations, it takes several minutes for the verifier to reach 1 
>>>>>> million instructions and print the error then. I also tried 
>>>>>> grouping the NOPs in an inner loop to count only 10 real options, 
>>>>>> and the verifier has been running for a few hours without any 
>>>>>> response. Is it normal? 
>>>>>
>>>>> Maxim, this may expose a verifier bug. Do you have a reproducer I 
>>>>> can access? I would like to debug this to see what is the root 
>>>>> cause. Thanks!
>>>>
>>>> Thanks, I appreciate your help in debugging it. The reproducer is 
>>>> based on the modified XDP program from patch 10 in this series. 
>>>> You'll need to apply at least patches 6, 7, 8 from this series to 
>>>> get new BPF helpers needed for the XDP program (tell me if that's a 
>>>> problem, I can try to remove usage of new helpers, but it will 
>>>> affect the program length and may produce different results in the 
>>>> verifier).
>>>>
>>>> See the C code of the program that passes the verifier (compiled 
>>>> with clang version 12.0.0-1ubuntu1) in the bottom of this email. If 
>>>> you increase the loop boundary from 10 to at least 11 in 
>>>> cookie_init_timestamp_raw(), it fails the verifier after a few minutes. 
>>>
>>> I tried to reproduce with latest llvm (llvm-project repo),
>>> loop boundary 10 is okay and 11 exceeds the 1M complexity limit. For 10,
>>> the number of verified instructions is 563626 (more than 0.5M) so it is
>>> totally possible that one more iteration just blows past the limit.
>>
>> So, does it mean that the verifying complexity grows exponentially 
>> with increasing the number of loop iterations (options parsed)?
> 
> Depending on verification-time pruning results, it is possible that a 
> slight increase in the number of branches results in quite a large 
> increase (2x, 4x, etc.) in to-be-verified dynamic instructions.

Is it at least theoretically possible to make this coefficient below 2x? 
I.e. write a loop, so that adding another iteration will not double the 
number of verified instructions, but will have a smaller increase?

If that's not possible, then it looks like BPF can't have loops bigger 
than ~19 iterations (2^20 > 1M), and this function is not implementable 
in BPF.

>>
>> Is it a good enough reason to keep this code as a BPF helper, rather 
>> than trying to fit it into the BPF program?
> 
> Another option is to use global function, which is verified separately
> from the main bpf program.

Simply removing __always_inline didn't change anything. Do I need to 
make any other changes? Will it make sense to call a global function in 
a loop, i.e. will it increase chances to pass the verifier?

>>
>>>
>>>> If you apply this tiny change, it fails the verifier after about 3 
>>>> hours:
>>>>
> [...]


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-26 16:50                   ` Maxim Mikityanskiy
@ 2021-11-26 17:07                     ` Yonghong Song
  2021-11-29 17:51                       ` Maxim Mikityanskiy
  0 siblings, 1 reply; 48+ messages in thread
From: Yonghong Song @ 2021-11-26 17:07 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Toke Høiland-Jørgensen, Lorenz Bauer,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux



On 11/26/21 8:50 AM, Maxim Mikityanskiy wrote:
> On 2021-11-26 07:43, Yonghong Song wrote:
>>
>>
>> On 11/25/21 6:34 AM, Maxim Mikityanskiy wrote:
>>> On 2021-11-09 09:11, Yonghong Song wrote:
>>>>
>>>>
>>>> On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
>>>>> On 2021-11-03 04:10, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>>>>>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>>>>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>>>>>>
>>>>>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 
>>>>>>>>>> *tsval, __be32 *tsecr)
>>>>>>>>>
>>>>>>>>> I'm probably missing context. Is there something in this 
>>>>>>>>> function that
>>>>>>>>> means you can't implement it in BPF?
>>>>>>>>
>>>>>>>> I was about to reply with some other comments but upon closer 
>>>>>>>> inspection
>>>>>>>> I ended up at the same conclusion: this helper doesn't seem to 
>>>>>>>> be needed
>>>>>>>> at all?
>>>>>>>
>>>>>>> After trying to put this code into BPF (replacing the underlying 
>>>>>>> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues 
>>>>>>> with passing the verifier.
>>>>>>>
>>>>>>> In addition to comparing ptr to end, I had to add checks that 
>>>>>>> compare ptr to data_end, because the verifier can't deduce that 
>>>>>>> end <= data_end. More branches will add a certain slowdown (not 
>>>>>>> measured).
>>>>>>>
>>>>>>> A more serious issue is the overall program complexity. Even 
>>>>>>> though the loop over the TCP options has an upper bound, and the 
>>>>>>> pointer advances by at least one byte every iteration, I had to 
>>>>>>> limit the total number of iterations artificially. The maximum 
>>>>>>> number of iterations that makes the verifier happy is 10. With 
>>>>>>> more iterations, I have the following error:
>>>>>>>
>>>>>>> BPF program is too large. Processed 1000001 insn
>>>>>>>
>>>>>>>                         processed 1000001 insns (limit 1000000) 
>>>>>>> max_states_per_insn 29 total_states 35489 peak_states 596 
>>>>>>> mark_read 45
>>>>>>>
>>>>>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>>>>>>> accumulated amount of instructions that the verifier can process 
>>>>>>> in all branches, is that right? It doesn't look realistic that my 
>>>>>>> program can run 1 million instructions in a single run, but it 
>>>>>>> might be that if you take all possible flows and add up the 
>>>>>>> instructions from these flows, it will exceed 1 million.
>>>>>>>
>>>>>>> The limitation of maximum 10 TCP options might be not enough, 
>>>>>>> given that valid packets are permitted to include more than 10 
>>>>>>> NOPs. An alternative of using bpf_load_hdr_opt and calling it 
>>>>>>> three times doesn't look good either, because it will be about 
>>>>>>> three times slower than going over the options once. So maybe 
>>>>>>> having a helper for that is better than trying to fit it into BPF?
>>>>>>>
>>>>>>> One more interesting fact is the time that it takes for the 
>>>>>>> verifier to check my program. If it's limited to 10 iterations, 
>>>>>>> it does it pretty fast, but if I try to increase the number to 11 
>>>>>>> iterations, it takes several minutes for the verifier to reach 1 
>>>>>>> million instructions and print the error then. I also tried 
>>>>>>> grouping the NOPs in an inner loop to count only 10 real options, 
>>>>>>> and the verifier has been running for a few hours without any 
>>>>>>> response. Is it normal? 
>>>>>>
>>>>>> Maxim, this may expose a verifier bug. Do you have a reproducer I 
>>>>>> can access? I would like to debug this to see what is the root 
>>>>>> cause. Thanks!
>>>>>
>>>>> Thanks, I appreciate your help in debugging it. The reproducer is 
>>>>> based on the modified XDP program from patch 10 in this series. 
>>>>> You'll need to apply at least patches 6, 7, 8 from this series to 
>>>>> get new BPF helpers needed for the XDP program (tell me if that's a 
>>>>> problem, I can try to remove usage of new helpers, but it will 
>>>>> affect the program length and may produce different results in the 
>>>>> verifier).
>>>>>
>>>>> See the C code of the program that passes the verifier (compiled 
>>>>> with clang version 12.0.0-1ubuntu1) in the bottom of this email. If 
>>>>> you increase the loop boundary from 10 to at least 11 in 
>>>>> cookie_init_timestamp_raw(), it fails the verifier after a few 
>>>>> minutes. 
>>>>
>>>> I tried to reproduce with latest llvm (llvm-project repo),
>>>> loop boundary 10 is okay and 11 exceeds the 1M complexity limit. For 
>>>> 10,
>>>> the number of verified instructions is 563626 (more than 0.5M) so it is
>>>> totally possible that one more iteration just blows past the limit.
>>>
>>> So, does it mean that the verifying complexity grows exponentially 
>>> with increasing the number of loop iterations (options parsed)?
>>
>> Depending on verification-time pruning results, it is possible that a 
>> slight increase in the number of branches results in quite a large 
>> increase (2x, 4x, etc.) in to-be-verified dynamic instructions.
> 
> Is it at least theoretically possible to make this coefficient below 2x? 
> I.e. write a loop, so that adding another iteration will not double the 
> number of verified instructions, but will have a smaller increase?
> 
> If that's not possible, then it looks like BPF can't have loops bigger 
> than ~19 iterations (2^20 > 1M), and this function is not implementable 
> in BPF.

That is the worst case. As I mentioned, pruning plays a huge role in 
verification. With effective pruning, going from 19 iterations to 20 may 
add only a small number of dynamic instructions. But we have to look at 
the verifier log to find out whether pruning is less effective here or 
whether something else is going on... Based on my experience, pruning is 
quite effective in most cases, but occasionally it is not... You can 
look at the verifier.c file to roughly understand how pruning works.

I am not sure whether, in this case, this is due to less effective 
pruning or whether we inherently have to go through all these dynamic 
instructions for verification.

> 
>>>
>>> Is it a good enough reason to keep this code as a BPF helper, rather 
>>> than trying to fit it into the BPF program?
>>
>> Another option is to use global function, which is verified separately
>> from the main bpf program.
> 
> Simply removing __always_inline didn't change anything. Do I need to 
> make any other changes? Will it make sense to call a global function in 
> a loop, i.e. will it increase chances to pass the verifier?

A global function cannot be a static function. You can try either 
calling a global function inside the loop or a global function that 
contains the loop. It is probably more effective to put the loop inside 
the global function. You have to do some experiments to see which one 
is better.
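The "loop inside the global function" split might look roughly like this 
plain-C analogue (the function names are illustrative, not from the 
patch series; in actual BPF the callee must have non-static linkage and 
is then verified once, on its own, instead of being re-explored for 
every caller state):

```c
/* "Global function containing the loop": the bounded loop lives in a
 * non-static, non-inlined function. In BPF terms, the verifier checks
 * this function separately, with its arguments treated abstractly. */
__attribute__((noinline))
int sum_option_bytes(const unsigned char *buf, int len)
{
	int sum = 0;

	for (int i = 0; i < len; i++)	/* bounded loop in the callee */
		sum += buf[i];
	return sum;
}

/* The caller (the "main BPF program") just invokes the global function
 * once, so its own verification cost stays small. */
int handle_packet(const unsigned char *buf, int len)
{
	if (len <= 0)
		return -1;
	return sum_option_bytes(buf, len);
}
```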

> 
>>>
>>>>
>>>>> If you apply this tiny change, it fails the verifier after about 3 
>>>>> hours:
>>>>>
>> [...]
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-26 17:07                     ` Yonghong Song
@ 2021-11-29 17:51                       ` Maxim Mikityanskiy
  2021-12-01  6:39                         ` Yonghong Song
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Mikityanskiy @ 2021-11-29 17:51 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Toke Høiland-Jørgensen, Lorenz Bauer,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux

On 2021-11-26 19:07, Yonghong Song wrote:
> 
> 
> On 11/26/21 8:50 AM, Maxim Mikityanskiy wrote:
>> On 2021-11-26 07:43, Yonghong Song wrote:
>>>
>>>
>>> On 11/25/21 6:34 AM, Maxim Mikityanskiy wrote:
>>>> On 2021-11-09 09:11, Yonghong Song wrote:
>>>>>
>>>>>
>>>>> On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
>>>>>> On 2021-11-03 04:10, Yonghong Song wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>>>>>>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>>>>>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>>>>>>>
>>>>>>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 
>>>>>>>>>>> *tsval, __be32 *tsecr)
>>>>>>>>>>
>>>>>>>>>> I'm probably missing context. Is there something in this 
>>>>>>>>>> function that
>>>>>>>>>> means you can't implement it in BPF?
>>>>>>>>>
>>>>>>>>> I was about to reply with some other comments but upon closer 
>>>>>>>>> inspection
>>>>>>>>> I ended up at the same conclusion: this helper doesn't seem to 
>>>>>>>>> be needed
>>>>>>>>> at all?
>>>>>>>>
>>>>>>>> After trying to put this code into BPF (replacing the underlying 
>>>>>>>> ktime_get_ns with ktime_get_mono_fast_ns), I experienced issues 
>>>>>>>> with passing the verifier.
>>>>>>>>
>>>>>>>> In addition to comparing ptr to end, I had to add checks that 
>>>>>>>> compare ptr to data_end, because the verifier can't deduce that 
>>>>>>>> end <= data_end. More branches will add a certain slowdown (not 
>>>>>>>> measured).
>>>>>>>>
>>>>>>>> A more serious issue is the overall program complexity. Even 
>>>>>>>> though the loop over the TCP options has an upper bound, and the 
>>>>>>>> pointer advances by at least one byte every iteration, I had to 
>>>>>>>> limit the total number of iterations artificially. The maximum 
>>>>>>>> number of iterations that makes the verifier happy is 10. With 
>>>>>>>> more iterations, I have the following error:
>>>>>>>>
>>>>>>>> BPF program is too large. Processed 1000001 insn
>>>>>>>>
>>>>>>>>                         processed 1000001 insns (limit 1000000) 
>>>>>>>> max_states_per_insn 29 total_states 35489 peak_states 596 
>>>>>>>> mark_read 45
>>>>>>>>
>>>>>>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>>>>>>>> accumulated amount of instructions that the verifier can process 
>>>>>>>> in all branches, is that right? It doesn't look realistic that 
>>>>>>>> my program can run 1 million instructions in a single run, but 
>>>>>>>> it might be that if you take all possible flows and add up the 
>>>>>>>> instructions from these flows, it will exceed 1 million.
>>>>>>>>
>>>>>>>> The limitation of maximum 10 TCP options might be not enough, 
>>>>>>>> given that valid packets are permitted to include more than 10 
>>>>>>>> NOPs. An alternative of using bpf_load_hdr_opt and calling it 
>>>>>>>> three times doesn't look good either, because it will be about 
>>>>>>>> three times slower than going over the options once. So maybe 
>>>>>>>> having a helper for that is better than trying to fit it into BPF?
>>>>>>>>
>>>>>>>> One more interesting fact is the time that it takes for the 
>>>>>>>> verifier to check my program. If it's limited to 10 iterations, 
>>>>>>>> it does it pretty fast, but if I try to increase the number to 
>>>>>>>> 11 iterations, it takes several minutes for the verifier to 
>>>>>>>> reach 1 million instructions and print the error then. I also 
>>>>>>>> tried grouping the NOPs in an inner loop to count only 10 real 
>>>>>>>> options, and the verifier has been running for a few hours 
>>>>>>>> without any response. Is it normal? 
>>>>>>>
>>>>>>> Maxim, this may expose a verifier bug. Do you have a reproducer I 
>>>>>>> can access? I would like to debug this to see what is the root 
>>>>>>> cause. Thanks!
>>>>>>
>>>>>> Thanks, I appreciate your help in debugging it. The reproducer is 
>>>>>> based on the modified XDP program from patch 10 in this series. 
>>>>>> You'll need to apply at least patches 6, 7, 8 from this series to 
>>>>>> get new BPF helpers needed for the XDP program (tell me if that's 
>>>>>> a problem, I can try to remove usage of new helpers, but it will 
>>>>>> affect the program length and may produce different results in the 
>>>>>> verifier).
>>>>>>
>>>>>> See the C code of the program that passes the verifier (compiled 
>>>>>> with clang version 12.0.0-1ubuntu1) in the bottom of this email. 
>>>>>> If you increase the loop boundary from 10 to at least 11 in 
>>>>>> cookie_init_timestamp_raw(), it fails the verifier after a few 
>>>>>> minutes. 
>>>>>
>>>>> I tried to reproduce with latest llvm (llvm-project repo),
>>>>> loop boundary 10 is okay and 11 exceeds the 1M complexity limit. 
>>>>> For 10,
>>>>> the number of verified instructions is 563626 (more than 0.5M) so 
>>>>> it is
>>>>> totally possible that one more iteration just blows past the limit.
>>>>
>>>> So, does it mean that the verifying complexity grows exponentially 
>>>> with increasing the number of loop iterations (options parsed)?
>>>
>>> Depending on verification-time pruning results, it is possible that 
>>> a slight increase in the number of branches results in quite a large 
>>> increase (2x, 4x, etc.) in to-be-verified dynamic instructions.
>>
>> Is it at least theoretically possible to make this coefficient below 
>> 2x? I.e. write a loop, so that adding another iteration will not 
>> double the number of verified instructions, but will have a smaller 
>> increase?
>>
>> If that's not possible, then it looks like BPF can't have loops bigger 
>> than ~19 iterations (2^20 > 1M), and this function is not 
>> implementable in BPF.
> 
> That is the worst case. As I mentioned, pruning plays a huge role in 
> verification. With effective pruning, going from 19 iterations to 20 
> may add only a small number of dynamic instructions. But we have to 
> look at the verifier log to find out whether pruning is less effective 
> here or whether something else is going on... Based on my experience, 
> pruning is quite effective in most cases, but occasionally it is not... 
> You can look at the verifier.c file to roughly understand how pruning 
> works.
> 
> I am not sure whether, in this case, this is due to less effective 
> pruning or whether we inherently have to go through all these dynamic 
> instructions for verification.
> 
>>
>>>>
>>>> Is it a good enough reason to keep this code as a BPF helper, rather 
>>>> than trying to fit it into the BPF program?
>>>
>>> Another option is to use global function, which is verified separately
>>> from the main bpf program.
>>
>> Simply removing __always_inline didn't change anything. Do I need to 
>> make any other changes? Will it make sense to call a global function 
>> in a loop, i.e. will it increase chances to pass the verifier?
> 
> A global function cannot be a static function. You can try either 
> calling a global function inside the loop or a global function that 
> contains the loop. It is probably more effective to put the loop inside 
> the global function. You have to do some experiments to see which one 
> is better.

Sorry for a probably noob question, but how can I pass data_end to a 
global function? I'm getting this error:

Validating cookie_init_timestamp_raw() func#1...
arg#4 reference type('UNKNOWN ') size cannot be determined: -22
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 
peak_states 0 mark_read 0

When I removed data_end, I got another one:

; opcode = ptr[0];
969: (71) r8 = *(u8 *)(r0 +0)
  R0=mem(id=0,ref_obj_id=0,off=20,imm=0) 
R1=mem(id=0,ref_obj_id=0,off=0,umin_value=4,umax_value=60,var_off=(0x0; 
0x3f),s32_min_value=0,s32_max_value=63,u32_max_value=63)
  R2=invP0 R3=invP0 R4=mem_or_null(id=6,ref_obj_id=0,off=0,imm=0) 
R5=invP0 R6=mem_or_null(id=5,ref_obj_id=0,off=0,imm=0) 
R7=mem(id=0,ref_obj_id=0,off=0,imm=0) R10=fp0 fp
-8=00000000 fp-16=invP15
invalid access to memory, mem_size=20 off=20 size=1
R0 min value is outside of the allowed memory range
processed 20 insns (limit 1000000) max_states_per_insn 0 total_states 2 
peak_states 2 mark_read 1

It looks like pointers to the context aren't supported:

https://www.spinics.net/lists/bpf/msg34907.html

 > test_global_func11 - check that CTX pointer cannot be passed

What is the standard way to pass packet data to a global function?

Thanks,
Max

>>
>>>>
>>>>>
>>>>>> If you apply this tiny change, it fails the verifier after about 3 
>>>>>> hours:
>>>>>>
>>> [...]
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-11-29 17:51                       ` Maxim Mikityanskiy
@ 2021-12-01  6:39                         ` Yonghong Song
  2021-12-01 18:06                           ` Andrii Nakryiko
  0 siblings, 1 reply; 48+ messages in thread
From: Yonghong Song @ 2021-12-01  6:39 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Toke Høiland-Jørgensen, Lorenz Bauer,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Eric Dumazet, David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI,
	David Ahern, Jesper Dangaard Brouer, Nathan Chancellor,
	Nick Desaulniers, Brendan Jackman, Florent Revest, Joe Stringer,
	Tariq Toukan, Networking, bpf, clang-built-linux



On 11/29/21 9:51 AM, Maxim Mikityanskiy wrote:
> On 2021-11-26 19:07, Yonghong Song wrote:
>>
>>
>> On 11/26/21 8:50 AM, Maxim Mikityanskiy wrote:
>>> On 2021-11-26 07:43, Yonghong Song wrote:
>>>>
>>>>
>>>> On 11/25/21 6:34 AM, Maxim Mikityanskiy wrote:
>>>>> On 2021-11-09 09:11, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
>>>>>>> On 2021-11-03 04:10, Yonghong Song wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
>>>>>>>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
>>>>>>>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
>>>>>>>>>>
>>>>>>>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32 
>>>>>>>>>>>> *tsval, __be32 *tsecr)
>>>>>>>>>>>
>>>>>>>>>>> I'm probably missing context. Is there something in this 
>>>>>>>>>>> function that
>>>>>>>>>>> means you can't implement it in BPF?
>>>>>>>>>>
>>>>>>>>>> I was about to reply with some other comments but upon closer 
>>>>>>>>>> inspection
>>>>>>>>>> I ended up at the same conclusion: this helper doesn't seem to 
>>>>>>>>>> be needed
>>>>>>>>>> at all?
>>>>>>>>>
>>>>>>>>> After trying to put this code into BPF (replacing the 
>>>>>>>>> underlying ktime_get_ns with ktime_get_mono_fast_ns), I 
>>>>>>>>> experienced issues with passing the verifier.
>>>>>>>>>
>>>>>>>>> In addition to comparing ptr to end, I had to add checks that 
>>>>>>>>> compare ptr to data_end, because the verifier can't deduce that 
>>>>>>>>> end <= data_end. More branches will add a certain slowdown (not 
>>>>>>>>> measured).
>>>>>>>>>
>>>>>>>>> A more serious issue is the overall program complexity. Even 
>>>>>>>>> though the loop over the TCP options has an upper bound, and 
>>>>>>>>> the pointer advances by at least one byte every iteration, I 
>>>>>>>>> had to limit the total number of iterations artificially. The 
>>>>>>>>> maximum number of iterations that makes the verifier happy is 
>>>>>>>>> 10. With more iterations, I have the following error:
>>>>>>>>>
>>>>>>>>> BPF program is too large. Processed 1000001 insn
>>>>>>>>>
>>>>>>>>>                         processed 1000001 insns (limit 1000000) 
>>>>>>>>> max_states_per_insn 29 total_states 35489 peak_states 596 
>>>>>>>>> mark_read 45
>>>>>>>>>
>>>>>>>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the 
>>>>>>>>> accumulated amount of instructions that the verifier can 
>>>>>>>>> process in all branches, is that right? It doesn't look 
>>>>>>>>> realistic that my program can run 1 million instructions in a 
>>>>>>>>> single run, but it might be that if you take all possible flows 
>>>>>>>>> and add up the instructions from these flows, it will exceed 1 
>>>>>>>>> million.
>>>>>>>>>
>>>>>>>>> The limitation of maximum 10 TCP options might be not enough, 
>>>>>>>>> given that valid packets are permitted to include more than 10 
>>>>>>>>> NOPs. An alternative of using bpf_load_hdr_opt and calling it 
>>>>>>>>> three times doesn't look good either, because it will be about 
>>>>>>>>> three times slower than going over the options once. So maybe 
>>>>>>>>> having a helper for that is better than trying to fit it into BPF?
>>>>>>>>>
>>>>>>>>> One more interesting fact is the time that it takes for the 
>>>>>>>>> verifier to check my program. If it's limited to 10 iterations, 
>>>>>>>>> it does it pretty fast, but if I try to increase the number to 
>>>>>>>>> 11 iterations, it takes several minutes for the verifier to 
>>>>>>>>> reach 1 million instructions and print the error then. I also 
>>>>>>>>> tried grouping the NOPs in an inner loop to count only 10 real 
>>>>>>>>> options, and the verifier has been running for a few hours 
>>>>>>>>> without any response. Is it normal? 
>>>>>>>>
>>>>>>>> Maxim, this may expose a verifier bug. Do you have a reproducer 
>>>>>>>> I can access? I would like to debug this to see what is the root 
>>>>>>>> cause. Thanks!
>>>>>>>
>>>>>>> Thanks, I appreciate your help in debugging it. The reproducer is 
>>>>>>> based on the modified XDP program from patch 10 in this series. 
>>>>>>> You'll need to apply at least patches 6, 7, 8 from this series to 
>>>>>>> get new BPF helpers needed for the XDP program (tell me if that's 
>>>>>>> a problem, I can try to remove usage of new helpers, but it will 
>>>>>>> affect the program length and may produce different results in 
>>>>>>> the verifier).
>>>>>>>
>>>>>>> See the C code of the program that passes the verifier (compiled 
>>>>>>> with clang version 12.0.0-1ubuntu1) in the bottom of this email. 
>>>>>>> If you increase the loop boundary from 10 to at least 11 in 
>>>>>>> cookie_init_timestamp_raw(), it fails the verifier after a few 
>>>>>>> minutes. 
>>>>>>
>>>>>> I tried to reproduce with latest llvm (llvm-project repo),
>>>>>> loop boundary 10 is okay and 11 exceeds the 1M complexity limit. 
>>>>>> For 10,
>>>>>> the number of verified instructions is 563626 (more than 0.5M) so 
>>>>>> it is
>>>>>> totally possible that one more iteration just blows past the limit.
>>>>>
>>>>> So, does it mean that the verifying complexity grows exponentially 
>>>>> with increasing the number of loop iterations (options parsed)?
>>>>
> >>>> Depending on verification-time pruning results, it is possible 
> >>>> that a slight increase in the number of branches results in quite 
> >>>> a large increase (2x, 4x, etc.) in to-be-verified dynamic 
> >>>> instructions.
>>>
>>> Is it at least theoretically possible to make this coefficient below 
>>> 2x? I.e. write a loop, so that adding another iteration will not 
>>> double the number of verified instructions, but will have a smaller 
>>> increase?
>>>
>>> If that's not possible, then it looks like BPF can't have loops 
>>> bigger than ~19 iterations (2^20 > 1M), and this function is not 
>>> implementable in BPF.
>>
>> That is the worst case. As I mentioned, pruning plays a huge role in 
>> verification. With effective pruning, going from 19 iterations to 20 
>> may add only a small number of dynamic instructions. But we have to 
>> look at the verifier log to find out whether pruning is less effective 
>> here or whether something else is going on... Based on my experience, 
>> pruning is quite effective in most cases, but occasionally it is 
>> not... You can look at the verifier.c file to roughly understand how 
>> pruning works.
>>
>> I am not sure whether, in this case, this is due to less effective 
>> pruning or whether we inherently have to go through all these dynamic 
>> instructions for verification.
>>
>>>
>>>>>
>>>>> Is it a good enough reason to keep this code as a BPF helper, 
>>>>> rather than trying to fit it into the BPF program?
>>>>
>>>> Another option is to use global function, which is verified separately
>>>> from the main bpf program.
>>>
>>> Simply removing __always_inline didn't change anything. Do I need to 
>>> make any other changes? Will it make sense to call a global function 
>>> in a loop, i.e. will it increase chances to pass the verifier?
>>
>> A global function cannot be a static function. You can try either 
>> calling a global function inside the loop or a global function that 
>> contains the loop. It is probably more effective to put the loop 
>> inside the global function. You have to do some experiments to see 
>> which one is better.
> 
> Sorry for a probably noob question, but how can I pass data_end to a 
> global function? I'm getting this error:
> 
> Validating cookie_init_timestamp_raw() func#1...
> arg#4 reference type('UNKNOWN ') size cannot be determined: -22
> processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 
> peak_states 0 mark_read 0
> 
> When I removed data_end, I got another one:
> 
> ; opcode = ptr[0];
> 969: (71) r8 = *(u8 *)(r0 +0)
>   R0=mem(id=0,ref_obj_id=0,off=20,imm=0) 
> R1=mem(id=0,ref_obj_id=0,off=0,umin_value=4,umax_value=60,var_off=(0x0; 
> 0x3f),s32_min_value=0,s32_max_value=63,u32_max_value=63)
>   R2=invP0 R3=invP0 R4=mem_or_null(id=6,ref_obj_id=0,off=0,imm=0) 
> R5=invP0 R6=mem_or_null(id=5,ref_obj_id=0,off=0,imm=0) 
> R7=mem(id=0,ref_obj_id=0,off=0,imm=0) R10=fp0 fp
> -8=00000000 fp-16=invP15
> invalid access to memory, mem_size=20 off=20 size=1
> R0 min value is outside of the allowed memory range
> processed 20 insns (limit 1000000) max_states_per_insn 0 total_states 2 
> peak_states 2 mark_read 1
> 
> It looks like pointers to the context aren't supported:
> 
> https://www.spinics.net/lists/bpf/msg34907.html 
> 
>  > test_global_func11 - check that CTX pointer cannot be passed
> 
> What is the standard way to pass packet data to a global function?

Since a global function is verified separately, you need to pass the 
'ctx' to the global function and redo the 'data_end' check inside it. 
This will incur some packet re-parsing overhead, similar to tail calls.
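In outline, that looks roughly like the sketch below. The struct 
definition is a minimal stand-in for the kernel's struct xdp_md so the 
example is self-contained (in a real BPF program the type comes from the 
UAPI headers, the casts are (void *)(long)ctx->data etc., and 
first_byte() is a made-up name):

```c
#include <stdint.h>

/* Minimal stand-in for the kernel's struct xdp_md (illustration only). */
struct xdp_md {
	uint64_t data;
	uint64_t data_end;
};

/* Pass the ctx, not derived packet pointers: because the global function
 * is verified separately, it must re-derive data/data_end and repeat the
 * bounds check itself before touching the packet. */
__attribute__((noinline))
int first_byte(struct xdp_md *ctx)
{
	const uint8_t *data = (const uint8_t *)(uintptr_t)ctx->data;
	const uint8_t *data_end = (const uint8_t *)(uintptr_t)ctx->data_end;

	if (data + 1 > data_end)	/* re-check bounds in the callee */
		return -1;
	return data[0];
}
```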

> 
> Thanks,
> Max
> 
>>>
>>>>>
>>>>>>
>>>>>>> If you apply this tiny change, it fails the verifier after about 
>>>>>>> 3 hours:
>>>>>>>
>>>> [...]
>>>
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp cookies in XDP
  2021-12-01  6:39                         ` Yonghong Song
@ 2021-12-01 18:06                           ` Andrii Nakryiko
  0 siblings, 0 replies; 48+ messages in thread
From: Andrii Nakryiko @ 2021-12-01 18:06 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Maxim Mikityanskiy, Toke Høiland-Jørgensen,
	Lorenz Bauer, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, Jesper Dangaard Brouer,
	Nathan Chancellor, Nick Desaulniers, Brendan Jackman,
	Florent Revest, Joe Stringer, Tariq Toukan, Networking, bpf,
	clang-built-linux

On Tue, Nov 30, 2021 at 10:40 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 11/29/21 9:51 AM, Maxim Mikityanskiy wrote:
> > On 2021-11-26 19:07, Yonghong Song wrote:
> >>
> >>
> >> On 11/26/21 8:50 AM, Maxim Mikityanskiy wrote:
> >>> On 2021-11-26 07:43, Yonghong Song wrote:
> >>>>
> >>>>
> >>>> On 11/25/21 6:34 AM, Maxim Mikityanskiy wrote:
> >>>>> On 2021-11-09 09:11, Yonghong Song wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 11/3/21 7:02 AM, Maxim Mikityanskiy wrote:
> >>>>>>> On 2021-11-03 04:10, Yonghong Song wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 11/1/21 4:14 AM, Maxim Mikityanskiy wrote:
> >>>>>>>>> On 2021-10-20 19:16, Toke Høiland-Jørgensen wrote:
> >>>>>>>>>> Lorenz Bauer <lmb@cloudflare.com> writes:
> >>>>>>>>>>
> >>>>>>>>>>>> +bool cookie_init_timestamp_raw(struct tcphdr *th, __be32
> >>>>>>>>>>>> *tsval, __be32 *tsecr)
> >>>>>>>>>>>
> >>>>>>>>>>> I'm probably missing context. Is there something in this
> >>>>>>>>>>> function that
> >>>>>>>>>>> means you can't implement it in BPF?
> >>>>>>>>>>
> >>>>>>>>>> I was about to reply with some other comments but upon closer
> >>>>>>>>>> inspection
> >>>>>>>>>> I ended up at the same conclusion: this helper doesn't seem to
> >>>>>>>>>> be needed
> >>>>>>>>>> at all?
> >>>>>>>>>
> >>>>>>>>> After trying to put this code into BPF (replacing the
> >>>>>>>>> underlying ktime_get_ns with ktime_get_mono_fast_ns), I
> >>>>>>>>> experienced issues with passing the verifier.
> >>>>>>>>>
> >>>>>>>>> In addition to comparing ptr to end, I had to add checks that
> >>>>>>>>> compare ptr to data_end, because the verifier can't deduce that
> >>>>>>>>> end <= data_end. More branches will add a certain slowdown (not
> >>>>>>>>> measured).
> >>>>>>>>>
> >>>>>>>>> A more serious issue is the overall program complexity. Even
> >>>>>>>>> though the loop over the TCP options has an upper bound, and
> >>>>>>>>> the pointer advances by at least one byte every iteration, I
> >>>>>>>>> had to limit the total number of iterations artificially. The
> >>>>>>>>> maximum number of iterations that makes the verifier happy is
> >>>>>>>>> 10. With more iterations, I have the following error:
> >>>>>>>>>
> >>>>>>>>> BPF program is too large. Processed 1000001 insn
> >>>>>>>>>
> >>>>>>>>>                         processed 1000001 insns (limit 1000000)
> >>>>>>>>> max_states_per_insn 29 total_states 35489 peak_states 596
> >>>>>>>>> mark_read 45
> >>>>>>>>>
> >>>>>>>>> I assume that BPF_COMPLEXITY_LIMIT_INSNS (1 million) is the
> >>>>>>>>> accumulated amount of instructions that the verifier can
> >>>>>>>>> process in all branches, is that right? It doesn't look
> >>>>>>>>> realistic that my program can run 1 million instructions in a
> >>>>>>>>> single run, but it might be that if you take all possible flows
> >>>>>>>>> and add up the instructions from these flows, it will exceed 1
> >>>>>>>>> million.
> >>>>>>>>>
> >>>>>>>>> The limitation of maximum 10 TCP options might be not enough,
> >>>>>>>>> given that valid packets are permitted to include more than 10
> >>>>>>>>> NOPs. An alternative of using bpf_load_hdr_opt and calling it
> >>>>>>>>> three times doesn't look good either, because it will be about
> >>>>>>>>> three times slower than going over the options once. So maybe
> >>>>>>>>> having a helper for that is better than trying to fit it into BPF?
> >>>>>>>>>
> >>>>>>>>> One more interesting fact is the time that it takes for the
> >>>>>>>>> verifier to check my program. If it's limited to 10 iterations,
> >>>>>>>>> it does it pretty fast, but if I try to increase the number to
> >>>>>>>>> 11 iterations, it takes several minutes for the verifier to
> >>>>>>>>> reach 1 million instructions and print the error then. I also
> >>>>>>>>> tried grouping the NOPs in an inner loop to count only 10 real
> >>>>>>>>> options, and the verifier has been running for a few hours
> >>>>>>>>> without any response. Is it normal?
> >>>>>>>>
> >>>>>>>> Maxim, this may expose a verifier bug. Do you have a reproducer
> >>>>>>>> I can access? I would like to debug this to see what the root
> >>>>>>>> cause is. Thanks!
> >>>>>>>
> >>>>>>> Thanks, I appreciate your help in debugging it. The reproducer is
> >>>>>>> based on the modified XDP program from patch 10 in this series.
> >>>>>>> You'll need to apply at least patches 6, 7, 8 from this series to
> >>>>>>> get new BPF helpers needed for the XDP program (tell me if that's
> >>>>>>> a problem, I can try to remove usage of new helpers, but it will
> >>>>>>> affect the program length and may produce different results in
> >>>>>>> the verifier).
> >>>>>>>
> >>>>>>> See the C code of the program that passes the verifier (compiled
> >>>>>>> with clang version 12.0.0-1ubuntu1) in the bottom of this email.
> >>>>>>> If you increase the loop boundary from 10 to at least 11 in
> >>>>>>> cookie_init_timestamp_raw(), it fails the verifier after a few
> >>>>>>> minutes.
> >>>>>>
> >>>>>> I tried to reproduce with the latest llvm (llvm-project repo):
> >>>>>> loop boundary 10 is okay and 11 exceeds the 1M complexity limit.
> >>>>>> For 10, the number of verified instructions is 563626 (more than
> >>>>>> 0.5M), so it is totally possible that one more iteration just
> >>>>>> blows past the limit.
> >>>>>
> >>>>> So, does it mean that the verifying complexity grows exponentially
> >>>>> with increasing the number of loop iterations (options parsed)?
> >>>>
> >>>> Depending on verification-time pruning results, it is possible
> >>>> that slightly increasing the number of branches results in quite a
> >>>> large multiple (2x, 4x, etc.) of to-be-verified dynamic
> >>>> instructions.
> >>>
> >>> Is it at least theoretically possible to make this coefficient below
> >>> 2x? I.e. write a loop, so that adding another iteration will not
> >>> double the number of verified instructions, but will have a smaller
> >>> increase?
> >>>
> >>> If that's not possible, then it looks like BPF can't have loops
> >>> bigger than ~19 iterations (2^20 > 1M), and this function is not
> >>> implementable in BPF.
> >>
> >> This is the worst case. As I mentioned, pruning plays a huge role in
> >> verification. Effective pruning can keep the increase in dynamic
> >> instructions small when going, say, from 19 iterations to 20
> >> iterations. But we have to look at the verifier log to find out
> >> whether pruning is less effective or something else is going on...
> >> Based on my experience, in most cases pruning is quite effective,
> >> but occasionally it is not... You can look at the verifier.c file
> >> to roughly understand how pruning works.
> >>
> >> Not sure whether in this case it is due to less effective pruning or
> >> inherently we just have to go through all these dynamic instructions
> >> for verification.
> >>
> >>>
> >>>>>
> >>>>> Is it a good enough reason to keep this code as a BPF helper,
> >>>>> rather than trying to fit it into the BPF program?
> >>>>
> >>>> Another option is to use global function, which is verified separately
> >>>> from the main bpf program.
> >>>
> >>> Simply removing __always_inline didn't change anything. Do I need to
> >>> make any other changes? Will it make sense to call a global function
> >>> in a loop, i.e. will it increase chances to pass the verifier?
> >>
> >> A global function cannot be a static function. You can try either
> >> calling a global function inside the loop or putting the loop inside
> >> a global function. It is probably more effective to put the loop
> >> inside the global function. You have to do some experiments to see
> >> which one is better.
> >
> > Sorry for what is probably a noob question, but how can I pass
> > data_end to a global function? I'm getting this error:
> >
> > Validating cookie_init_timestamp_raw() func#1...
> > arg#4 reference type('UNKNOWN ') size cannot be determined: -22
> > processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
> > peak_states 0 mark_read 0
> >
> > When I removed data_end, I got another one:
> >
> > ; opcode = ptr[0];
> > 969: (71) r8 = *(u8 *)(r0 +0)
> >   R0=mem(id=0,ref_obj_id=0,off=20,imm=0)
> > R1=mem(id=0,ref_obj_id=0,off=0,umin_value=4,umax_value=60,var_off=(0x0;
> > 0x3f),s32_min_value=0,s32_max_value=63,u32_max_value=63)
> >   R2=invP0 R3=invP0 R4=mem_or_null(id=6,ref_obj_id=0,off=0,imm=0)
> > R5=invP0 R6=mem_or_null(id=5,ref_obj_id=0,off=0,imm=0)
> > R7=mem(id=0,ref_obj_id=0,off=0,imm=0) R10=fp0 fp
> > -8=00000000 fp-16=invP15
> > invalid access to memory, mem_size=20 off=20 size=1
> > R0 min value is outside of the allowed memory range
> > processed 20 insns (limit 1000000) max_states_per_insn 0 total_states 2
> > peak_states 2 mark_read 1
> >
> > It looks like pointers to the context aren't supported:
> >
> > https://www.spinics.net/lists/bpf/msg34907.html
> >
> >  > test_global_func11 - check that CTX pointer cannot be passed
> >
> > What is the standard way to pass packet data to a global function?
>
> Since a global function is separately verified, you need to pass the
> 'ctx' to the global function and do the 'data_end' check again in the
> global function. This will incur some packet re-parsing overhead,
> similar to tail calls.
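
For what it's worth, the pattern described above can be sketched in
plain userspace C. The struct below is a hypothetical stand-in for the
real BPF context (in an actual XDP program these would be the data and
data_end fields of struct xdp_md, and the callee would be a non-static
"global" function); the names are illustrative, not real kernel API:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the BPF context. In a real program the
 * verifier tracks data/data_end; here plain pointers show the flow. */
struct pkt_ctx {
    const uint8_t *data;
    const uint8_t *data_end;
};

/* A separately-verified callee cannot inherit the caller's bounds
 * proofs, so it must re-check data against data_end itself before
 * touching the packet. Returns the first byte, or -1 if out of bounds. */
int read_first_byte(struct pkt_ctx *ctx)
{
    const uint8_t *p = ctx->data;

    if (p + 1 > ctx->data_end)  /* bounds check repeated inside callee */
        return -1;
    return p[0];
}
```

This is why the re-parsing overhead mentioned above is unavoidable: the
whole ctx crosses the call boundary, and every bound must be re-proved
on the other side.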

Now that the bpf_loop() helper landed, it's another option for doing
repeated work. Please see [0].

  [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=587497&state=*
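
For readers following along, the bounded option-scan under discussion
can be sketched in plain userspace C. This is a hypothetical stand-in:
the real XDP program works on xdp_md's data/data_end, and with
bpf_loop() the loop body below would move into a callback. The bound of
10 iterations mirrors the artificial limit mentioned up-thread, and the
double bounds check (against end AND data_end) is the one the verifier
requires:

```c
#include <stdint.h>
#include <stddef.h>

#define TCPOPT_EOL 0   /* end of option list */
#define TCPOPT_NOP 1   /* single-byte padding */
#define TCPOPT_TS  8   /* timestamp option, length 10 */

#define MAX_TCPOPT_ITERATIONS 10  /* artificial bound from the discussion */

/* Scan TCP options for the timestamp option. Returns 1 if found, 0
 * otherwise. ptr/end delimit the options; data_end is the packet end,
 * checked separately because the verifier can't deduce end <= data_end. */
static int find_tsopt(const uint8_t *ptr, const uint8_t *end,
                      const uint8_t *data_end)
{
    for (int i = 0; i < MAX_TCPOPT_ITERATIONS; i++) {
        if (ptr >= end || ptr + 1 > data_end)
            break;
        uint8_t kind = ptr[0];

        if (kind == TCPOPT_EOL)
            break;
        if (kind == TCPOPT_NOP) {  /* single-byte option, no length */
            ptr++;
            continue;
        }
        if (ptr + 2 > end || ptr + 2 > data_end)
            break;
        uint8_t len = ptr[1];

        if (len < 2 || ptr + len > end || ptr + len > data_end)
            break;
        if (kind == TCPOPT_TS && len == 10)
            return 1;
        ptr += len;  /* advance past this option */
    }
    return 0;
}
```

The iteration cap also illustrates the complaint above: a packet padded
with more than MAX_TCPOPT_ITERATIONS NOPs would make this scan give up
before reaching a valid timestamp option.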

>
> >
> > Thanks,
> > Max
> >
> >>>
> >>>>>
> >>>>>>
> >>>>>>> If you apply this tiny change, it fails the verifier after about
> >>>>>>> 3 hours:
> >>>>>>>
> >>>> [...]
> >>>
> >


Thread overview: 48+ messages
2021-10-19 14:46 [PATCH bpf-next 00/10] New BPF helpers to accelerate synproxy Maxim Mikityanskiy
2021-10-19 14:46 ` [PATCH bpf-next 01/10] bpf: Use ipv6_only_sock in bpf_tcp_gen_syncookie Maxim Mikityanskiy
2021-10-19 14:46 ` [PATCH bpf-next 02/10] bpf: Support dual-stack sockets in bpf_tcp_check_syncookie Maxim Mikityanskiy
2021-10-19 14:46 ` [PATCH bpf-next 03/10] bpf: Use EOPNOTSUPP " Maxim Mikityanskiy
2021-10-19 14:46 ` [PATCH bpf-next 04/10] bpf: Make errors of bpf_tcp_check_syncookie distinguishable Maxim Mikityanskiy
2021-10-20  3:28   ` John Fastabend
2021-10-20 13:16     ` Maxim Mikityanskiy
2021-10-20 15:26       ` Lorenz Bauer
2021-10-19 14:46 ` [PATCH bpf-next 05/10] bpf: Fix documentation of th_len in bpf_tcp_{gen,check}_syncookie Maxim Mikityanskiy
2021-10-19 14:46 ` [PATCH bpf-next 06/10] bpf: Expose struct nf_conn to BPF Maxim Mikityanskiy
2021-10-19 14:46 ` [PATCH bpf-next 07/10] bpf: Add helpers to query conntrack info Maxim Mikityanskiy
2021-10-20  3:56   ` Kumar Kartikeya Dwivedi
2021-10-20  9:28     ` Florian Westphal
2021-10-20  9:48       ` Toke Høiland-Jørgensen
2021-10-20  9:58         ` Florian Westphal
2021-10-20 12:21           ` Toke Høiland-Jørgensen
2021-10-20 12:44             ` Florian Westphal
2021-10-20 20:54               ` Toke Høiland-Jørgensen
2021-10-20 22:55                 ` David Ahern
2021-10-21  7:36                 ` Florian Westphal
2021-10-20 13:18     ` Maxim Mikityanskiy
2021-10-20 19:17       ` Kumar Kartikeya Dwivedi
2021-10-20  9:46   ` Toke Høiland-Jørgensen
2021-10-19 14:46 ` [PATCH bpf-next 08/10] bpf: Add helpers to issue and check SYN cookies in XDP Maxim Mikityanskiy
2021-10-19 14:46 ` [PATCH bpf-next 09/10] bpf: Add a helper to issue timestamp " Maxim Mikityanskiy
2021-10-19 16:45   ` Eric Dumazet
2021-10-20 13:16     ` Maxim Mikityanskiy
2021-10-20 15:56   ` Lorenz Bauer
2021-10-20 16:16     ` Toke Høiland-Jørgensen
2021-10-22 16:56       ` Maxim Mikityanskiy
2021-10-27  8:34         ` Lorenz Bauer
2021-11-01 11:14       ` Maxim Mikityanskiy
2021-11-03  2:10         ` Yonghong Song
2021-11-03 14:02           ` Maxim Mikityanskiy
2021-11-09  7:11             ` Yonghong Song
2021-11-25 14:34               ` Maxim Mikityanskiy
2021-11-26  5:43                 ` Yonghong Song
2021-11-26 16:50                   ` Maxim Mikityanskiy
2021-11-26 17:07                     ` Yonghong Song
2021-11-29 17:51                       ` Maxim Mikityanskiy
2021-12-01  6:39                         ` Yonghong Song
2021-12-01 18:06                           ` Andrii Nakryiko
2021-10-19 14:46 ` [PATCH bpf-next 10/10] bpf: Add sample for raw syncookie helpers Maxim Mikityanskiy
2021-10-20 18:01   ` Joe Stringer
2021-10-21 17:19     ` Maxim Mikityanskiy
2021-10-21  1:06   ` Alexei Starovoitov
2021-10-21 17:31     ` Maxim Mikityanskiy
2021-10-21 18:50       ` Alexei Starovoitov
