bpf.vger.kernel.org archive mirror
* [PATCH bpf-next 00/10] BPF TCP header options
@ 2020-06-26 17:55 Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

The earlier BPF-TCP-CC effort allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities for
a faster turnaround time in testing and releasing new congestion
control ideas to production environments.

The same flexibility can be extended to writing TCP header options.
It is not uncommon that people want to test a new TCP header option
to improve TCP performance.  Another use case is data centers,
which have a more controlled environment and more flexibility in
putting header options on internal traffic only.
    
This patch set introduces the necessary BPF logic and API to
allow a bpf program (BPF_PROG_TYPE_SOCK_OPS) to write and parse
TCP options under the experimental kind (254) and 16-bit magic
(0xeB9F).  The experimental kind (254) usage is defined in RFC 6994.

There are also some changes to TCP, mostly to provide the needed
sk and skb info to the bpf program to make decisions.

Patch 4 is the main patch and has more details on the API and design.

The set ends with an example that sends the max delay ack in
the BPF TCP header option; the receiving side can
then adjust its RTO accordingly.
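
For a quick feel of the API, a minimal sock_ops prog using the two
new helpers from patch 4 could look like the sketch below.  The
2-byte payload is only an illustration; see the selftests for the
real max-delay-ack example.

  /* SPDX-License-Identifier: GPL-2.0 */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("sockops")
  int write_bpf_opt(struct bpf_sock_ops *skops)
  {
  	__u16 opt_payload = 10;	/* hypothetical: max delay ack in ms */

  	switch (skops->op) {
  	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
  		/* reserve payload bytes only; the kernel accounts
  		 * for the kind/len/magic (4 bytes) itself
  		 */
  		bpf_reserve_hdr_opt(skops, sizeof(opt_payload), 0);
  		break;
  	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
  		/* offset 0 points right after the 16-bit magic */
  		bpf_store_hdr_opt(skops, 0, &opt_payload,
  				  sizeof(opt_payload), 0);
  		break;
  	}

  	return 1;
  }

  char _license[] SEC("license") = "GPL";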

Martin KaFai Lau (10):
  tcp: Use a struct to represent a saved_syn
  tcp: bpf: Parse BPF experimental header option
  bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8
  bpf: tcp: Allow bpf prog to write and parse BPF TCP header option
  bpf: selftests: A few improvements to network_helpers.c
  bpf: selftests: Add fastopen_connect to network_helpers
  bpf: selftests: Restore netns after each test
  bpf: selftests: tcp header options
  tcp: bpf: Add TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN to bpf_setsockopt
  bpf: selftest: Add test for TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN

 include/linux/bpf-cgroup.h                    |  25 +
 include/linux/filter.h                        |  10 +-
 include/linux/tcp.h                           |  11 +-
 include/net/inet_connection_sock.h            |   2 +
 include/net/request_sock.h                    |   8 +-
 include/net/tcp.h                             |  58 +-
 include/uapi/linux/bpf.h                      | 189 ++++-
 net/core/filter.c                             | 236 +++++-
 net/ipv4/tcp.c                                |  13 +-
 net/ipv4/tcp_fastopen.c                       |   2 +-
 net/ipv4/tcp_input.c                          |  99 ++-
 net/ipv4/tcp_ipv4.c                           |   4 +-
 net/ipv4/tcp_minisocks.c                      |   1 +
 net/ipv4/tcp_output.c                         | 188 ++++-
 net/ipv6/tcp_ipv6.c                           |   4 +-
 tools/include/uapi/linux/bpf.h                | 189 ++++-
 tools/testing/selftests/bpf/network_helpers.c | 182 +++--
 tools/testing/selftests/bpf/network_helpers.h |  11 +-
 .../bpf/prog_tests/cgroup_skb_sk_lookup.c     |  12 +-
 .../bpf/prog_tests/connect_force_port.c       |  10 +-
 .../bpf/prog_tests/load_bytes_relative.c      |   4 +-
 .../bpf/prog_tests/tcp_hdr_options.c          | 522 +++++++++++++
 .../selftests/bpf/prog_tests/tcp_rtt.c        |   4 +-
 .../bpf/progs/test_tcp_hdr_options.c          | 708 ++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.c      |  21 +
 tools/testing/selftests/bpf/test_progs.h      |   2 +
 .../selftests/bpf/test_tcp_hdr_options.h      |  34 +
 27 files changed, 2426 insertions(+), 123 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/test_tcp_hdr_options.h

-- 
2.24.1



* [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-27 17:41   ` Eric Dumazet
  2020-06-26 17:55 ` [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option Martin KaFai Lau
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

The total length of the saved syn packet is currently stored in
the first 4 bytes (u32), and the actual packet data is stored after that.

A later patch will also want to store an offset (bpf_hdr_opt_off) to
a TCP header option that the bpf program is interested in parsing.
Instead of anonymously storing this offset in the second 4 bytes,
this patch creates a struct for the existing saved_syn.
It gives a readable name to the stored length instead of implicitly
using the first few u32(s) to do that.

The new TCP bpf header offset (bpf_hdr_opt_off), added in a later patch,
is an offset from the TCP header instead of from the network header.
That makes the bpf programming side easier.  Thus, this patch stores
the network header length instead of the total length of the syn
header.  The total length can be obtained as
"network header len + tcp_hdrlen".  The later patch can
then also get the offset to the TCP bpf header option as
"network header len + bpf_hdr_opt_off".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/tcp.h        | 11 ++++++++++-
 include/net/request_sock.h |  7 ++++++-
 net/core/filter.c          |  4 ++--
 net/ipv4/tcp.c             |  9 +++++----
 net/ipv4/tcp_input.c       | 12 ++++++------
 5 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 3bdec31ce8f4..9d50132d95e6 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -404,7 +404,7 @@ struct tcp_sock {
 	 * socket. Used to retransmit SYNACKs etc.
 	 */
 	struct request_sock __rcu *fastopen_rsk;
-	u32	*saved_syn;
+	struct saved_syn *saved_syn;
 };
 
 enum tsq_enum {
@@ -482,6 +482,15 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
 	tp->saved_syn = NULL;
 }
 
+static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn)
+{
+	const struct tcphdr *th;
+
+	th = (void *)saved_syn->data + saved_syn->network_hdrlen;
+
+	return saved_syn->network_hdrlen + __tcp_hdrlen(th);
+}
+
 struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk);
 
 static inline u16 tcp_mss_clamp(const struct tcp_sock *tp, u16 mss)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index cf8b33213bbc..d77237ec9fb4 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -41,6 +41,11 @@ struct request_sock_ops {
 
 int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
 
+struct saved_syn {
+	u32 network_hdrlen;
+	u8 data[];
+};
+
 /* struct request_sock - mini sock to represent a connection request
  */
 struct request_sock {
@@ -60,7 +65,7 @@ struct request_sock {
 	struct timer_list		rsk_timer;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
-	u32				*saved_syn;
+	struct saved_syn		*saved_syn;
 	u32				secid;
 	u32				peer_secid;
 };
diff --git a/net/core/filter.c b/net/core/filter.c
index c796e141ea8e..19dbcc8448d8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4522,9 +4522,9 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname,
 			tp = tcp_sk(sk);
 
 			if (optlen <= 0 || !tp->saved_syn ||
-			    optlen > tp->saved_syn[0])
+			    optlen > tcp_saved_syn_len(tp->saved_syn))
 				goto err_clear;
-			memcpy(optval, tp->saved_syn + 1, optlen);
+			memcpy(optval, tp->saved_syn->data, optlen);
 			break;
 		default:
 			goto err_clear;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de36c91d32ea..60093a211f4d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3805,20 +3805,21 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 
 		lock_sock(sk);
 		if (tp->saved_syn) {
-			if (len < tp->saved_syn[0]) {
-				if (put_user(tp->saved_syn[0], optlen)) {
+			if (len < tcp_saved_syn_len(tp->saved_syn)) {
+				if (put_user(tcp_saved_syn_len(tp->saved_syn),
+					     optlen)) {
 					release_sock(sk);
 					return -EFAULT;
 				}
 				release_sock(sk);
 				return -EINVAL;
 			}
-			len = tp->saved_syn[0];
+			len = tcp_saved_syn_len(tp->saved_syn);
 			if (put_user(len, optlen)) {
 				release_sock(sk);
 				return -EFAULT;
 			}
-			if (copy_to_user(optval, tp->saved_syn + 1, len)) {
+			if (copy_to_user(optval, tp->saved_syn->data, len)) {
 				release_sock(sk);
 				return -EFAULT;
 			}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12fda8f27b08..eb0e32b2def9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6557,13 +6557,13 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 {
 	if (tcp_sk(sk)->save_syn) {
 		u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
-		u32 *copy;
+		struct saved_syn *saved_syn;
 
-		copy = kmalloc(len + sizeof(u32), GFP_ATOMIC);
-		if (copy) {
-			copy[0] = len;
-			memcpy(&copy[1], skb_network_header(skb), len);
-			req->saved_syn = copy;
+		saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
+		if (saved_syn) {
+			saved_syn->network_hdrlen = skb_network_header_len(skb);
+			memcpy(saved_syn->data, skb_network_header(skb), len);
+			req->saved_syn = saved_syn;
 		}
 	}
 }
-- 
2.24.1



* [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-27 16:44   ` Eric Dumazet
  2020-06-27 17:17   ` Eric Dumazet
  2020-06-26 17:55 ` [PATCH bpf-next 03/10] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 Martin KaFai Lau
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch adds logic to parse the experimental kind 254 with the
16-bit magic 0xeB9F.  A later patch will allow a bpf prog to write
and parse data under this experimental kind and magic.

A one-byte bpf_hdr_opt_off is added to tcp_skb_cb by using an existing
4-byte hole.  It is only used in rx.  It stores the offset to the
bpf experimental option and will be made available to the BPF prog
in a later patch.  This offset is also stored in the saved_syn.
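
A sketch of what the recorded offset points at, given the received
skb's TCP header "th":

  u8 *opt = (u8 *)th + TCP_SKB_CB(skb)->bpf_hdr_opt_off;
  /* opt[0] == 254 (TCPOPT_EXP), opt[1] == kind length,
   * opt[2..3] == 0xeB9F (TCPOPT_BPF_MAGIC), payload follows
   */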

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/request_sock.h | 1 +
 include/net/tcp.h          | 3 +++
 net/ipv4/tcp_input.c       | 6 ++++++
 net/ipv4/tcp_ipv4.c        | 1 +
 net/ipv6/tcp_ipv6.c        | 1 +
 5 files changed, 12 insertions(+)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index d77237ec9fb4..55297286c066 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -43,6 +43,7 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
 
 struct saved_syn {
 	u32 network_hdrlen;
+	u32 bpf_hdr_opt_off;
 	u8 data[];
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eab1c7d0facb..07a9dfe35242 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -191,6 +191,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
  */
 #define TCPOPT_FASTOPEN_MAGIC	0xF989
 #define TCPOPT_SMC_MAGIC	0xE2D4C3D9
+#define TCPOPT_BPF_MAGIC	0xEB9F
 
 /*
  *     TCP option lengths
@@ -204,6 +205,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_FASTOPEN_BASE  2
 #define TCPOLEN_EXP_FASTOPEN_BASE  4
 #define TCPOLEN_EXP_SMC_BASE   6
+#define TCPOLEN_EXP_BPF_BASE   4
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -857,6 +859,7 @@ struct tcp_skb_cb {
 			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
 			unused:5;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
+	__u8            bpf_hdr_opt_off;/* offset to bpf hdr option. rx only. */
 	union {
 		struct {
 			/* There is space for up to 24 bytes */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb0e32b2def9..640408a80b3d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3924,6 +3924,10 @@ void tcp_parse_options(const struct net *net,
 					tcp_parse_fastopen_option(opsize -
 						TCPOLEN_EXP_FASTOPEN_BASE,
 						ptr + 2, th->syn, foc, true);
+				else if (opsize >= TCPOLEN_EXP_BPF_BASE &&
+					 get_unaligned_be16(ptr) ==
+					 TCPOPT_BPF_MAGIC)
+					TCP_SKB_CB(skb)->bpf_hdr_opt_off = (ptr - 2) - (unsigned char *)th;
 				else
 					smc_parse_options(th, opt_rx, ptr,
 							  opsize);
@@ -6562,6 +6566,8 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 		saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
 		if (saved_syn) {
 			saved_syn->network_hdrlen = skb_network_header_len(skb);
+			saved_syn->bpf_hdr_opt_off =
+				TCP_SKB_CB(skb)->bpf_hdr_opt_off;
 			memcpy(saved_syn->data, skb_network_header(skb), len);
 			req->saved_syn = saved_syn;
 		}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea0df9fd7618..a3535b7fe002 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1864,6 +1864,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 	TCP_SKB_CB(skb)->sacked	 = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
 			skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
+	TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
 }
 
 /*
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f67d45ff00b4..8356d0562279 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1545,6 +1545,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
 			skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
+	TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
 }
 
 INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
-- 
2.24.1



* [PATCH bpf-next 03/10] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option Martin KaFai Lau
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

A later patch needs to add a few pointers and a few u8 members to
sock_ops_kern.  Hence, this patch saves some space by moving
some of the existing members from u32 to u8 so that the later
patch can still fit everything in a cacheline.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h |  4 ++--
 net/core/filter.c      | 15 ++++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 259377723603..aae52cbda670 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1236,13 +1236,13 @@ struct bpf_sock_addr_kern {
 
 struct bpf_sock_ops_kern {
 	struct	sock *sk;
-	u32	op;
 	union {
 		u32 args[4];
 		u32 reply;
 		u32 replylong[4];
 	};
-	u32	is_fullsock;
+	u8	op;
+	u8	is_fullsock;
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
 					 * the BPF program. New fields that
diff --git a/net/core/filter.c b/net/core/filter.c
index 19dbcc8448d8..1dd07972c5c7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8355,17 +8355,22 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 		return insn - insn_buf;
 
 	switch (si->off) {
-	case offsetof(struct bpf_sock_ops, op) ...
+	case offsetof(struct bpf_sock_ops, op):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       op),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, op));
+		break;
+
+	case offsetof(struct bpf_sock_ops, replylong[0]) ...
 	     offsetof(struct bpf_sock_ops, replylong[3]):
-		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, op) !=
-			     sizeof_field(struct bpf_sock_ops_kern, op));
 		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, reply) !=
 			     sizeof_field(struct bpf_sock_ops_kern, reply));
 		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, replylong) !=
 			     sizeof_field(struct bpf_sock_ops_kern, replylong));
 		off = si->off;
-		off -= offsetof(struct bpf_sock_ops, op);
-		off += offsetof(struct bpf_sock_ops_kern, op);
+		off -= offsetof(struct bpf_sock_ops, replylong[0]);
+		off += offsetof(struct bpf_sock_ops_kern, replylong[0]);
 		if (type == BPF_WRITE)
 			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
 					      off);
-- 
2.24.1



* [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2020-06-26 17:55 ` [PATCH bpf-next 03/10] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-28 18:24   ` Alexei Starovoitov
  2020-06-26 17:55 ` [PATCH bpf-next 05/10] bpf: selftests: A few improvements to network_helpers.c Martin KaFai Lau
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

The earlier BPF-TCP-CC effort allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities for
a faster turnaround time in testing and releasing new congestion
control ideas to production environments.

The same flexibility can be extended to writing TCP header options.
It is not uncommon that people want to test a new TCP header option
to improve TCP performance.  Another use case is data centers,
which have a more controlled environment and more flexibility in
putting header options for internal use only.

For example, we want to test the idea of putting the maximum delay
ACK in a TCP header option, which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and uses it in the
TCP stack to allow a BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most
TCP packets except RST.

Header Option Format
────────────────────
The bpf prog will be allowed to write options under kind 254,
which is defined as the experimental TCP option in RFC 6994.
The exact format will be:

 0                  8                 16                             31
┌──────────────────┬─────────────────┬─────────────────────────────────┐
│   Kind: 254      │     length      │      magic: 0xeB9F              │
├──────────────────┴─────────────────┴─────────────────────────────────┤
│                                                                      │
│               BPF program written data                               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

By putting it under the standard kind 254 and magic 0xeB9F, it will be
recognizable by the usual tools like tcpdump and tshark.  The kernel
can ensure the header option format is valid before sending it out to
the wire and prevent the bpf program from writing options
that duplicate or conflict with what the kernel TCP stack has
already written.

A similar experimental numbering also exists in other protocols (e.g. IPv6).
Thus, a similar idea (and API) could be extended to other layers/protocols
in the future.

As mentioned above, this patch set does not allow the bpf program to create
its own option "kind".  However, the header-writing BPF API (mainly
the helpers "bpf_reserve_hdr_opt" and "bpf_store_hdr_opt") could be extended
in the future to allow a "raw" mode (e.g. by introducing a new helper
flag).

Sockops Callback Flags
──────────────────────
The header parsing and writing callbacks can be turned on
by setting the existing callback flags "tp->bpf_sock_ops_cb_flags".
BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG are newly added.
The default is off, i.e. the bpf prog will not
be called to parse or write bpf hdr options.
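
For example, a bpf prog can turn both callbacks on from one of its
callbacks (a sketch; it assumes the prog wants to keep its other
already-enabled flags):

  bpf_sock_ops_cb_flags_set(skops,
			    skops->bpf_sock_ops_cb_flags |
			    BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG |
			    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);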

3-Way Handshake
───────────────
* Passive side

When writing the SYNACK, the received SYN skb will be available to the
bpf prog.  The bpf prog will also know if it is in syncookie
mode (want_cookie) or not.  The bpf prog can use the SYN skb (which
may carry the bpf hdr opt sent from the remote peer) to decide what
bpf header option should be written to the outgoing SYNACK skb.

The bpf prog can store the bpf header option of the received
SYN by using the existing "TCP_SAVE_SYN" setsockopt.
The example in a later patch also uses TCP_SAVE_SYN.
[ Note that the fullsock here is a listen sk; bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock would add weight
  to the minisock and is not necessarily better than storing the
  whole ~100-byte SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the bpf header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
packet as an example.  The received ack (that concludes the 3WHS)
will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB
through the sock_ops->skb_data, which could be useful in the
syncookie scenario.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn, which includes the IP[46] header and the TCP header.
A (new) BPF-only "TCP_BPF_SYN_HDR_OPT" getsockopt is added to get
the bpf header option alone (without the IP and TCP header) from the
saved syn.  The kernel remembers the offset to the bpf header option (i.e.
kind:254, magic:0xeB9F) as it parses the TCP header.  It is stored in
the tcp_skb_cb and then also saved in the "struct saved_syn".  The new
"TCP_BPF_SYN_HDR_OPT" can directly return the bpf header option to the
bpf prog instead of asking the bpf prog to parse the IP[46] header and
TCP header again just to get to the bpf header option.

In the new "TCP_BPF_SYN_HDR_OPT" getsockopt, the kernel will know
where it can get the SYN's bpf header option from:
  - the just received syn (available when the bpf prog is writing SYNACK)
  or
  - the saved syn (available in PASSIVE_ESTABLISHED_CB).

The bpf prog does not need to know where this bpf header option
is obtained from.  The "TCP_BPF_SYN_HDR_OPT" getsockopt will
hide these details.
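
A sketch of reading it back from a bpf prog (the 10-byte buffer size
is an arbitrary choice for this illustration):

  char syn_opt[10];

  if (!bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_HDR_OPT,
		      syn_opt, sizeof(syn_opt))) {
  	/* syn_opt[0] == 254, syn_opt[1] == kind length,
  	 * syn_opt[2..3] == 0xeB9F, payload starts at syn_opt[4]
  	 */
  }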

Fastopen should work the same as the regular non-fastopen case.
This is tested in a later patch.

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in the ACK.  Please refer
to that example for its details and limitations.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data.

In short, regardless of active or passive ESTABLISHED_CB,
the sock_ops->skb_data is always the received skb that
completed the 3WHS.

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.

Established Connection
──────────────────────
The bpf prog will be called as long as the parse/write
*_HDR_OPT_CB_FLAG is enabled in bpf_sock_ops_cb_flags.
That allows the bpf prog to parse/write header options
in the data, pure-ack, and fin packets.

Writing BPF Header Option
─────────────────────────
[ bpf prog context: sock_ops ]

When writing the header, the bpf prog is first called to reserve
the needed number of bytes in the skb during "HDR_OPT_LEN_CB" and
then called to write the header during "WRITE_HDR_OPT_CB".
During these two write CBs, the sock_ops->skb_* always
represents the outgoing skb.

The bpf prog is expected to use the two (new) helpers,
"bpf_reserve_hdr_opt" and "bpf_store_hdr_opt", to
reserve option space and write the option.

In cgroup MULTI mode, the max reserved space among multiple bpf progs
will be used, e.g. if prog#1 reserves 8 bytes and a later prog#2 reserves
4 bytes, 8 bytes will be reserved.
When multiple bpf progs write the bpf header option, the last
prog's header option will be used.  The "bpf_store_hdr_opt"
helper will take care of the TCP header option's kind-length.

When writing the header in "WRITE_HDR_OPT_CB", sock_ops->skb_data
is pointing to the outgoing skb.  If there is a need, the bpf prog
can inspect what has been written to the header.
sock_ops->skb_bpf_hdr_opt_off also provides the offset to the
beginning of the bpf header option (i.e. the beginning of
kind:254, magic:0xeB9F).
However, during option space reservation in "HDR_OPT_LEN_CB",
sock_ops->skb_data does not have the tcp header because
the header has not been written yet.
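
A sketch of an incremental write inside "WRITE_HDR_OPT_CB" (the
values are arbitrary; the helper updates the kind length to
4 + 3 = 7 by itself):

  __u16 first_two = 0x0102;
  __u8 third = 0x03;

  bpf_store_hdr_opt(skops, 0, &first_two, sizeof(first_two), 0);
  bpf_store_hdr_opt(skops, 2, &third, sizeof(third), 0);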

Parsing BPF Header Option
─────────────────────────

As mentioned earlier, the received SYN/SYNACK/ACK during the 3WHS
will be available to some specific CBs (e.g. the *_ESTABLISHED_CB).

For an established connection, if the kernel finds a bpf header
option (i.e. an option with kind:254 and magic:0xeB9F) and
the "PARSE_HDR_OPT_CB_FLAG" flag is set, the
bpf prog will be called with the "BPF_SOCK_OPS_PARSE_HDR_OPT_CB" op.
The received skb will be available through sock_ops->skb_data
and the bpf header option offset will also be specified
in sock_ops->skb_bpf_hdr_opt_off.
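
A parsing sketch under "BPF_SOCK_OPS_PARSE_HDR_OPT_CB" (the bounds
check against skb_data_end is required by the verifier before
touching the packet bytes):

  __u8 *opt = (__u8 *)(long)skops->skb_data +
  	      skops->skb_bpf_hdr_opt_off;
  __u8 *end = (__u8 *)(long)skops->skb_data_end;

  if (opt + 4 > end)
  	return 1;
  /* opt[0] == 254, opt[1] == kind length, opt[2..3] == 0xeB9F */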

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf-cgroup.h     |  25 ++++
 include/linux/filter.h         |   6 +
 include/net/tcp.h              |  53 ++++++++-
 include/uapi/linux/bpf.h       | 187 +++++++++++++++++++++++++++++-
 net/core/filter.c              | 202 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_fastopen.c        |   2 +-
 net/ipv4/tcp_input.c           |  79 ++++++++++++-
 net/ipv4/tcp_ipv4.c            |   3 +-
 net/ipv4/tcp_minisocks.c       |   1 +
 net/ipv4/tcp_output.c          | 186 +++++++++++++++++++++++++++++-
 net/ipv6/tcp_ipv6.c            |   3 +-
 tools/include/uapi/linux/bpf.h | 187 +++++++++++++++++++++++++++++-
 12 files changed, 914 insertions(+), 20 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index c66c545e161a..c069295564a4 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -270,6 +270,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr)			\
 	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
 
+/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a
+ * fullsock and its parent fullsock cannot be traced by
+ * sk_to_full_sk().
+ *
+ * e.g. sock_ops->sk is a request_sock and it is under syncookie mode.
+ * Its listener-sk is not attached to the rsk_listener.
+ * In this case, the caller holds the listener-sk (unlocked),
+ * set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with
+ * the listener-sk such that the cgroup-bpf-progs of the
+ * listener-sk will be run.
+ *
+ * Regardless of syncookie mode or not,
+ * calling bpf_setsockopt on listener-sk will not make sense anyway,
+ * so passing req_sk to bpf prog is appropriate here.
+ */
+#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk)			\
+({									\
+	int __ret = 0;							\
+	if (cgroup_bpf_enabled)						\
+		__ret = __cgroup_bpf_run_filter_sock_ops(sk,		\
+							 sock_ops,	\
+							 BPF_CGROUP_SOCK_OPS); \
+	__ret;								\
+})
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
 ({									       \
 	int __ret = 0;							       \
diff --git a/include/linux/filter.h b/include/linux/filter.h
index aae52cbda670..2ba008f4937c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1241,8 +1241,14 @@ struct bpf_sock_ops_kern {
 		u32 reply;
 		u32 replylong[4];
 	};
+	void	*syn_skb;
+	void	*skb;
+	void	*skb_data_end;
 	u8	op;
 	u8	is_fullsock;
+	u8	write_hdr_opt_len;
+	u8	skb_bpf_hdr_opt_off;
+	u8	reservable_hdr_opt_len;	/* Only used in HDR_OPT_LEN */
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
 					 * the BPF program. New fields that
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 07a9dfe35242..e93ef2d324f3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -395,7 +395,7 @@ void tcp_metrics_init(void);
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
-void tcp_init_transfer(struct sock *sk, int bpf_op);
+void tcp_init_transfer(struct sock *sk, bool passive, struct sk_buff *skb);
 __poll_t tcp_poll(struct file *file, struct socket *sock,
 		      struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
@@ -459,6 +459,7 @@ enum tcp_synack_type {
 };
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
+				struct sk_buff *syn_skb,
 				struct tcp_fastopen_cookie *foc,
 				enum tcp_synack_type synack_type);
 int tcp_disconnect(struct sock *sk, int flags);
@@ -2023,6 +2024,7 @@ struct tcp_request_sock_ops {
 	u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
 	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
 			   struct flowi *fl, struct request_sock *req,
+			   struct sk_buff *syn_skb,
 			   struct tcp_fastopen_cookie *foc,
 			   enum tcp_synack_type synack_type);
 };
@@ -2222,6 +2224,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		      struct msghdr *msg, int len, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
 
+#ifdef CONFIG_CGROUP_BPF
+/* Copy the listen sk's HDR_OPT_CB flags to its child.
+ *
+ * During 3-Way-HandShake, the synack is usually sent from
+ * the listen sk with the HDR_OPT_CB flags set so that
+ * bpf-prog will be called to write the BPF hdr option.
+ *
+ * In fastopen, the child sk is used to send synack instead
+ * of the listen sk.  Thus, inheriting the HDR_OPT_CB flags
+ * from the listen sk gives the bpf-prog a chance to write
+ * BPF hdr option in the synack pkt during fastopen.
+ *
+ * Both fastopen and non-fastopen children will inherit the
+ * HDR_OPT_CB flags so that the bpf-prog behaves consistently
+ * when deciding whether to clear these cb flags (or not)
+ * during the PASSIVE_ESTABLISHED_CB.
+ *
+ * In the future, other cb flags could be inherited here also.
+ */
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+	tcp_sk(child)->bpf_sock_ops_cb_flags =
+		tcp_sk(sk)->bpf_sock_ops_cb_flags &
+		(BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG |
+		 BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      u8 bpf_hdr_opt_off)
+{
+	skops->skb = skb;
+	skops->skb_data_end = skb->data + skb_headlen(skb);
+	skops->skb_bpf_hdr_opt_off = bpf_hdr_opt_off;
+}
+#else
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      u8 bpf_hdr_opt_off)
+{
+}
+#endif
+
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0cb8ec948816..479b83d05811 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3285,6 +3285,110 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, u32 offset, const void *from, u32 len, u64 flags)
+ *	Description
+ *		Store BPF header option.  The data will be copied
+ *		from buffer *from* with length *len* to the bpf header
+ *		option at *offset*.
+ *
+ *		This helper does not allow updating the header "kind",
+ *		"kind length", and the 16-bit magic.  *offset* 0
+ *		points immediately after the 16-bit magic.
+ *		This helper will automatically update the "kind length" as
+ *		the bpf header option is written by this helper.  It can
+ *		guarantee that the header option format always
+ *		conforms to the standard.
+ *
+ *		The bpf program can write everything at once
+ *		from offset 0.
+ *
+ *		The bpf program can also do incremental write by calling
+ *		this helper multiple times with incremental (but
+ *		continuous) offset.
+ *
+ *		The bpf program (e.g. the parent cgroup bpf program)
+ *		can also rewrite a few bytes that have already been
+ *		written by some earlier bpf programs.  The bpf
+ *		program can also learn what header option has been
+ *		written by the earlier bpf programs through
+ *		sock_ops->skb_data.
+ *
+ *		A bpf program can always write at offset 0 again
+ *		to overwrite all bpf header options that have been written
+ *		by the previous bpf programs.  If the bpf program
+ *		writes less than the previous bpf programs do,
+ *		the trailing bytes will be removed unless
+ *		*BPF_F_STORE_HDR_OPT_NOTRIM* is used.
+ *		e.g. bpf_store_hdr_opt(skops, 0, NULL, 0, 0) will
+ *		remove all previously written options.
+ *
+ *		This helper will reject the *offset* if there is any
+ *		byte/hole before this *offset* that has never been
+ *		written.
+ *
+ *		For example, it can write at offset 0 for 2 bytes,
+ *		then write at offset 2 for 1 byte, and then finally
+ *		write at offset 3 for 2 bytes.  Thus, the bpf prog
+ *		has written 5 bytes in total.
+ *		The "kind length" will be 9 bytes:
+ *		4 bytes (1byte kind + 1byte kind_len + 2byte magic) +
+ *		5 bytes written by the bpf program.
+ *
+ *		However, this helper does not allow writing at offset 0
+ *		for 2 bytes and then immediately writing at offset 3 without
+ *		writing at offset 2 first.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.  The previously reserved
+ *		space is passed in sock_ops->args[0].
+ *
+ *		The following flags are supported:
+ *		* **BPF_F_STORE_HDR_OPT_NOTRIM**: Usually, a bpf program
+ *		  writes with an incremental offset instead of
+ *		  jumping the offset back and forth.
+ *		  Thus, by default, after calling this helper,
+ *		  anything previously written after offset + len
+ *		  will be removed.
+ *		  This flag can be used to keep the trailing
+ *		  data after "offset + len".
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-EFAULT** *offset* + *len* is outside of the
+ *		            reserved option space or
+ *			    there is an unwritten byte before *offset*.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
+ *	Description
+ *		Reserve *len* bytes for the bpf header option.  This
+ *		space will be available to use by bpf_store_hdr_opt()
+ *		later.
+ *
+ *		If bpf_reserve_hdr_opt() is called multiple times (e.g.
+ *		by multiple cgroup bpf programs),
+ *		the maximum *len* among them will be reserved.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_HDR_OPT_LEN_CB.  The maximum
+ *		reservable space is passed in
+ *		sock_ops->args[0].
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-ENOSPC** Not enough remaining space in the header.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3427,7 +3531,9 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(store_hdr_opt),		\
+	FN(reserve_hdr_opt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -4024,6 +4130,11 @@ struct bpf_sock_ops {
 	__u64 bytes_received;
 	__u64 bytes_acked;
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, skb_data);
+	__bpf_md_ptr(void *, skb_data_end);
+	__u32 skb_len;
+	__u32 skb_bpf_hdr_opt_off;
+	__u32 skb_tcp_flags;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -4032,8 +4143,14 @@ enum {
 	BPF_SOCK_OPS_RETRANS_CB_FLAG	= (1<<1),
 	BPF_SOCK_OPS_STATE_CB_FLAG	= (1<<2),
 	BPF_SOCK_OPS_RTT_CB_FLAG	= (1<<3),
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG = (1<<4),
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<5),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xF,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x3F,
+};
+
+enum {
+	BPF_F_STORE_HDR_OPT_NOTRIM	= (1ULL << 0),
 };
 
 /* List of known BPF sock_ops operators.
@@ -4089,6 +4206,58 @@ enum {
 					 */
 	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
 					 */
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB,	/* Parse the BPF header option.
+					 * It will be called to handle
+					 * the packets received at
+					 * socket that has been established.
+					 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CB,	/* Reserve space for writing the BPF
+					 * header option later in
+					 * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+					 * Arg1: Max reservable bytes
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	Referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP payload because
+					 *	no TCP header has
+					 *	been written yet.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *      SYN, ACK, FIN).
+					 *
+					 * bpf_reserve_hdr_opt() should
+					 * be used to reserve space.
+					 */
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB,	/* Write the BPF header OPTIONS
+					 * Arg1: num of bytes reserved during
+					 *       BPF_SOCK_OPS_HDR_OPT_LEN_CB
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP header.
+					 *
+					 * sock_ops->skb_bpf_hdr_opt_off:
+					 *	offset to the bpf
+					 *	header option.  bpf
+					 *	prog can inspect what bpf
+					 *	option has already been
+					 *	written.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *	SYN, ACK, FIN).
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -4116,6 +4285,20 @@ enum {
 enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
+	/* Copy the BPF header option from syn pkt to optval
+	 *
+	 * BPF_PROG_TYPE_SOCK_OPS only.  It is similar to the
+	 * TCP_SAVED_SYN but it only gets the bpf header option
+	 * from the syn packet.
+	 *
+	 *       0: Success
+	 * -ENOSPC: Not enough space in optval. Only optlen bytes
+	 *          are copied.
+	 * -ENOMSG: BPF TCP header option not found
+	 * -ENOENT: The SYN skb is not available now and the earlier SYN pkt
+	 *	    is not saved by TCP_SAVE_SYN.
+	 */
+	TCP_BPF_SYN_HDR_OPT	= 1003,
 };
 
 struct bpf_perf_event_value {
diff --git a/net/core/filter.c b/net/core/filter.c
index 1dd07972c5c7..5784f1bede2f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4629,6 +4629,64 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
 BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
+	if (IS_ENABLED(CONFIG_INET) &&
+	    level == SOL_TCP && optname == TCP_BPF_SYN_HDR_OPT) {
+		struct sk_buff *syn_skb = bpf_sock->syn_skb;
+		u8 *kind_start, copy_len = 0;
+		int ret = 0;
+
+		if (bpf_sock->syn_skb) {
+			/* sk is a request_sock here */
+
+			if (!TCP_SKB_CB(syn_skb)->bpf_hdr_opt_off) {
+				ret = -ENOMSG;
+				goto init_unused;
+			}
+
+			kind_start = syn_skb->data +
+				TCP_SKB_CB(syn_skb)->bpf_hdr_opt_off;
+		} else {
+			struct sock *sk = bpf_sock->sk;
+			struct saved_syn *saved_syn;
+
+			if (sk->sk_state == TCP_NEW_SYN_RECV)
+				/* synack retransmit. bpf_sock->syn_skb will
+				 * not be available.
+				 */
+				saved_syn = inet_reqsk(sk)->saved_syn;
+			else
+				saved_syn = tcp_sk(sk)->saved_syn;
+
+			if (!saved_syn) {
+				ret = -ENOENT;
+				goto init_unused;
+			}
+
+			if (!saved_syn->bpf_hdr_opt_off) {
+				ret = -ENOMSG;
+				goto init_unused;
+			}
+
+			kind_start = saved_syn->data +
+				saved_syn->network_hdrlen +
+				saved_syn->bpf_hdr_opt_off;
+		}
+
+		copy_len = kind_start[1];
+		if (optlen < copy_len) {
+			copy_len = optlen;
+			ret = -ENOSPC;
+		}
+
+		memcpy(optval, kind_start, copy_len);
+
+init_unused:
+		if (optlen > copy_len)
+			memset(optval + copy_len, 0, optlen - copy_len);
+
+		return ret;
+	}
+
 	return _bpf_getsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
 
@@ -4665,6 +4723,80 @@ static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_5(bpf_sock_ops_store_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
+	   u32, offset, const void *, from, u32, len, u64, flags)
+{
+	u8 *kind_start, nr_written, max_len;
+	struct sk_buff *skb;
+
+	if (bpf_sock->op != BPF_SOCK_OPS_WRITE_HDR_OPT_CB)
+		return -EPERM;
+
+	if (flags > BPF_F_STORE_HDR_OPT_NOTRIM)
+		return -EINVAL;
+
+	max_len = bpf_sock->write_hdr_opt_len;
+	if (len > max_len || offset > max_len - len)
+		return -EFAULT;
+
+	skb = bpf_sock->skb;
+	kind_start = skb->data + bpf_sock->skb_bpf_hdr_opt_off;
+	nr_written = kind_start[1] - TCPOLEN_EXP_BPF_BASE;
+	if (offset > nr_written)
+		return -EFAULT;
+
+	if (len)
+		memcpy(kind_start + TCPOLEN_EXP_BPF_BASE + offset, from, len);
+
+	if (kind_start[1] < TCPOLEN_EXP_BPF_BASE + offset + len ||
+	    !(flags & BPF_F_STORE_HDR_OPT_NOTRIM))
+		/* Update kind_len */
+		kind_start[1] = TCPOLEN_EXP_BPF_BASE + offset + len;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_store_hdr_opt_proto = {
+	.func		= bpf_sock_ops_store_hdr_opt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	/* bpf_store_hdr_opt(skops, 0, NULL, 0, 0) will remove
+	 * all previously written options.
+	 */
+	.arg3_type	= ARG_PTR_TO_MEM_OR_NULL,
+	.arg4_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg5_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
+	   u32, len, u64, flags)
+{
+	if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB)
+		return -EPERM;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len > bpf_sock->reservable_hdr_opt_len)
+		return -ENOSPC;
+
+	if (len > bpf_sock->write_hdr_opt_len)
+		bpf_sock->write_hdr_opt_len = len;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = {
+	.func		= bpf_sock_ops_reserve_hdr_opt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly;
 EXPORT_SYMBOL_GPL(ipv6_bpf_stub);
 
@@ -6515,6 +6647,10 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_sk_storage_delete:
 		return &bpf_sk_storage_delete_proto;
 #ifdef CONFIG_INET
+	case BPF_FUNC_store_hdr_opt:
+		return &bpf_sock_ops_store_hdr_opt_proto;
+	case BPF_FUNC_reserve_hdr_opt:
+		return &bpf_sock_ops_reserve_hdr_opt_proto;
 	case BPF_FUNC_tcp_sock:
 		return &bpf_tcp_sock_proto;
 #endif /* CONFIG_INET */
@@ -7313,6 +7449,21 @@ static bool sock_ops_is_valid_access(int off, int size,
 				return false;
 			info->reg_type = PTR_TO_SOCKET_OR_NULL;
 			break;
+		case offsetof(struct bpf_sock_ops, skb_data):
+			if (size != sizeof(__u64))
+				return false;
+			info->reg_type = PTR_TO_PACKET;
+			break;
+		case offsetof(struct bpf_sock_ops, skb_data_end):
+			if (size != sizeof(__u64))
+				return false;
+			info->reg_type = PTR_TO_PACKET_END;
+			break;
+		case offsetof(struct bpf_sock_ops, skb_tcp_flags):
+		case offsetof(struct bpf_sock_ops, skb_bpf_hdr_opt_off):
+			bpf_ctx_record_field_size(info, size_default);
+			return bpf_ctx_narrow_access_ok(off, size,
+							size_default);
 		default:
 			if (size != size_default)
 				return false;
@@ -8601,6 +8752,57 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      si->dst_reg, si->src_reg,
 				      offsetof(struct bpf_sock_ops_kern, sk));
 		break;
+	case offsetof(struct bpf_sock_ops, skb_bpf_hdr_opt_off):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb_bpf_hdr_opt_off),
+				      si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_sock_ops_kern,
+						     skb_bpf_hdr_opt_off, 1,
+						     target_size));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb_data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb_data_end));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct sk_buff, data));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_len):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct sk_buff, len));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_tcp_flags):
+		off = offsetof(struct sk_buff, cb);
+		off += offsetof(struct tcp_skb_cb, tcp_flags);
+		*target_size = sizeof_field(struct tcp_skb_cb, tcp_flags);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_skb_cb,
+						       tcp_flags),
+				      si->dst_reg, si->dst_reg, off);
+		break;
 	}
 	return insn - insn_buf;
 }
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 19ad9586c720..dd70b04da52a 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -272,7 +272,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	refcount_set(&req->rsk_refcnt, 2);
 
 	/* Now finish processing the fastopen child socket. */
-	tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+	tcp_init_transfer(child, true, skb);
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 640408a80b3d..16dbf5dd46c1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -138,6 +138,69 @@ void clean_acked_data_flush(void)
 EXPORT_SYMBOL_GPL(clean_acked_data_flush);
 #endif
 
+#ifdef CONFIG_CGROUP_BPF
+static void bpf_skops_parse_bpf_hdr(struct sock *sk, struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					   BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG)))
+		return;
+
+	/* The skb will be handled in the
+	 * BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB or
+	 * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB.
+	 */
+	switch (sk->sk_state) {
+	case TCP_SYN_RECV:
+	case TCP_LISTEN:
+	case TCP_SYN_SENT:
+		return;
+	}
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	bpf_skops_init_skb(&sock_ops, skb, TCP_SKB_CB(skb)->bpf_hdr_opt_off);
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+
+static void bpf_skops_established(struct sock *sk, bool passive,
+				  struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	if (passive)
+		sock_ops.op = BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB;
+	else
+		sock_ops.op = BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb,
+				   TCP_SKB_CB(skb)->bpf_hdr_opt_off);
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+#else
+static void bpf_skops_parse_bpf_hdr(struct sock *sk, struct sk_buff *skb)
+{
+}
+
+static void bpf_skops_established(struct sock *sk, bool passive,
+				  struct sk_buff *skb)
+{
+}
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -5551,6 +5614,9 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 	}
 
+	if (TCP_SKB_CB(skb)->bpf_hdr_opt_off)
+		bpf_skops_parse_bpf_hdr(sk, skb);
+
 	return true;
 
 discard:
@@ -5759,7 +5825,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
-void tcp_init_transfer(struct sock *sk, int bpf_op)
+void tcp_init_transfer(struct sock *sk, bool passive, struct sk_buff *skb)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -5780,7 +5846,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
 		tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk));
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	tcp_call_bpf(sk, bpf_op, 0, NULL);
+	bpf_skops_established(sk, passive, skb);
 	tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
@@ -5799,7 +5865,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 		sk_mark_napi_id(sk, skb);
 	}
 
-	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
+	tcp_init_transfer(sk, false, skb);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data
 	 * packet.
@@ -6271,7 +6337,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else {
 			tcp_try_undo_spurious_syn(sk);
 			tp->retrans_stamp = 0;
-			tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+			tcp_init_transfer(sk, true, skb);
 			WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 		}
 		smp_mb();
@@ -6718,7 +6784,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
 	}
 	if (fastopen_sk) {
-		af_ops->send_synack(fastopen_sk, dst, &fl, req,
+		af_ops->send_synack(fastopen_sk, dst, &fl, req, skb,
 				    &foc, TCP_SYNACK_FASTOPEN);
 		/* Add the child socket directly into the accept queue */
 		if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
@@ -6735,7 +6801,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		if (!want_cookie)
 			inet_csk_reqsk_queue_hash_add(sk, req,
 				tcp_timeout_init((struct sock *)req));
-		af_ops->send_synack(sk, dst, &fl, req, &foc,
+
+		af_ops->send_synack(sk, dst, &fl, req, skb, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :
 						   TCP_SYNACK_COOKIE);
 		if (want_cookie) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3535b7fe002..8a0b62fb4f79 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -963,6 +963,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
 static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
+			      struct sk_buff *syn_skb,
 			      struct tcp_fastopen_cookie *foc,
 			      enum tcp_synack_type synack_type)
 {
@@ -975,7 +976,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
 		return -1;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, syn_skb, foc, synack_type);
 
 	if (skb) {
 		__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 495dda2449fe..56c306e3cd2f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
+	bpf_skops_init_child(sk, newsk);
 	tcp_bpf_clone(sk, newsk);
 
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a50e1990a845..a78a29980e1f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,6 +438,7 @@ struct tcp_out_options {
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
+	u8 bpf_hdr_opt_len;	/* length of BPF hdr option */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
@@ -452,6 +453,152 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
 #endif
 }
 
+#ifdef CONFIG_CGROUP_BPF
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  bool want_cookie,
+				  struct tcp_out_options *opts,
+				  unsigned int *res_remaining)
+{
+	/* res_remaining has already been aligned to 4 bytes */
+	unsigned int remaining = *res_remaining;
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
+	    remaining <= TCPOLEN_EXP_BPF_BASE)
+		return;
+
+	BUILD_BUG_ON(TCPOLEN_EXP_BPF_BASE != 4);
+	/* remaining is still rounded to 4 bytes */
+	remaining -= TCPOLEN_EXP_BPF_BASE;
+
+	/* init sock_ops */
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB;
+
+	if (req) {
+		/* The listen "sk" cannot be passed here because
+		 * it is not locked.  It would not make too much
+		 * sense to do bpf_setsockopt(listen_sk) based
+	 * on an individual connection request either.
+		 *
+		 * Thus, "req" is passed here and the cgroup-bpf-progs
+		 * of the listen "sk" will be run.
+		 *
+		 * "req" is also used here for fastopen even though the
+		 * "sk" here is the "child" sk.  It is to keep the behavior
+		 * consistent between fastopen and non-fastopen on
+		 * the bpf programming side.
+		 */
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	/* tcp_current_mss() does not pass a skb */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb, 0);
+
+	sock_ops.args[0] = remaining;
+	sock_ops.args[1] = want_cookie;
+	sock_ops.reservable_hdr_opt_len = remaining;
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	if (err || !sock_ops.write_hdr_opt_len)
+		return;
+
+	opts->bpf_hdr_opt_len = sock_ops.write_hdr_opt_len +
+		TCPOLEN_EXP_BPF_BASE;
+	/* round up to 4 bytes */
+	opts->bpf_hdr_opt_len = (opts->bpf_hdr_opt_len + 3) & ~3;
+
+	*res_remaining -= opts->bpf_hdr_opt_len;
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    bool want_cookie,
+				    struct tcp_out_options *opts)
+{
+	u8 *kind_start, kind_len, max_kind_len = opts->bpf_hdr_opt_len;
+	struct tcphdr *th = (struct tcphdr *)skb->data;
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!max_kind_len))
+		return;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
+
+	if (req) {
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	bpf_skops_init_skb(&sock_ops, skb, __tcp_hdrlen(th) - max_kind_len);
+
+	sock_ops.args[0] = max_kind_len - TCPOLEN_EXP_BPF_BASE;
+	sock_ops.args[1] = want_cookie;
+	sock_ops.write_hdr_opt_len = sock_ops.args[0];
+
+	kind_start = skb->data + sock_ops.skb_bpf_hdr_opt_off;
+	*(__be32 *)kind_start = htonl((TCPOPT_EXP << 24) |
+				      (TCPOLEN_EXP_BPF_BASE << 16) |
+				      TCPOPT_BPF_MAGIC);
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	kind_len = kind_start[1];
+	if (err || kind_len == TCPOLEN_EXP_BPF_BASE) {
+		/* zero out the unwritten memory.
+		 *
+		 * bpf is the last one writing options, so it
+		 * can end with TCPOPT_EOL under abnormal cases.
+		 */
+		memset(kind_start, 0, max_kind_len);
+		kind_start[0] = TCPOPT_EOL;
+	} else if (kind_len < max_kind_len) {
+		/* Pad by NOP for normal case */
+		memset(kind_start + kind_len, TCPOPT_NOP,
+		       max_kind_len - kind_len);
+	}
+}
+#else
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  bool want_cookie,
+				  struct tcp_out_options *opts,
+				  unsigned int *res_remaining)
+{
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    bool want_cookie,
+				    struct tcp_out_options *opts)
+{
+}
+#endif
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -691,12 +838,16 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, false, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
 /* Set up TCP options for SYN-ACKs. */
 static unsigned int tcp_synack_options(const struct sock *sk,
 				       struct request_sock *req,
+				       struct sk_buff *syn_skb,
+				       bool want_cookie,
 				       unsigned int mss, struct sk_buff *skb,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
@@ -756,15 +907,19 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
+	bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb,
+			      want_cookie, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
 /* Compute TCP options for ESTABLISHED sockets. This is not the
  * final wire format yet.
  */
-static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb,
-					struct tcp_out_options *opts,
-					struct tcp_md5sig_key **md5)
+static unsigned int tcp_established_options(struct sock *sk,
+					    struct sk_buff *skb,
+					    struct tcp_out_options *opts,
+					    struct tcp_md5sig_key **md5)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int size = 0;
@@ -824,6 +979,16 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
+	if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
+					    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
+		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, false, opts,
+				      &remaining);
+
+		size = MAX_TCP_OPTION_SPACE - remaining;
+	}
+
 	return size;
 }
 
@@ -1206,6 +1371,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 					       md5, sk, skb);
 	}
 #endif
+	bpf_skops_write_hdr_opt(sk, skb, NULL, NULL, false, &opts);
 
 	icsk->icsk_af_ops->send_check(sk, skb);
 
@@ -3333,9 +3499,11 @@ int tcp_send_synack(struct sock *sk)
  */
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
+				struct sk_buff *syn_skb,
 				struct tcp_fastopen_cookie *foc,
 				enum tcp_synack_type synack_type)
 {
+	bool want_cookie = (synack_type == TCP_SYNACK_COOKIE);
 	struct inet_request_sock *ireq = inet_rsk(req);
 	const struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_md5sig_key *md5 = NULL;
@@ -3393,8 +3561,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
-	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
-					     foc) + sizeof(*th);
+	/* bpf program will be interested in the tcp_flags */
+	TCP_SKB_CB(skb)->tcp_flags = TCPHDR_SYN | TCPHDR_ACK;
+	tcp_header_size = tcp_synack_options(sk, req, syn_skb, want_cookie,
+					     mss, skb, &opts,
+					     md5, foc) + sizeof(*th);
 
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
@@ -3416,6 +3587,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	th->window = htons(min(req->rsk_rcv_wnd, 65535U));
 	tcp_options_write((__be32 *)(th + 1), NULL, &opts);
 	th->doff = (tcp_header_size >> 2);
+	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
+				want_cookie, &opts);
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -3919,7 +4092,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
 	int res;
 
 	tcp_rsk(req)->txhash = net_tx_rndhash();
-	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL);
+	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, NULL,
+				  TCP_SYNACK_NORMAL);
 	if (!res) {
 		__TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8356d0562279..3036ae6532dd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -500,6 +500,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
+			      struct sk_buff *syn_skb,
 			      struct tcp_fastopen_cookie *foc,
 			      enum tcp_synack_type synack_type)
 {
@@ -515,7 +516,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 					       IPPROTO_TCP)) == NULL)
 		goto done;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, syn_skb, foc, synack_type);
 
 	if (skb) {
 		__tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0cb8ec948816..479b83d05811 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3285,6 +3285,110 @@ union bpf_attr {
  *		Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
  *	Return
  *		*sk* if casting is valid, or NULL otherwise.
+ *
+ * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, u32 offset, const void *from, u32 len, u64 flags)
+ *	Description
+ *		Store BPF header option.  The data will be copied
+ *		from buffer *from* with length *len* to the bpf header
+ *		option at *offset*.
+ *
+ *		This helper does not allow updating the header "kind",
+ *		"kind length", and the 16-bit magic.  *offset* 0
+ *		points immediately after the 16-bit magic.
+ *		This helper will automatically update the "kind length" as
+ *		the bpf header option is written by this helper.  It can
+ *		guarantee that the header option format always
+ *		conforms to the standard.
+ *
+ *		The bpf program can write everything at once
+ *		from offset 0.
+ *
+ *		The bpf program can also write incrementally by calling
+ *		this helper multiple times with increasing (but
+ *		contiguous) offsets.
+ *
+ *		The bpf program (e.g. the parent cgroup bpf program)
+ *		can also rewrite a few bytes that have already been
+ *		written by earlier bpf programs.  The bpf
+ *		program can also learn what header options have been
+ *		written by the earlier bpf programs through
+ *		sock_ops->skb_data.
+ *
+ *		A bpf program can always write at offset 0 again
+ *		to overwrite all bpf header options that have been
+ *		written by the previous bpf programs.  If the bpf
+ *		program writes less than the previous bpf programs
+ *		did, the trailing bytes will be removed unless
+ *		*BPF_F_STORE_HDR_OPT_NOTRIM* is used.
+ *		e.g. bpf_store_hdr_opt(skops, 0, NULL, 0, 0) will
+ *		remove all previously written options.
+ *
+ *		This helper will reject the *offset* if there is any
+ *		unwritten byte (i.e. a hole) before this *offset*.
+ *
+ *		For example, a bpf prog can write at offset 0 for 2 bytes,
+ *		then at offset 2 for 1 byte, and finally at offset 3
+ *		for 2 bytes.  Thus, the bpf prog has written 5 bytes
+ *		in total.
+ *		The "kind length" will be 9 bytes:
+ *		4 bytes (1byte kind + 1byte kind_len + 2byte magic) +
+ *		5 bytes written by the bpf program.
+ *
+ *		However, this helper does not allow writing at offset 0
+ *		for 2 bytes and then immediately writing at offset 3
+ *		without writing at offset 2 first.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.  The previous reserved
+ *		space is passed in sock_ops->args[0].
+ *
+ *		The following flags are supported:
+ *		* **BPF_F_STORE_HDR_OPT_NOTRIM**: Usually, a bpf program
+ *		  writes with increasing offsets instead of
+ *		  jumping the offset back and forth.
+ *		  Thus, by default, after calling this helper,
+ *		  anything previously written after *offset* + *len*
+ *		  will be removed.
+ *		  This flag can be used to keep the trailing
+ *		  data after *offset* + *len*.
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if a param is invalid
+ *
+ *		**-EFAULT** *offset* + *len* is outside of the
+ *		            reserved option space, or
+ *			    there is an unwritten byte before *offset*.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
+ *	Description
+ *		Reserve *len* bytes for the bpf header option.  This
+ *		space will be available to use by bpf_store_hdr_opt()
+ *		later.
+ *
+ *		If bpf_reserve_hdr_opt() is called multiple times (e.g.
+ *		by multiple cgroup bpf programs),
+ *		the maximum *len* among them will be reserved.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_HDR_OPT_LEN_CB.  The maximum
+ *		reservable space is passed in
+ *		sock_ops->args[0].
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if a param is invalid
+ *
+ *		**-ENOSPC** Not enough remaining space in the header.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3427,7 +3531,9 @@ union bpf_attr {
 	FN(skc_to_tcp_sock),		\
 	FN(skc_to_tcp_timewait_sock),	\
 	FN(skc_to_tcp_request_sock),	\
-	FN(skc_to_udp6_sock),
+	FN(skc_to_udp6_sock),		\
+	FN(store_hdr_opt),		\
+	FN(reserve_hdr_opt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -4024,6 +4130,11 @@ struct bpf_sock_ops {
 	__u64 bytes_received;
 	__u64 bytes_acked;
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, skb_data);
+	__bpf_md_ptr(void *, skb_data_end);
+	__u32 skb_len;
+	__u32 skb_bpf_hdr_opt_off;
+	__u32 skb_tcp_flags;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -4032,8 +4143,14 @@ enum {
 	BPF_SOCK_OPS_RETRANS_CB_FLAG	= (1<<1),
 	BPF_SOCK_OPS_STATE_CB_FLAG	= (1<<2),
 	BPF_SOCK_OPS_RTT_CB_FLAG	= (1<<3),
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG = (1<<4),
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<5),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xF,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x3F,
+};
+
+enum {
+	BPF_F_STORE_HDR_OPT_NOTRIM	= (1ULL << 0),
 };
 
 /* List of known BPF sock_ops operators.
@@ -4089,6 +4206,58 @@ enum {
 					 */
 	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
 					 */
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB,	/* Parse the BPF header option.
+					 * It will be called to handle
+					 * the packets received at a
+					 * socket that has been established.
+					 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CB,	/* Reserve space for writing the BPF
+					 * header option later in
+					 * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+					 * Arg1: Max reservable bytes
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	Referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP payload because
+					 *	no TCP header has
+					 *	been written yet.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *      SYN, ACK, FIN).
+					 *
+					 * bpf_reserve_hdr_opt() should
+					 * be used to reserve space.
+					 */
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB,	/* Write the BPF header OPTIONS
+					 * Arg1: num of bytes reserved during
+					 *       BPF_SOCK_OPS_HDR_OPT_LEN_CB
+					 * Arg2: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 *	referring to the
+					 *	outgoing skb.
+					 *	It is pointing to the
+					 *	TCP header.
+					 *
+					 * sock_ops->skb_bpf_hdr_opt_off:
+					 *	offset to the bpf
+					 *	header option.  The bpf
+					 *	prog can inspect what bpf
+					 *	options have already been
+					 *	written.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 *	the tcp_flags of the
+					 *	outgoing skb. (e.g.
+					 *	SYN, ACK, FIN).
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -4116,6 +4285,20 @@ enum {
 enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
+	/* Copy the BPF header option from syn pkt to optval
+	 *
+	 * BPF_PROG_TYPE_SOCK_OPS only.  It is similar to
+	 * TCP_SAVED_SYN but it only gets the bpf header option
+	 * from the syn packet.
+	 *
+	 *       0: Success
+	 * -ENOSPC: Not enough space in optval. Only optlen bytes
+	 *          are copied.
+	 * -ENOMSG: BPF TCP header option not found
+	 * -ENOENT: The SYN skb is not available now and the earlier SYN pkt
+	 *	    was not saved by TCP_SAVE_SYN.
+	 */
+	TCP_BPF_SYN_HDR_OPT	= 1003,
 };
 
 struct bpf_perf_event_value {
-- 
2.24.1
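
For a quick feel of the API above, a sock_ops program could reserve and
then write a 1-byte option like this (a minimal sketch, not part of the
patch: the program name and payload byte are made up, error handling is
trimmed, and it assumes the helper and flag definitions added by this
series are available through bpf_helpers.h):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sockops")
int write_one_byte_opt(struct bpf_sock_ops *skops)
{
	/* illustrative payload; on the wire the full option becomes
	 * the TCPOPT_EXP kind, the kind_len, the 16-bit magic, then
	 * this byte, rounded up to a 4-byte boundary by the kernel.
	 */
	__u8 payload = 0xaa;

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* WRITE_HDR_OPT_CB_FLAG gates both the HDR_OPT_LEN
		 * and the WRITE_HDR_OPT callbacks below.
		 */
		bpf_sock_ops_cb_flags_set(skops,
					  skops->bpf_sock_ops_cb_flags |
					  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
		break;
	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
		/* args[0] carries the max reservable bytes */
		if (skops->args[0] >= sizeof(payload))
			bpf_reserve_hdr_opt(skops, sizeof(payload), 0);
		break;
	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
		/* offset 0 points right after the 16-bit magic */
		bpf_store_hdr_opt(skops, 0, &payload, sizeof(payload), 0);
		break;
	}

	return 1;
}

char _license[] SEC("license") = "GPL";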


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 05/10] bpf: selftests: A few improvements to network_helpers.c
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2020-06-26 17:55 ` [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 06/10] bpf: selftests: Add fastopen_connect to network_helpers Martin KaFai Lau
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch makes a few changes to network_helpers.c:

1) Enforce SO_RCVTIMEO and SO_SNDTIMEO
   This patch enforces a timeout on the network fds through the
   setsockopt SO_RCVTIMEO and SO_SNDTIMEO options.

   It removes the need for SOCK_NONBLOCK, which requires a more demanding
   timeout logic with epoll/select, e.g. epoll_create, epoll_ctl, and
   then epoll_wait for the timeout.

   That also removes the need for connect_wait() in
   cgroup_skb_sk_lookup.c, which is updated accordingly.

2) start_server():
   Add optional addr_str and port to start_server().
   That removes the need for start_server_with_port().  The caller
   can pass addr_str==NULL and/or port==0.

   "int timeout_ms" is also added.

3) connect_to_fd(): Fully use the server_fd.
   The server sock address has already been obtained from
   getsockname(server_fd).  The sockaddr includes the family,
   so the "int family" arg is redundant.

   Since the server address is obtained from server_fd,  there
   is little reason not to get the server's socket type from the
   server_fd also.  getsockopt(server_fd) can be used to do that,
   so "int type" arg is also removed.

   "int timeout_ms" is added.

4) connect_fd_to_fd():
   "int timeout_ms" is added.
   Some code is also refactored to connect_fd_to_addr() which is
   shared with connect_to_fd().

5) Preserve errno:
   Some callers need to check errno, e.g. cgroup_skb_sk_lookup.c.
   Make changes to do it more consistently in save_errno_close()
   and log_err().
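
Taken together, a test can now be written as below (a sketch using the
new signatures; the demo_connect name is illustrative):

	#include <unistd.h>
	#include <sys/socket.h>
	#include "network_helpers.h"

	static int demo_connect(void)
	{
		int srv_fd, cli_fd, err = -1;

		/* NULL addr and port 0 keep the old defaults; a
		 * timeout_ms of 0 falls back to the 3s default in
		 * settimeo().
		 */
		srv_fd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
		if (srv_fd < 0)
			return -1;

		/* family and type are learned from srv_fd itself via
		 * getsockname() and getsockopt(SO_TYPE), so they are
		 * no longer passed in.
		 */
		cli_fd = connect_to_fd(srv_fd, 0);
		if (cli_fd >= 0) {
			err = 0;
			close(cli_fd);
		}

		close(srv_fd);
		return err;
	}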

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/network_helpers.c | 157 +++++++++++-------
 tools/testing/selftests/bpf/network_helpers.h |   9 +-
 .../bpf/prog_tests/cgroup_skb_sk_lookup.c     |  12 +-
 .../bpf/prog_tests/connect_force_port.c       |  10 +-
 .../bpf/prog_tests/load_bytes_relative.c      |   4 +-
 .../selftests/bpf/prog_tests/tcp_rtt.c        |   4 +-
 6 files changed, 110 insertions(+), 86 deletions(-)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index e36dd1a1780d..1a371d3eca7d 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -7,8 +7,6 @@
 
 #include <arpa/inet.h>
 
-#include <sys/epoll.h>
-
 #include <linux/err.h>
 #include <linux/in.h>
 #include <linux/in6.h>
@@ -17,8 +15,13 @@
 #include "network_helpers.h"
 
 #define clean_errno() (errno == 0 ? "None" : strerror(errno))
-#define log_err(MSG, ...) fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \
-	__FILE__, __LINE__, clean_errno(), ##__VA_ARGS__)
+#define log_err(MSG, ...) ({						\
+			int save = errno;				\
+			fprintf(stderr, "(%s:%d: errno: %s) " MSG "\n", \
+				__FILE__, __LINE__, clean_errno(),	\
+				##__VA_ARGS__);				\
+			errno = save;					\
+})
 
 struct ipv4_packet pkt_v4 = {
 	.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
@@ -37,7 +40,34 @@ struct ipv6_packet pkt_v6 = {
 	.tcp.doff = 5,
 };
 
-int start_server_with_port(int family, int type, __u16 port)
+static int settimeo(int fd, int timeout_ms)
+{
+	struct timeval timeout = { .tv_sec = 3 };
+
+	if (timeout_ms > 0) {
+		timeout.tv_sec = timeout_ms / 1000;
+		timeout.tv_usec = (timeout_ms % 1000) * 1000;
+	}
+
+	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeout,
+		       sizeof(timeout))) {
+		log_err("Failed to set SO_RCVTIMEO");
+		return -1;
+	}
+
+	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeout,
+		       sizeof(timeout))) {
+		log_err("Failed to set SO_SNDTIMEO");
+		return -1;
+	}
+
+	return 0;
+}
+
+#define save_errno_close(fd) ({ int save = errno; close(fd); errno = save; })
+
+int start_server(int family, int type, const char *addr_str, __u16 port,
+		 int timeout_ms)
 {
 	struct sockaddr_storage addr = {};
 	socklen_t len;
@@ -48,120 +78,119 @@ int start_server_with_port(int family, int type, __u16 port)
 
 		sin->sin_family = AF_INET;
 		sin->sin_port = htons(port);
+		if (addr_str &&
+		    inet_pton(AF_INET, addr_str, &sin->sin_addr) != 1) {
+			log_err("inet_pton(AF_INET, %s)", addr_str);
+			return -1;
+		}
 		len = sizeof(*sin);
 	} else {
 		struct sockaddr_in6 *sin6 = (void *)&addr;
 
 		sin6->sin6_family = AF_INET6;
 		sin6->sin6_port = htons(port);
+		if (addr_str &&
+		    inet_pton(AF_INET6, addr_str, &sin6->sin6_addr) != 1) {
+			log_err("inet_pton(AF_INET6, %s)", addr_str);
+			return -1;
+		}
 		len = sizeof(*sin6);
 	}
 
-	fd = socket(family, type | SOCK_NONBLOCK, 0);
+	fd = socket(family, type, 0);
 	if (fd < 0) {
 		log_err("Failed to create server socket");
 		return -1;
 	}
 
+	if (settimeo(fd, timeout_ms))
+		goto error_close;
+
 	if (bind(fd, (const struct sockaddr *)&addr, len) < 0) {
 		log_err("Failed to bind socket");
-		close(fd);
-		return -1;
+		goto error_close;
 	}
 
 	if (type == SOCK_STREAM) {
 		if (listen(fd, 1) < 0) {
 			log_err("Failed to listed on socket");
-			close(fd);
-			return -1;
+			goto error_close;
 		}
 	}
 
 	return fd;
-}
 
-int start_server(int family, int type)
-{
-	return start_server_with_port(family, type, 0);
+error_close:
+	save_errno_close(fd);
+	return -1;
 }
 
-static const struct timeval timeo_sec = { .tv_sec = 3 };
-static const size_t timeo_optlen = sizeof(timeo_sec);
-
-int connect_to_fd(int family, int type, int server_fd)
+static int connect_fd_to_addr(int fd,
+			      const struct sockaddr_storage *addr,
+			      socklen_t addrlen)
 {
-	int fd, save_errno;
-
-	fd = socket(family, type, 0);
-	if (fd < 0) {
-		log_err("Failed to create client socket");
+	if (connect(fd, (const struct sockaddr *)addr, addrlen)) {
+		log_err("Failed to connect to server");
 		return -1;
 	}
 
-	if (connect_fd_to_fd(fd, server_fd) < 0 && errno != EINPROGRESS) {
-		save_errno = errno;
-		close(fd);
-		errno = save_errno;
-		return -1;
-	}
-
-	return fd;
+	return 0;
 }
 
-int connect_fd_to_fd(int client_fd, int server_fd)
+int connect_to_fd(int server_fd, int timeout_ms)
 {
 	struct sockaddr_storage addr;
-	socklen_t len = sizeof(addr);
-	int save_errno;
+	struct sockaddr_in *addr_in;
+	socklen_t addrlen, optlen;
+	int fd, type;
 
-	if (setsockopt(client_fd, SOL_SOCKET, SO_RCVTIMEO, &timeo_sec,
-		       timeo_optlen)) {
-		log_err("Failed to set SO_RCVTIMEO");
+	optlen = sizeof(type);
+	if (getsockopt(server_fd, SOL_SOCKET, SO_TYPE, &type, &optlen)) {
+		log_err("getsockopt(SOL_TYPE)");
 		return -1;
 	}
 
-	if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) {
+	addrlen = sizeof(addr);
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &addrlen)) {
 		log_err("Failed to get server addr");
 		return -1;
 	}
 
-	if (connect(client_fd, (const struct sockaddr *)&addr, len) < 0) {
-		if (errno != EINPROGRESS) {
-			save_errno = errno;
-			log_err("Failed to connect to server");
-			errno = save_errno;
-		}
+	addr_in = (struct sockaddr_in *)&addr;
+	fd = socket(addr_in->sin_family, type, 0);
+	if (fd < 0) {
+		log_err("Failed to create client socket");
 		return -1;
 	}
 
-	return 0;
+	if (settimeo(fd, timeout_ms))
+		goto error_close;
+
+	if (connect_fd_to_addr(fd, &addr, addrlen))
+		goto error_close;
+
+	return fd;
+
+error_close:
+	save_errno_close(fd);
+	return -1;
 }
 
-int connect_wait(int fd)
+int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms)
 {
-	struct epoll_event ev = {}, events[2];
-	int timeout_ms = 1000;
-	int efd, nfd;
+	struct sockaddr_storage addr;
+	socklen_t len = sizeof(addr);
 
-	efd = epoll_create1(EPOLL_CLOEXEC);
-	if (efd < 0) {
-		log_err("Failed to open epoll fd");
+	if (settimeo(client_fd, timeout_ms))
 		return -1;
-	}
-
-	ev.events = EPOLLRDHUP | EPOLLOUT;
-	ev.data.fd = fd;
 
-	if (epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev) < 0) {
-		log_err("Failed to register fd=%d on epoll fd=%d", fd, efd);
-		close(efd);
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &len)) {
+		log_err("Failed to get server addr");
 		return -1;
 	}
 
-	nfd = epoll_wait(efd, events, ARRAY_SIZE(events), timeout_ms);
-	if (nfd < 0)
-		log_err("Failed to wait for I/O event on epoll fd=%d", efd);
+	if (connect_fd_to_addr(client_fd, &addr, len))
+		return -1;
 
-	close(efd);
-	return nfd;
+	return 0;
 }
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 6a8009605670..f580e82fda58 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -33,10 +33,9 @@ struct ipv6_packet {
 } __packed;
 extern struct ipv6_packet pkt_v6;
 
-int start_server(int family, int type);
-int start_server_with_port(int family, int type, __u16 port);
-int connect_to_fd(int family, int type, int server_fd);
-int connect_fd_to_fd(int client_fd, int server_fd);
-int connect_wait(int client_fd);
+int start_server(int family, int type, const char *addr, __u16 port,
+		 int timeout_ms);
+int connect_to_fd(int server_fd, int timeout_ms);
+int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
 
 #endif
diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_skb_sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/cgroup_skb_sk_lookup.c
index 059047af7df3..464edc1c1708 100644
--- a/tools/testing/selftests/bpf/prog_tests/cgroup_skb_sk_lookup.c
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_skb_sk_lookup.c
@@ -13,7 +13,7 @@ static void run_lookup_test(__u16 *g_serv_port, int out_sk)
 	socklen_t addr_len = sizeof(addr);
 	__u32 duration = 0;
 
-	serv_sk = start_server(AF_INET6, SOCK_STREAM);
+	serv_sk = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
 	if (CHECK(serv_sk < 0, "start_server", "failed to start server\n"))
 		return;
 
@@ -24,17 +24,13 @@ static void run_lookup_test(__u16 *g_serv_port, int out_sk)
 	*g_serv_port = addr.sin6_port;
 
 	/* Client outside of test cgroup should fail to connect by timeout. */
-	err = connect_fd_to_fd(out_sk, serv_sk);
+	err = connect_fd_to_fd(out_sk, serv_sk, 1000);
 	if (CHECK(!err || errno != EINPROGRESS, "connect_fd_to_fd",
 		  "unexpected result err %d errno %d\n", err, errno))
 		goto cleanup;
 
-	err = connect_wait(out_sk);
-	if (CHECK(err, "connect_wait", "unexpected result %d\n", err))
-		goto cleanup;
-
 	/* Client inside test cgroup should connect just fine. */
-	in_sk = connect_to_fd(AF_INET6, SOCK_STREAM, serv_sk);
+	in_sk = connect_to_fd(serv_sk, 0);
 	if (CHECK(in_sk < 0, "connect_to_fd", "errno %d\n", errno))
 		goto cleanup;
 
@@ -85,7 +81,7 @@ void test_cgroup_skb_sk_lookup(void)
 	 * differs from that of testing cgroup. Moving selftests process to
 	 * testing cgroup won't change cgroup id of an already created socket.
 	 */
-	out_sk = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	out_sk = socket(AF_INET6, SOCK_STREAM, 0);
 	if (CHECK_FAIL(out_sk < 0))
 		return;
 
diff --git a/tools/testing/selftests/bpf/prog_tests/connect_force_port.c b/tools/testing/selftests/bpf/prog_tests/connect_force_port.c
index 17bbf76812ca..9229db2f5ca5 100644
--- a/tools/testing/selftests/bpf/prog_tests/connect_force_port.c
+++ b/tools/testing/selftests/bpf/prog_tests/connect_force_port.c
@@ -114,7 +114,7 @@ static int run_test(int cgroup_fd, int server_fd, int family, int type)
 		goto close_bpf_object;
 	}
 
-	fd = connect_to_fd(family, type, server_fd);
+	fd = connect_to_fd(server_fd, 0);
 	if (fd < 0) {
 		err = -1;
 		goto close_bpf_object;
@@ -137,25 +137,25 @@ void test_connect_force_port(void)
 	if (CHECK_FAIL(cgroup_fd < 0))
 		return;
 
-	server_fd = start_server_with_port(AF_INET, SOCK_STREAM, 60123);
+	server_fd = start_server(AF_INET, SOCK_STREAM, NULL, 60123, 0);
 	if (CHECK_FAIL(server_fd < 0))
 		goto close_cgroup_fd;
 	CHECK_FAIL(run_test(cgroup_fd, server_fd, AF_INET, SOCK_STREAM));
 	close(server_fd);
 
-	server_fd = start_server_with_port(AF_INET6, SOCK_STREAM, 60124);
+	server_fd = start_server(AF_INET6, SOCK_STREAM, NULL, 60124, 0);
 	if (CHECK_FAIL(server_fd < 0))
 		goto close_cgroup_fd;
 	CHECK_FAIL(run_test(cgroup_fd, server_fd, AF_INET6, SOCK_STREAM));
 	close(server_fd);
 
-	server_fd = start_server_with_port(AF_INET, SOCK_DGRAM, 60123);
+	server_fd = start_server(AF_INET, SOCK_DGRAM, NULL, 60123, 0);
 	if (CHECK_FAIL(server_fd < 0))
 		goto close_cgroup_fd;
 	CHECK_FAIL(run_test(cgroup_fd, server_fd, AF_INET, SOCK_DGRAM));
 	close(server_fd);
 
-	server_fd = start_server_with_port(AF_INET6, SOCK_DGRAM, 60124);
+	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 60124, 0);
 	if (CHECK_FAIL(server_fd < 0))
 		goto close_cgroup_fd;
 	CHECK_FAIL(run_test(cgroup_fd, server_fd, AF_INET6, SOCK_DGRAM));
diff --git a/tools/testing/selftests/bpf/prog_tests/load_bytes_relative.c b/tools/testing/selftests/bpf/prog_tests/load_bytes_relative.c
index c1168e4a9036..5a2a689dbb68 100644
--- a/tools/testing/selftests/bpf/prog_tests/load_bytes_relative.c
+++ b/tools/testing/selftests/bpf/prog_tests/load_bytes_relative.c
@@ -23,7 +23,7 @@ void test_load_bytes_relative(void)
 	if (CHECK_FAIL(cgroup_fd < 0))
 		return;
 
-	server_fd = start_server(AF_INET, SOCK_STREAM);
+	server_fd = start_server(AF_INET, SOCK_STREAM, NULL, 0, 0);
 	if (CHECK_FAIL(server_fd < 0))
 		goto close_cgroup_fd;
 
@@ -49,7 +49,7 @@ void test_load_bytes_relative(void)
 	if (CHECK_FAIL(err))
 		goto close_bpf_object;
 
-	client_fd = connect_to_fd(AF_INET, SOCK_STREAM, server_fd);
+	client_fd = connect_to_fd(server_fd, 0);
 	if (CHECK_FAIL(client_fd < 0))
 		goto close_bpf_object;
 	close(client_fd);
diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_rtt.c b/tools/testing/selftests/bpf/prog_tests/tcp_rtt.c
index 9013a0c01eed..d207e968e6b1 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcp_rtt.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_rtt.c
@@ -118,7 +118,7 @@ static int run_test(int cgroup_fd, int server_fd)
 		goto close_bpf_object;
 	}
 
-	client_fd = connect_to_fd(AF_INET, SOCK_STREAM, server_fd);
+	client_fd = connect_to_fd(server_fd, 0);
 	if (client_fd < 0) {
 		err = -1;
 		goto close_bpf_object;
@@ -161,7 +161,7 @@ void test_tcp_rtt(void)
 	if (CHECK_FAIL(cgroup_fd < 0))
 		return;
 
-	server_fd = start_server(AF_INET, SOCK_STREAM);
+	server_fd = start_server(AF_INET, SOCK_STREAM, NULL, 0, 0);
 	if (CHECK_FAIL(server_fd < 0))
 		goto close_cgroup_fd;
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 06/10] bpf: selftests: Add fastopen_connect to network_helpers
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2020-06-26 17:55 ` [PATCH bpf-next 05/10] bpf: selftests: A few improvements to network_helpers.c Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test Martin KaFai Lau
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch adds a fastopen_connect() helper which will
be used in a later test.
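
For context, a caller is expected to use it roughly as below (a sketch;
it assumes the listener came from start_server() and that the
tcp_fastopen sysctl allows cookie-less fastopen):

	#include <unistd.h>
	#include <sys/socket.h>
	#include "network_helpers.h"

	static int demo_fastopen(int srv_fd)
	{
		const char data[] = "FAST!!!";
		char buf[sizeof(data)];
		int cli_fd, passive_fd, ret = -1;

		/* "data" rides in the SYN via sendto(MSG_FASTOPEN) */
		cli_fd = fastopen_connect(srv_fd, data, sizeof(data), 0);
		if (cli_fd < 0)
			return -1;

		passive_fd = accept(srv_fd, NULL, 0);
		if (passive_fd >= 0) {
			/* the SYN data is readable right after accept() */
			if (read(passive_fd, buf, sizeof(buf)) == sizeof(data))
				ret = 0;
			close(passive_fd);
		}

		close(cli_fd);
		return ret;
	}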

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/network_helpers.c | 37 +++++++++++++++++++
 tools/testing/selftests/bpf/network_helpers.h |  2 +
 2 files changed, 39 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 1a371d3eca7d..93028f0d4081 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -125,6 +125,43 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
 	return -1;
 }
 
+int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
+		     int timeout_ms)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen = sizeof(addr);
+	struct sockaddr_in *addr_in;
+	int fd, ret;
+
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &addrlen)) {
+		log_err("Failed to get server addr");
+		return -1;
+	}
+
+	addr_in = (struct sockaddr_in *)&addr;
+	fd = socket(addr_in->sin_family, SOCK_STREAM, 0);
+	if (fd < 0) {
+		log_err("Failed to create client socket");
+		return -1;
+	}
+
+	if (settimeo(fd, timeout_ms))
+		goto error_close;
+
+	ret = sendto(fd, data, data_len, MSG_FASTOPEN, (struct sockaddr *)&addr,
+		     addrlen);
+	if (ret != data_len) {
+		log_err("sendto(data, %u) != %d\n", data_len, ret);
+		goto error_close;
+	}
+
+	return fd;
+
+error_close:
+	save_errno_close(fd);
+	return -1;
+}
+
 static int connect_fd_to_addr(int fd,
 			      const struct sockaddr_storage *addr,
 			      socklen_t addrlen)
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index f580e82fda58..c4827941327a 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -37,5 +37,7 @@ int start_server(int family, int type, const char *addr, __u16 port,
 		 int timeout_ms);
 int connect_to_fd(int server_fd, int timeout_ms);
 int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
+int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
+		     int timeout_ms);
 
 #endif
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2020-06-26 17:55 ` [PATCH bpf-next 06/10] bpf: selftests: Add fastopen_connect to network_helpers Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-26 22:45   ` Andrii Nakryiko
  2020-06-26 17:55 ` [PATCH bpf-next 08/10] bpf: selftests: tcp header options Martin KaFai Lau
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

It is common for a networking test to create its own netns and to make
its own settings under this new netns (e.g. changing a tcp sysctl).  If
the test forgets to restore the original netns, it will affect the
results of the other tests.

This patch saves the original netns at the beginning and then restores it
after every test.  Since the restoring setns() is inexpensive, it is done
for all tests without tracking whether a test has created a new netns.
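
In other words, a test may unshare() into a throwaway netns and simply
return; the runner brings it back.  A compressed sketch of both sides,
outside of the test_progs plumbing:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sched.h>
	#include <unistd.h>

	static int saved_netns_fd;

	static void a_netns_test(void)
	{
		/* move into a fresh netns, tweak sysctls, run, return */
		unshare(CLONE_NEWNET);
	}

	int main(void)
	{
		/* saved once, before any test runs */
		saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
		if (saved_netns_fd < 0)
			return 1;

		a_netns_test();

		/* restored unconditionally after every test */
		if (setns(saved_netns_fd, CLONE_NEWNET))
			return 1;

		close(saved_netns_fd);
		return 0;
	}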

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/test_progs.c | 21 +++++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.h |  2 ++
 2 files changed, 23 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 54fa5fa688ce..b521ce366381 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -121,6 +121,24 @@ static void reset_affinity() {
 	}
 }
 
+static void save_netns(void)
+{
+	env.saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
+	if (env.saved_netns_fd == -1) {
+		perror("open(/proc/self/ns/net)");
+		exit(-1);
+	}
+}
+
+static void restore_netns(void)
+{
+	if (setns(env.saved_netns_fd, CLONE_NEWNET) == -1) {
+		stdio_restore();
+		perror("setns(CLONE_NEWNET)");
+		exit(-1);
+	}
+}
+
 void test__end_subtest()
 {
 	struct prog_test_def *test = env.test;
@@ -643,6 +661,7 @@ int main(int argc, char **argv)
 		return -1;
 	}
 
+	save_netns();
 	stdio_hijack();
 	for (i = 0; i < prog_test_cnt; i++) {
 		struct prog_test_def *test = &prog_test_defs[i];
@@ -673,6 +692,7 @@ int main(int argc, char **argv)
 			test->error_cnt ? "FAIL" : "OK");
 
 		reset_affinity();
+		restore_netns();
 		if (test->need_cgroup_cleanup)
 			cleanup_cgroup_environment();
 	}
@@ -686,6 +706,7 @@ int main(int argc, char **argv)
 	free_str_set(&env.subtest_selector.blacklist);
 	free_str_set(&env.subtest_selector.whitelist);
 	free(env.subtest_selector.num_set);
+	close(env.saved_netns_fd);
 
 	return env.fail_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
 }
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index f4503c926aca..b80924603918 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -78,6 +78,8 @@ struct test_env {
 	int sub_succ_cnt; /* successful sub-tests */
 	int fail_cnt; /* total failed tests + sub-tests */
 	int skip_cnt; /* skipped tests */
+
+	int saved_netns_fd;
 };
 
 extern struct test_env env;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 08/10] bpf: selftests: tcp header options
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2020-06-26 17:55 ` [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-26 17:55 ` [PATCH bpf-next 09/10] tcp: bpf: Add TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN to bpf_setsockopt Martin KaFai Lau
  2020-06-26 17:56 ` [PATCH bpf-next 10/10] bpf: selftest: Add test for TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN Martin KaFai Lau
  9 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch adds tests for the new bpf tcp header option feature.
It tests header option writing and parsing in the 3WHS for regular
connection establishment, fastopen, and syncookie.  In syncookie,
the passive side's bpf prog asks the active side to resend
its bpf header option by setting a RESEND bit in the outgoing SYNACK,
as sketched below.
handle_active_estab() and write_nodata_opt() have some details on the
limitations.
handle_passive_estab() has comments on fastopen.
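
Schematically, the syncookie case is (roughly; see handle_active_estab()
and the want_cookie handling in the bpf prog for the exact rules):

	active                      passive (syncookie, stateless)
	SYN + bpf opt       ----->  opt cannot be remembered
	                    <-----  SYNACK + bpf opt {RESEND}
	ACK + resent opt    ----->  opt parsed when the 3WHS completes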

It also tests header writing and parsing in the FIN packet.

The "max_delack_ms" in bpf_test_option will be used in a later test.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../bpf/prog_tests/tcp_hdr_options.c          | 520 ++++++++++++++
 .../bpf/progs/test_tcp_hdr_options.c          | 674 ++++++++++++++++++
 .../selftests/bpf/test_tcp_hdr_options.h      |  34 +
 3 files changed, 1228 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/test_tcp_hdr_options.h

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
new file mode 100644
index 000000000000..f8daf36783f3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
@@ -0,0 +1,520 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/socket.h>
+#include <linux/compiler.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_hdr_options.h"
+#include "test_tcp_hdr_options.skel.h"
+
+#define LO_ADDR6 "::eB9F"
+#define CG_NAME "/tcpbpf-hdr-opt-test"
+
+struct bpf_test_option exp_passive_estab_in;
+struct bpf_test_option exp_active_estab_in;
+struct bpf_test_option exp_passive_fin_in;
+struct bpf_test_option exp_active_fin_in;
+struct hdr_stg exp_passive_hdr_stg;
+struct hdr_stg exp_active_hdr_stg = { .active = true, };
+
+static struct test_tcp_hdr_options *skel;
+static int lport_linum_map_fd = -1;
+static int hdr_stg_map_fd = -1;
+static int cg_fd = -1;
+static __u32 duration;
+
+struct sk_fds {
+	int srv_fd;
+	int passive_fd;
+	int active_fd;
+	int passive_lport;
+	int active_lport;
+};
+
+static int add_lo_addr(int family, const char *addr_str)
+{
+	char ip_addr_cmd[256];
+	int cmdlen;
+
+	if (family == AF_INET6)
+		cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd),
+				  "ip -6 addr add %s/128 dev lo scope host",
+				  addr_str);
+	else
+		cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd),
+				  "ip addr add %s/32 dev lo",
+				  addr_str);
+
+	if (CHECK(cmdlen >= sizeof(ip_addr_cmd), "compile ip cmd",
+		  "failed to add host addr %s to lo. ip cmdlen is too long\n",
+		  addr_str))
+		return -1;
+
+	if (CHECK(system(ip_addr_cmd), "run ip cmd",
+		  "failed to add host addr %s to lo\n", addr_str))
+		return -1;
+
+	return 0;
+}
+
+static int create_netns(void)
+{
+	if (CHECK(unshare(CLONE_NEWNET), "create netns",
+		  "unshare(CLONE_NEWNET): %s (%d)",
+		  strerror(errno), errno))
+		return -1;
+
+	if (CHECK(system("ip link set dev lo up"), "run ip cmd",
+		  "failed to bring lo link up\n"))
+		return -1;
+
+	if (add_lo_addr(AF_INET6, LO_ADDR6))
+		return -1;
+
+	return 0;
+}
+
+static int write_sysctl(const char *sysctl, const char *value)
+{
+	int fd, err, len;
+
+	fd = open(sysctl, O_WRONLY);
+	if (CHECK(fd == -1, "open sysctl", "open(%s): %s (%d)\n",
+		  sysctl, strerror(errno), errno))
+		return -1;
+
+	len = strlen(value);
+	err = write(fd, value, len);
+	close(fd);
+	if (CHECK(err != len, "write sysctl",
+		  "write(%s, %s): err:%d %s (%d)\n",
+		  sysctl, value, err, strerror(errno), errno))
+		return -1;
+
+	return 0;
+}
+
+static void print_hdr_stg(const struct hdr_stg *hdr_stg, const char *prefix)
+{
+	fprintf(stderr, "%s{active:%u, resend_syn:%u, syncookie:%u, fastopen:%u}\n",
+		prefix ? : "", hdr_stg->active, hdr_stg->resend_syn,
+		hdr_stg->syncookie, hdr_stg->fastopen);
+}
+
+static void print_option(const struct bpf_test_option *opt, const char *prefix)
+{
+	fprintf(stderr, "%s{flags:0x%x, max_delack_ms:%u, magic:0x%x}\n",
+		prefix ? : "", opt->flags, opt->max_delack_ms, opt->magic);
+}
+
+static void sk_fds_close(struct sk_fds *sk_fds)
+{
+	close(sk_fds->srv_fd);
+	close(sk_fds->passive_fd);
+	close(sk_fds->active_fd);
+}
+
+static int sk_fds_connect(struct sk_fds *sk_fds, bool fast_open)
+{
+	const char fast[] = "FAST!!!";
+	struct sockaddr_storage addr;
+	struct sockaddr_in *addr_in;
+	socklen_t len;
+
+	sk_fds->srv_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
+	if (CHECK(sk_fds->srv_fd == -1, "start_server", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error;
+
+	if (fast_open)
+		sk_fds->active_fd = fastopen_connect(sk_fds->srv_fd, fast,
+						     sizeof(fast), 0);
+	else
+		sk_fds->active_fd = connect_to_fd(sk_fds->srv_fd, 0);
+
+	if (CHECK_FAIL(sk_fds->active_fd == -1)) {
+		close(sk_fds->srv_fd);
+		goto error;
+	}
+
+	addr_in = (struct sockaddr_in *)&addr;
+	len = sizeof(addr);
+	if (CHECK(getsockname(sk_fds->srv_fd, (struct sockaddr *)&addr,
+			      &len), "getsockname(srv_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+	sk_fds->passive_lport = ntohs(addr_in->sin_port);
+
+	len = sizeof(addr);
+	if (CHECK(getsockname(sk_fds->active_fd, (struct sockaddr *)&addr,
+			      &len), "getsockname(active_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+	sk_fds->active_lport = ntohs(addr_in->sin_port);
+
+	sk_fds->passive_fd = accept(sk_fds->srv_fd, NULL, 0);
+	if (CHECK(sk_fds->passive_fd == -1, "accept(srv_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+
+	if (fast_open) {
+		char bytes_in[sizeof(fast)];
+		int ret;
+
+		ret = read(sk_fds->passive_fd, bytes_in, sizeof(bytes_in));
+		if (CHECK(ret != sizeof(fast), "read fastopen syn data",
+			  "expected=%lu actual=%d\n", sizeof(fast), ret)) {
+			close(sk_fds->passive_fd);
+			goto error_close;
+		}
+	}
+
+	return 0;
+
+error_close:
+	close(sk_fds->active_fd);
+	close(sk_fds->srv_fd);
+
+error:
+	memset(sk_fds, -1, sizeof(*sk_fds));
+	return -1;
+}
+
+static int check_hdr_opt(const struct bpf_test_option *exp,
+			 const struct bpf_test_option *act,
+			 const char *hdr_desc)
+{
+	if (CHECK(memcmp(exp, act, sizeof(*exp)),
+		  "expected-vs-actual", "unexpected %s\n", hdr_desc)) {
+		print_option(exp, "expected: ");
+		print_option(act, "  actual: ");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int check_hdr_stg(const struct hdr_stg *exp, int fd,
+			 const char *stg_desc)
+{
+	struct hdr_stg act;
+
+	if (CHECK(bpf_map_lookup_elem(hdr_stg_map_fd, &fd, &act),
+		  "map_lookup(hdr_stg_map_fd)", "%s %s (%d)\n",
+		  stg_desc, strerror(errno), errno))
+		return -1;
+
+	if (CHECK(memcmp(exp, &act, sizeof(*exp)),
+		  "expected-vs-actual", "unexpected %s\n", stg_desc)) {
+		print_hdr_stg(exp, "expected: ");
+		print_hdr_stg(&act, "  actual: ");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int check_error_linum(const struct sk_fds *sk_fds)
+{
+	unsigned int linum, nr_errors = 0;
+	int lport = -1;
+
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:<unknown> linum:%u\n",
+			linum);
+		nr_errors++;
+	}
+
+	lport = sk_fds->passive_lport;
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:passive(%d), linum:%u\n",
+			lport, linum);
+		nr_errors++;
+	}
+
+	lport = sk_fds->active_lport;
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:active(%d), linum:%u\n",
+			lport, linum);
+		nr_errors++;
+	}
+
+	return nr_errors;
+}
+
+static void check_exp_and_close_fds(struct sk_fds *sk_fds)
+{
+	int ret, abyte;
+
+	if (check_hdr_stg(&exp_passive_hdr_stg, sk_fds->passive_fd,
+			  "passive_hdr_stg"))
+		goto check_linum;
+
+	if (check_hdr_stg(&exp_active_hdr_stg, sk_fds->active_fd,
+			  "active_hdr_stg"))
+		goto check_linum;
+
+	shutdown(sk_fds->active_fd, SHUT_WR);
+	ret = read(sk_fds->passive_fd, &abyte, sizeof(abyte));
+	if (CHECK(ret != 0, "read-after-shutdown(passive_fd):",
+		  "ret:%d %s (%d)\n",
+		  ret, strerror(errno), errno))
+		goto check_linum;
+
+	shutdown(sk_fds->passive_fd, SHUT_WR);
+	ret = read(sk_fds->active_fd, &abyte, sizeof(abyte));
+	if (CHECK(ret != 0, "read-after-shutdown(active_fd):",
+		  "ret:%d %s (%d)\n",
+		  ret, strerror(errno), errno))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_passive_estab_in, &skel->bss->passive_estab_in,
+			  "passive_estab_in"))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_active_estab_in, &skel->bss->active_estab_in,
+			  "active_estab_in"))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_passive_fin_in, &skel->bss->passive_fin_in,
+			  "passive_fin_in"))
+		goto check_linum;
+
+	check_hdr_opt(&exp_active_fin_in, &skel->bss->active_fin_in,
+		      "active_fin_in");
+
+check_linum:
+	CHECK_FAIL(check_error_linum(sk_fds));
+	sk_fds_close(sk_fds);
+}
+
+static void prepare_out(void)
+{
+	skel->bss->active_syn_out = exp_passive_estab_in;
+	skel->bss->passive_synack_out = exp_active_estab_in;
+
+	skel->bss->active_fin_out = exp_passive_fin_in;
+	skel->bss->passive_fin_out = exp_active_fin_in;
+}
+
+static void reset_test(void)
+{
+	size_t optsize = sizeof(struct bpf_test_option);
+	int lport, err;
+
+	memset(&skel->bss->passive_synack_out, 0, optsize);
+	memset(&skel->bss->passive_fin_out, 0, optsize);
+
+	memset(&skel->bss->passive_estab_in, 0, optsize);
+	memset(&skel->bss->passive_fin_in, 0, optsize);
+
+	memset(&skel->bss->active_syn_out, 0, optsize);
+	memset(&skel->bss->active_fin_out, 0, optsize);
+
+	memset(&skel->bss->active_estab_in, 0, optsize);
+	memset(&skel->bss->active_fin_in, 0, optsize);
+
+	memset(&exp_passive_estab_in, 0, optsize);
+	memset(&exp_active_estab_in, 0, optsize);
+	memset(&exp_passive_fin_in, 0, optsize);
+	memset(&exp_active_fin_in, 0, optsize);
+
+	memset(&exp_passive_hdr_stg, 0, sizeof(exp_passive_hdr_stg));
+	memset(&exp_active_hdr_stg, 0, sizeof(exp_active_hdr_stg));
+	exp_active_hdr_stg.active = true;
+
+	err = bpf_map_get_next_key(lport_linum_map_fd, NULL, &lport);
+	while (!err) {
+		bpf_map_delete_elem(lport_linum_map_fd, &lport);
+		err = bpf_map_get_next_key(lport_linum_map_fd, &lport, &lport);
+	}
+}
+
+static void fastopen_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.magic = 0xfa;
+
+	exp_active_estab_in.flags = OPTION_F_MAGIC;
+	exp_active_estab_in.magic = 0xce;
+
+	exp_passive_hdr_stg.fastopen = true;
+
+	prepare_out();
+
+	/* Allow fastopen without fastopen cookie */
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_fastopen", "1543"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, true)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void syncookie_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.magic = 0xfa;
+
+	exp_active_estab_in.flags = OPTION_F_MAGIC | OPTION_F_RESEND;
+	exp_active_estab_in.magic = 0xce;
+
+	exp_passive_hdr_stg.syncookie = true;
+	exp_active_hdr_stg.resend_syn = true;
+
+	prepare_out();
+
+	/* Clear the RESEND to ensure the bpf prog can learn
+	 * want_cookie and set the RESEND by itself.
+	 */
+	skel->bss->passive_synack_out.flags &= ~OPTION_F_RESEND;
+
+	/* Enforce syncookie mode */
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "2"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void fin(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_fin_in.flags = OPTION_F_MAGIC;
+	exp_passive_fin_in.magic = 0xfa;
+
+	exp_active_fin_in.flags = OPTION_F_MAGIC;
+	exp_active_fin_in.magic = 0xce;
+
+	prepare_out();
+
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab,
+					  cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void simple_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.magic = 0xfa;
+
+	exp_active_estab_in.flags = OPTION_F_MAGIC;
+	exp_active_estab_in.magic = 0xce;
+
+	prepare_out();
+
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab,
+					  cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_exp_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+struct test {
+	const char *desc;
+	void (*run)(void);
+};
+
+#define DEF_TEST(name) { #name, name }
+static struct test tests[] = {
+	DEF_TEST(simple_estab),
+	DEF_TEST(syncookie_estab),
+	DEF_TEST(fastopen_estab),
+	DEF_TEST(fin),
+};
+
+void test_tcp_hdr_options(void)
+{
+	int i;
+
+	skel = test_tcp_hdr_options__open_and_load();
+	if (CHECK(!skel, "open and load skel", "failed"))
+		return;
+
+	hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
+	lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
+
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (CHECK_FAIL(cg_fd < 0)) {
+		test_tcp_hdr_options__destroy(skel);
+		return;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (!test__start_subtest(tests[i].desc))
+			continue;
+
+		if (create_netns())
+			break;
+
+		tests[i].run();
+
+		reset_test();
+	}
+
+	close(cg_fd);
+	test_tcp_hdr_options__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
new file mode 100644
index 000000000000..631181bfb4cc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
@@ -0,0 +1,674 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include <stddef.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "test_tcp_hdr_options.h"
+
+#define TCPHDR_FIN 0x01
+#define TCPHDR_SYN 0x02
+#define TCPHDR_RST 0x04
+#define TCPHDR_PSH 0x08
+#define TCPHDR_ACK 0x10
+#define TCPHDR_URG 0x20
+#define TCPHDR_ECE 0x40
+#define TCPHDR_CWR 0x80
+#define TCPHDR_SYNACK (TCPHDR_SYN | TCPHDR_ACK)
+
+#define CG_OK	1
+#define CG_ERR	0
+
+#ifndef SOL_TCP
+#define SOL_TCP 6
+#endif
+
+#define MAX_TCPHDR_SIZE 60
+
+#ifndef sizeof_field
+#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+const __u8 option_len[] = {
+	[OPTION_RESEND]		= 0,
+	[OPTION_MAX_DELACK_MS]	= sizeof_field(struct bpf_test_option,
+					       max_delack_ms),
+	[OPTION_MAGIC]		= sizeof_field(struct bpf_test_option, magic),
+};
+
+struct tcp_bpf_expopt {
+	__u8 kind;
+	__u8 len;
+	__u16 magic;
+	__u8 data[16];	/* <-- bpf prog can only write into this area */
+};
+
+#define TCP_BPF_EXPOPT_BASE_LEN 4
+
+struct bpf_test_option passive_synack_out = {};
+struct bpf_test_option passive_fin_out	= {};
+
+struct bpf_test_option passive_estab_in = {};
+struct bpf_test_option passive_fin_in	= {};
+
+struct bpf_test_option active_syn_out	= {};
+struct bpf_test_option active_fin_out	= {};
+
+struct bpf_test_option active_estab_in	= {};
+struct bpf_test_option active_fin_in	= {};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct hdr_stg);
+} hdr_stg_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 3);
+	__type(key, int);
+	__type(value, unsigned int);
+} lport_linum_map SEC(".maps");
+
+#define RET_CG_ERR(skops) ({			\
+	struct bpf_sock *tmp_sk = skops->sk;	\
+	__u32 linum = __LINE__;			\
+	int lport;				\
+						\
+	if (!skops->sk)	{			\
+		lport = -1;			\
+		bpf_map_update_elem(&lport_linum_map, &lport, &linum, BPF_NOEXIST); \
+		clear_hdr_cb_flags(skops);	\
+		return CG_ERR;			\
+	}					\
+									\
+	lport = skops->sk->src_port;					\
+	bpf_map_update_elem(&lport_linum_map, &lport, &linum, BPF_NOEXIST); \
+	clear_hdr_cb_flags(skops);					\
+	return CG_ERR;							\
+})
+
+static __always_inline void
+clear_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags &
+				  ~(BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG |
+				    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG));
+}
+
+static __always_inline void
+set_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags |
+				  BPF_SOCK_OPS_PARSE_HDR_OPT_CB_FLAG |
+				  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+}
+
+static __always_inline __u32 skops_avail_len(const struct bpf_sock_ops *skops)
+{
+	return skops->args[0];
+}
+
+static __always_inline __u32 skops_want_cookie(const struct bpf_sock_ops *skops)
+{
+	return skops->args[1];
+}
+
+static __always_inline __u8 skops_tcp_flags(const struct bpf_sock_ops *skops)
+{
+	return skops->skb_tcp_flags;
+}
+
+static __always_inline unsigned int tcp_hdrlen(struct tcphdr *th)
+{
+	return th->doff * 4;
+}
+
+static __always_inline __u8 option_total_len(__u8 flags)
+{
+	__u8 i, len = sizeof_field(struct bpf_test_option, flags);
+
+	if (!flags)
+		return 0;
+
+	for (i = 0; i < __NR_OPTION_FLAGS; i++) {
+		if (TEST_OPTION_FLAGS(flags, i))
+			len += option_len[i];
+	}
+
+	return len;
+}
+
+static __always_inline int write_option(struct bpf_sock_ops *skops,
+					struct bpf_test_option *opt)
+{
+	__u8 flags = opt->flags;
+	__u8 offset = 0;
+	int err;
+
+	err = bpf_store_hdr_opt(skops, 0, &opt->flags, sizeof(opt->flags), 0);
+	if (err)
+		RET_CG_ERR(skops);
+	offset += sizeof(opt->flags);
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAX_DELACK_MS)) {
+		err = bpf_store_hdr_opt(skops, offset, &opt->max_delack_ms,
+					sizeof(opt->max_delack_ms), 0);
+		if (err)
+			RET_CG_ERR(skops);
+		offset += sizeof(opt->max_delack_ms);
+	}
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAGIC)) {
+		err = bpf_store_hdr_opt(skops, offset, &opt->magic,
+					sizeof(opt->magic), 0);
+		if (err)
+			RET_CG_ERR(skops);
+		offset += sizeof(opt->magic);
+	}
+
+	return CG_OK;
+}
+
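+/* The option payload mirrors write_option(): a one byte flags field,
+ * then max_delack_ms (if OPTION_MAX_DELACK_MS is set), then magic
+ * (if OPTION_MAGIC is set).
+ */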
+static __always_inline int parse_option(struct bpf_test_option *opt,
+					__u8 *start, __u8 *hdr_end)
+{
+	__u8 flags;
+
+	if (start + sizeof(opt->flags) > hdr_end)
+		return -EINVAL;
+
+	flags = *start;
+	opt->flags = flags;
+	start += sizeof(opt->flags);
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAX_DELACK_MS)) {
+		if (start + sizeof(opt->max_delack_ms) > hdr_end)
+			return -EINVAL;
+		opt->max_delack_ms = *(typeof(opt->max_delack_ms) *)start;
+		start += sizeof(opt->max_delack_ms);
+	}
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAGIC)) {
+		if (start + sizeof(opt->magic) > hdr_end)
+			return -EINVAL;
+		opt->magic = *(typeof(opt->magic) *)start;
+		start += sizeof(opt->magic);
+	}
+
+	return 0;
+}
+
+static __always_inline int synack_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 avail = skops_avail_len(skops);
+	__u8 exp_kind, flags, optlen;
+	int err;
+
+	if (!passive_synack_out.flags)
+		return CG_OK;
+
+	err = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_HDR_OPT,
+			     &exp_kind, sizeof(exp_kind));
+
+	/* ENOSPC:
+	 * A bpf header option was found, but there is not enough
+	 * space to copy everything because we intentionally
+	 * read only the "kind" byte.
+	 *
+	 * All we want to learn here is whether the SYN pkt contains
+	 * the bpf header option before we proceed to write
+	 * the options in the outgoing SYNACK pkt.
+	 */
+	if (err != -ENOSPC)
+		return CG_OK;
+
+	flags = passive_synack_out.flags;
+	if (skops_want_cookie(skops))
+		SET_OPTION_FLAGS(flags, OPTION_RESEND);
+
+	optlen = option_total_len(flags);
+	if (optlen > avail)
+		RET_CG_ERR(skops);
+
+	if (optlen && bpf_reserve_hdr_opt(skops, optlen, 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_synack_opt(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option opt;
+	__u8 exp_kind;
+	int err;
+
+	if (!passive_synack_out.flags)
+		/* We should not even be called since no header
+		 * space has been reserved.
+		 */
+		RET_CG_ERR(skops);
+
+	err = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_HDR_OPT,
+			     &exp_kind, sizeof(exp_kind));
+
+	if (err != -ENOSPC)
+		RET_CG_ERR(skops);
+
+	opt = passive_synack_out;
+	if (skops_want_cookie(skops))
+		SET_OPTION_FLAGS(opt.flags, OPTION_RESEND);
+
+	return write_option(skops, &opt);
+}
+
+static __always_inline int syn_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 avail = skops_avail_len(skops);
+	__u8 optlen;
+
+	if (!active_syn_out.flags)
+		return CG_OK;
+
+	optlen = option_total_len(active_syn_out.flags);
+	if (optlen > avail)
+		RET_CG_ERR(skops);
+
+	if (optlen && bpf_reserve_hdr_opt(skops, optlen, 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_syn_opt(struct bpf_sock_ops *skops)
+{
+	if (!active_syn_out.flags)
+		return CG_OK;
+
+	return write_option(skops, &active_syn_out);
+}
+
+static __always_inline int fin_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 avail = skops_avail_len(skops);
+	struct bpf_test_option *opt;
+	struct hdr_stg *hdr_stg;
+	__u8 optlen;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (hdr_stg->active)
+		opt = &active_fin_out;
+	else
+		opt = &passive_fin_out;
+
+	optlen = option_total_len(opt->flags);
+	if (optlen > avail)
+		RET_CG_ERR(skops);
+
+	if (optlen && bpf_reserve_hdr_opt(skops, optlen, 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_fin_opt(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option *opt;
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	return write_option(skops, hdr_stg->active ?
+			    &active_fin_out : &passive_fin_out);
+}
+
+static __always_inline int nodata_opt_len(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (hdr_stg->resend_syn)
+		return syn_opt_len(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_nodata_opt(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (hdr_stg->resend_syn) {
+		int ret = write_syn_opt(skops);
+
+		/* The passive server side is in syncookie mode and
+		 * asks us (the active connect side) to resend
+		 * the header option.
+		 *
+		 * We resend it here in the last ack of the 3WHS and
+		 * then clear the cb_flags so that we will not
+		 * be called to write the options again.
+		 * This strategy is best effort only
+		 * because this ack could be lost.
+		 *
+		 * In practice, the active side bpf prog could also
+		 * choose to keep sending the options until the
+		 * first data/ack packet is received from the
+		 * passive side.
+		 */
+		clear_hdr_cb_flags(skops);
+
+		return ret;
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int data_opt_len(struct bpf_sock_ops *skops)
+{
+	return CG_OK;
+}
+
+static __always_inline int write_data_opt(struct bpf_sock_ops *skops)
+{
+	return CG_OK;
+}
+
+static __always_inline int current_mss_opt_len(struct bpf_sock_ops *skops)
+{
+	/* Reserve maximum that may be needed */
+	if (bpf_reserve_hdr_opt(skops, sizeof(struct bpf_test_option), 0))
+		RET_CG_ERR(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int handle_hdr_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+
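+	/* Check SYNACK before SYN: TCPHDR_SYNACK is (SYN | ACK), so a
+	 * plain "& TCPHDR_SYN" test would also match a SYNACK.
+	 */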
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		return synack_opt_len(skops);
+
+	if (tcp_flags & TCPHDR_SYN)
+		return syn_opt_len(skops);
+
+	if (tcp_flags & TCPHDR_FIN)
+		return fin_opt_len(skops);
+
+	if (!skops->skb_data)
+		/* The kernel is calculating the MSS */
+		return current_mss_opt_len(skops);
+
+	if (skops->skb_len)
+		return data_opt_len(skops);
+
+	return nodata_opt_len(skops);
+}
+
+static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+	struct tcphdr *th;
+
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		return write_synack_opt(skops);
+
+	if (tcp_flags & TCPHDR_SYN)
+		return write_syn_opt(skops);
+
+	if (tcp_flags & TCPHDR_FIN)
+		return write_fin_opt(skops);
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	if (skops->skb_len > tcp_hdrlen(th))
+		return write_data_opt(skops);
+
+	return write_nodata_opt(skops);
+}
+
+static __always_inline int handle_active_estab(struct bpf_sock_ops *skops)
+{
+	__u8 bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
+	struct hdr_stg *stg, init_stg = {
+		.active = true,
+	};
+	__u8 *start;
+	int err;
+
+	if (!bpf_hdr_opt_off)
+		goto update_sk_stg;
+
+	/* Bound the "start" reg->umax_value in the verifier */
+	if (bpf_hdr_opt_off >= MAX_TCPHDR_SIZE)
+		RET_CG_ERR(skops);
+
+	start = skops->skb_data;
+	start += bpf_hdr_opt_off;
+
+	if (start + TCP_BPF_EXPOPT_BASE_LEN > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	start += TCP_BPF_EXPOPT_BASE_LEN;
+	err = parse_option(&active_estab_in, start, skops->skb_data_end);
+
+	if (err)
+		RET_CG_ERR(skops);
+
+update_sk_stg:
+	init_stg.resend_syn = TEST_OPTION_FLAGS(active_estab_in.flags,
+						OPTION_RESEND);
+	if (!skops->sk || !bpf_sk_storage_get(&hdr_stg_map, skops->sk,
+					      &init_stg,
+					      BPF_SK_STORAGE_GET_F_CREATE))
+		RET_CG_ERR(skops);
+
+	if (!init_stg.resend_syn && !active_fin_out.flags)
+		/* No options will be written from now on */
+		clear_hdr_cb_flags(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops)
+{
+	struct tcp_bpf_expopt expopt;
+	struct hdr_stg *hdr_stg;
+	bool syncookie = false;
+	__u8 *start, *end, len;
+	struct tcphdr *th;
+	int err;
+
+	err = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_HDR_OPT, &expopt,
+			     sizeof(expopt));
+	/* ENOMSG: no bpf hdr opt in the saved syn pkt
+	 * ENOENT: no saved syn (syncookie mode)
+	 */
+	if (err && err != -ENOMSG && err != -ENOENT)
+		RET_CG_ERR(skops);
+
+	if (err) {
+		__u8 bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
+
+		/* No bpf hdr option in SYN.  We were in syncookie
+		 * mode, so try to find it in ACK.
+		 */
+
+		/* No bpf hdr option in ACK either */
+		if (!bpf_hdr_opt_off)
+			goto update_sk_stg;
+
+		if (bpf_hdr_opt_off >= MAX_TCPHDR_SIZE)
+			RET_CG_ERR(skops);
+
+		start = skops->skb_data;
+		start += bpf_hdr_opt_off;
+
+		if (start + TCP_BPF_EXPOPT_BASE_LEN > skops->skb_data_end)
+			RET_CG_ERR(skops);
+
+		start += TCP_BPF_EXPOPT_BASE_LEN;
+		end = skops->skb_data_end;
+		syncookie = true;
+	} else {
+		len = expopt.len - TCP_BPF_EXPOPT_BASE_LEN;
+		start = expopt.data;
+		end = expopt.data + len;
+	}
+
+	if (parse_option(&passive_estab_in, start, end))
+		RET_CG_ERR(skops);
+
+update_sk_stg:
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+	hdr_stg->syncookie = syncookie;
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	/* Fastopen */
+	if (skops->skb_len > tcp_hdrlen(th)) {
+		/* Cannot clear cb_flags to stop the write_hdr cb:
+		 * the synack has not been sent yet for fast open.
+		 * Even if it had been, the synack may need
+		 * to be retransmitted.
+		 *
+		 * The cb_flags will be cleared in
+		 * handle_parse_hdr() later.
+		 */
+		hdr_stg->fastopen = true;
+		return CG_OK;
+	}
+
+	if (!passive_fin_out.flags)
+		/* No options will be written from now on */
+		clear_hdr_cb_flags(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int handle_parse_hdr(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+	struct tcphdr *th;
+
+	if (!skops->sk)
+		RET_CG_ERR(skops);
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(skops);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(skops);
+
+	if (th->fin) {
+		__u32 skb_bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
+		struct bpf_test_option *fin_opt;
+		__u8 *kind_start;
+
+		if (hdr_stg->active)
+			fin_opt = &active_fin_in;
+		else
+			fin_opt = &passive_fin_in;
+
+		if (!skb_bpf_hdr_opt_off)
+			return CG_OK;
+
+		if (skb_bpf_hdr_opt_off >= MAX_TCPHDR_SIZE)
+			RET_CG_ERR(skops);
+
+		kind_start = skops->skb_data;
+		kind_start += skb_bpf_hdr_opt_off;
+		if (kind_start + TCP_BPF_EXPOPT_BASE_LEN > skops->skb_data_end)
+			RET_CG_ERR(skops);
+
+		kind_start += TCP_BPF_EXPOPT_BASE_LEN;
+		if (parse_option(fin_opt, kind_start, skops->skb_data_end))
+			RET_CG_ERR(skops);
+
+		return CG_OK;
+	}
+
+	return CG_OK;
+}
+
+SEC("sockops/estab")
+int estab(struct bpf_sock_ops *skops)
+{
+	int true_val = 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+		bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
+			       &true_val, sizeof(true_val));
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
+		return handle_parse_hdr(skops);
+	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
+		return handle_hdr_opt_len(skops);
+	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
+		return handle_write_hdr_opt(skops);
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		return handle_passive_estab(skops);
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		return handle_active_estab(skops);
+	}
+
+	return CG_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
new file mode 100644
index 000000000000..d3a4593c81c8
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _TEST_TCP_HDR_OPTIONS_H
+#define _TEST_TCP_HDR_OPTIONS_H
+
+struct bpf_test_option {
+	__u8 flags;
+	__u8 max_delack_ms;
+	__u8 magic;
+};
+
+enum {
+	OPTION_RESEND,
+	OPTION_MAX_DELACK_MS,
+	OPTION_MAGIC,
+	__NR_OPTION_FLAGS,
+};
+
+#define OPTION_F_RESEND		(1 << OPTION_RESEND)
+#define OPTION_F_MAX_DELACK_MS	(1 << OPTION_MAX_DELACK_MS)
+#define OPTION_F_MAGIC		(1 << OPTION_MAGIC)
+
+#define TEST_OPTION_FLAGS(flags, option) (1 & ((flags) >> (option)))
+#define SET_OPTION_FLAGS(flags, option)	((flags) |= (1 << (option)))
+
+/* Store in bpf_sk_storage */
+struct hdr_stg {
+	bool active;
+	bool resend_syn; /* active side only */
+	bool syncookie;  /* passive side only */
+	bool fastopen;	/* passive side only */
+};
+
+#endif
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 09/10] tcp: bpf: Add TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN to bpf_setsockopt
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2020-06-26 17:55 ` [PATCH bpf-next 08/10] bpf: selftests: tcp header options Martin KaFai Lau
@ 2020-06-26 17:55 ` Martin KaFai Lau
  2020-06-27 17:30   ` Eric Dumazet
  2020-06-26 17:56 ` [PATCH bpf-next 10/10] bpf: selftest: Add test for TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN Martin KaFai Lau
  9 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This change is mostly adapted from an internal patch, converting it
from a sysctl config to the bpf_setsockopt setup.

The bpf_prog can set the max delayed ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delayed ack can be
communicated to its peer through a bpf header option.  The receiving
peer can then use this max delayed ack to set a potentially lower RTO
by using bpf_setsockopt(TCP_BPF_RTO_MIN).  A later patch will use it
like this in a test as an example.
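
For illustration, a sockops bpf prog could use the two new optnames
roughly as follows (a minimal sketch; the 5ms/10ms values are made-up
example numbers and error checking is omitted):

	/* both optnames take an int value in usecs */
	int delack_max_us = 5000;	/* our own max delayed ack */
	int rto_min_us = 10000;		/* e.g. the peer's advertised max delayed ack */

	bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX,
		       &delack_max_us, sizeof(delack_max_us));
	bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN,
		       &rto_min_us, sizeof(rto_min_us));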

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/inet_connection_sock.h |  2 ++
 include/net/tcp.h                  |  2 +-
 include/uapi/linux/bpf.h           |  2 ++
 net/core/filter.c                  | 15 +++++++++++++++
 net/ipv4/tcp.c                     |  4 ++++
 net/ipv4/tcp_output.c              |  2 ++
 tools/include/uapi/linux/bpf.h     |  2 ++
 7 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index e5b388f5fa20..43d45864e5f0 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -93,6 +93,8 @@ struct inet_connection_sock {
  	struct timer_list	  icsk_retransmit_timer;
  	struct timer_list	  icsk_delack_timer;
 	__u32			  icsk_rto;
+	__u32                     icsk_rto_min;
+	__u32                     icsk_delack_max;
 	__u32			  icsk_pmtu_cookie;
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e93ef2d324f3..8a682d678971 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -703,7 +703,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 static inline u32 tcp_rto_min(struct sock *sk)
 {
 	const struct dst_entry *dst = __sk_dst_get(sk);
-	u32 rto_min = TCP_RTO_MIN;
+	u32 rto_min = inet_csk(sk)->icsk_rto_min;
 
 	if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
 		rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 479b83d05811..2ccc81548eef 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4299,6 +4299,8 @@ enum {
 	 *	    is not saved by TCP_SAVE_SYN.
 	 */
 	TCP_BPF_SYN_HDR_OPT	= 1003,
+	TCP_BPF_DELACK_MAX	= 1004, /* Max delay ack in usecs */
+	TCP_BPF_RTO_MIN		= 1005, /* Min RTO in usecs */
 };
 
 struct bpf_perf_event_value {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5784f1bede2f..4b80934e6876 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4431,6 +4431,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 			struct tcp_sock *tp = tcp_sk(sk);
+			unsigned long timeout;
 
 			if (optlen != sizeof(int))
 				return -EINVAL;
@@ -4452,6 +4453,20 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 					tp->snd_ssthresh = val;
 				}
 				break;
+			case TCP_BPF_DELACK_MAX:
+				timeout = usecs_to_jiffies(val);
+				if (timeout > TCP_DELACK_MAX ||
+				    timeout < TCP_TIMEOUT_MIN)
+					return -EINVAL;
+				inet_csk(sk)->icsk_delack_max = timeout;
+				break;
+			case TCP_BPF_RTO_MIN:
+				timeout = usecs_to_jiffies(val);
+				if (timeout > TCP_RTO_MIN ||
+				    timeout < TCP_TIMEOUT_MIN)
+					return -EINVAL;
+				inet_csk(sk)->icsk_rto_min = timeout;
+				break;
 			case TCP_SAVE_SYN:
 				if (val < 0 || val > 1)
 					ret = -EINVAL;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 60093a211f4d..02be3e2a2fdb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,8 @@ void tcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&tp->tsorted_sent_queue);
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 	minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
 
@@ -2685,6 +2687,8 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_backoff = 0;
 	icsk->icsk_probes_out = 0;
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd = TCP_INIT_CWND;
 	tp->snd_cwnd_cnt = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a78a29980e1f..db872a2a01c6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3899,6 +3899,8 @@ void tcp_send_delayed_ack(struct sock *sk)
 		ato = min(ato, max_ato);
 	}
 
+	ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max);
+
 	/* Stay within the limit we were given */
 	timeout = jiffies + ato;
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 479b83d05811..2ccc81548eef 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4299,6 +4299,8 @@ enum {
 	 *	    is not saved by TCP_SAVE_SYN.
 	 */
 	TCP_BPF_SYN_HDR_OPT	= 1003,
+	TCP_BPF_DELACK_MAX	= 1004, /* Max delay ack in usecs */
+	TCP_BPF_RTO_MIN		= 1005, /* Min RTO in usecs */
 };
 
 struct bpf_perf_event_value {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 10/10] bpf: selftest: Add test for TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN
  2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
                   ` (8 preceding siblings ...)
  2020-06-26 17:55 ` [PATCH bpf-next 09/10] tcp: bpf: Add TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN to bpf_setsockopt Martin KaFai Lau
@ 2020-06-26 17:56 ` Martin KaFai Lau
  9 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-26 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch tests a bpf prog that parses/writes a max_delack_ms bpf header
option and also calls bpf_setsockopt to set TCP_BPF_DELACK_MAX and
TCP_BPF_RTO_MIN accordingly.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../bpf/prog_tests/tcp_hdr_options.c          |  6 ++--
 .../bpf/progs/test_tcp_hdr_options.c          | 34 +++++++++++++++++++
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
index f8daf36783f3..5a58f60d2889 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
@@ -447,11 +447,13 @@ static void simple_estab(void)
 	struct bpf_link *link;
 	struct sk_fds sk_fds;
 
-	exp_passive_estab_in.flags = OPTION_F_MAGIC;
+	exp_passive_estab_in.flags = OPTION_F_MAGIC | OPTION_F_MAX_DELACK_MS;
 	exp_passive_estab_in.magic = 0xfa;
+	exp_passive_estab_in.max_delack_ms = 11;
 
-	exp_active_estab_in.flags = OPTION_F_MAGIC;
+	exp_active_estab_in.flags = OPTION_F_MAGIC | OPTION_F_MAX_DELACK_MS;
 	exp_active_estab_in.magic = 0xce;
+	exp_active_estab_in.max_delack_ms = 22;
 
 	prepare_out();
 
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
index 631181bfb4cc..eb3b3c2a21f9 100644
--- a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
+++ b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
@@ -465,6 +465,24 @@ static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops)
 	return write_nodata_opt(skops);
 }
 
+static __always_inline int set_delack_max(struct bpf_sock_ops *skops,
+					  __u8 max_delack_ms)
+{
+	__u32 max_delack_us = max_delack_ms * 1000;
+
+	return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX,
+			      &max_delack_us, sizeof(max_delack_us));
+}
+
+static __always_inline int set_rto_min(struct bpf_sock_ops *skops,
+				       __u8 peer_max_delack_ms)
+{
+	__u32 min_rto_us = peer_max_delack_ms * 1000;
+
+	return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN, &min_rto_us,
+			      sizeof(min_rto_us));
+}
+
 static __always_inline int handle_active_estab(struct bpf_sock_ops *skops)
 {
 	__u8 bpf_hdr_opt_off = skops->skb_bpf_hdr_opt_off;
@@ -505,6 +523,14 @@ static __always_inline int handle_active_estab(struct bpf_sock_ops *skops)
 		/* No options will be written from now on */
 		clear_hdr_cb_flags(skops);
 
+	if (active_syn_out.max_delack_ms &&
+	    set_delack_max(skops, active_syn_out.max_delack_ms))
+		RET_CG_ERR(skops);
+
+	if (active_estab_in.max_delack_ms &&
+	    set_rto_min(skops, active_estab_in.max_delack_ms))
+		RET_CG_ERR(skops);
+
 	return CG_OK;
 }
 
@@ -590,6 +616,14 @@ static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops)
 		/* No options will be written from now on */
 		clear_hdr_cb_flags(skops);
 
+	if (passive_synack_out.max_delack_ms &&
+	    set_delack_max(skops, passive_synack_out.max_delack_ms))
+		RET_CG_ERR(skops);
+
+	if (passive_estab_in.max_delack_ms &&
+	    set_rto_min(skops, passive_estab_in.max_delack_ms))
+		RET_CG_ERR(skops);
+
 	return CG_OK;
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test
  2020-06-26 17:55 ` [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test Martin KaFai Lau
@ 2020-06-26 22:45   ` Andrii Nakryiko
  2020-06-27  0:23     ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2020-06-26 22:45 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team, Lawrence Brakmo, Neal Cardwell, Networking,
	Yuchung Cheng

On Fri, Jun 26, 2020 at 10:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> It is common for networking tests to create their own netns and make
> their own settings under this new netns (e.g. changing a tcp sysctl).
> If the test forgets to restore the original netns, it would affect the
> result of other tests.
>
> This patch saves the original netns at the beginning and then restores it
> after every test.  Since the restore "setns()" is not expensive, it does it
> on all tests without tracking if a test has created a new netns or not.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  tools/testing/selftests/bpf/test_progs.c | 21 +++++++++++++++++++++
>  tools/testing/selftests/bpf/test_progs.h |  2 ++
>  2 files changed, 23 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> index 54fa5fa688ce..b521ce366381 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -121,6 +121,24 @@ static void reset_affinity() {
>         }
>  }
>
> +static void save_netns(void)
> +{
> +       env.saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> +       if (env.saved_netns_fd == -1) {
> +               perror("open(/proc/self/ns/net)");
> +               exit(-1);
> +       }
> +}
> +
> +static void restore_netns(void)
> +{
> +       if (setns(env.saved_netns_fd, CLONE_NEWNET) == -1) {
> +               stdio_restore();
> +               perror("setns(CLONE_NEWNS)");
> +               exit(-1);
> +       }
> +}
> +
>  void test__end_subtest()
>  {
>         struct prog_test_def *test = env.test;
> @@ -643,6 +661,7 @@ int main(int argc, char **argv)
>                 return -1;
>         }
>
> +       save_netns();

you should probably do this also after each sub-test in test__end_subtest()?

Otherwise everything looks good.

>         stdio_hijack();
>         for (i = 0; i < prog_test_cnt; i++) {
>                 struct prog_test_def *test = &prog_test_defs[i];
> @@ -673,6 +692,7 @@ int main(int argc, char **argv)
>                         test->error_cnt ? "FAIL" : "OK");
>
>                 reset_affinity();
> +               restore_netns();
>                 if (test->need_cgroup_cleanup)
>                         cleanup_cgroup_environment();
>         }
> @@ -686,6 +706,7 @@ int main(int argc, char **argv)
>         free_str_set(&env.subtest_selector.blacklist);
>         free_str_set(&env.subtest_selector.whitelist);
>         free(env.subtest_selector.num_set);
> +       close(env.saved_netns_fd);
>
>         return env.fail_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
>  }
> diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
> index f4503c926aca..b80924603918 100644
> --- a/tools/testing/selftests/bpf/test_progs.h
> +++ b/tools/testing/selftests/bpf/test_progs.h
> @@ -78,6 +78,8 @@ struct test_env {
>         int sub_succ_cnt; /* successful sub-tests */
>         int fail_cnt; /* total failed tests + sub-tests */
>         int skip_cnt; /* skipped tests */
> +
> +       int saved_netns_fd;
>  };
>
>  extern struct test_env env;
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test
  2020-06-26 22:45   ` Andrii Nakryiko
@ 2020-06-27  0:23     ` Martin KaFai Lau
  2020-06-27 20:31       ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-27  0:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team, Lawrence Brakmo, Neal Cardwell, Networking,
	Yuchung Cheng

On Fri, Jun 26, 2020 at 03:45:04PM -0700, Andrii Nakryiko wrote:
> On Fri, Jun 26, 2020 at 10:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > It is common for networking tests to create their own netns and make
> > their own settings under this new netns (e.g. changing a tcp sysctl).
> > If the test forgets to restore the original netns, it would affect the
> > result of other tests.
> >
> > This patch saves the original netns at the beginning and then restores it
> > after every test.  Since the restore "setns()" is not expensive, it does it
> > on all tests without tracking if a test has created a new netns or not.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  tools/testing/selftests/bpf/test_progs.c | 21 +++++++++++++++++++++
> >  tools/testing/selftests/bpf/test_progs.h |  2 ++
> >  2 files changed, 23 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> > index 54fa5fa688ce..b521ce366381 100644
> > --- a/tools/testing/selftests/bpf/test_progs.c
> > +++ b/tools/testing/selftests/bpf/test_progs.c
> > @@ -121,6 +121,24 @@ static void reset_affinity() {
> >         }
> >  }
> >
> > +static void save_netns(void)
> > +{
> > +       env.saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> > +       if (env.saved_netns_fd == -1) {
> > +               perror("open(/proc/self/ns/net)");
> > +               exit(-1);
> > +       }
> > +}
> > +
> > +static void restore_netns(void)
> > +{
> > +       if (setns(env.saved_netns_fd, CLONE_NEWNET) == -1) {
> > +               stdio_restore();
> > +               perror("setns(CLONE_NEWNS)");
> > +               exit(-1);
> > +       }
> > +}
> > +
> >  void test__end_subtest()
> >  {
> >         struct prog_test_def *test = env.test;
> > @@ -643,6 +661,7 @@ int main(int argc, char **argv)
> >                 return -1;
> >         }
> >
> > +       save_netns();
> 
> you should probably do this also after each sub-test in test__end_subtest()?
You mean restore_netns()?

It is a tough call.
Some tests may only want to create a netns at the beginning for all the
subtests to use (e.g. sk_assign.c).  restore_netns() after each subtest
may catch the tester by surprise when the netns is not under its full
control while its own test is running.

I think an individual test should already manage the netns properly
within its subtests if it wants a correct test result.  It can unshare
at the beginning of each subtest to get a clean state (e.g. in patch 8).
test_progs.c only ensures a config made by an earlier test does
not affect the following tests.
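
For example, each subtest can start from a clean netns with something
like the following (a minimal sketch along the lines of patch 8; error
handling is trimmed and unshare() needs _GNU_SOURCE and <sched.h>):

	static int create_netns(void)
	{
		if (unshare(CLONE_NEWNET)) {
			perror("unshare(CLONE_NEWNET)");
			return -1;
		}

		/* bring up lo in the fresh netns */
		if (system("ip link set dev lo up")) {
			perror("ip link set dev lo up");
			return -1;
		}

		return 0;
	}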

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option
  2020-06-26 17:55 ` [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option Martin KaFai Lau
@ 2020-06-27 16:44   ` Eric Dumazet
  2020-06-27 17:17   ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-27 16:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Fri, Jun 26, 2020 at 10:55 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch adds logic to parse experimental kind 254 with 16 bit magic
> 0xeB9F.  A later patch will allow a bpf prog to write and parse data
> under this experimental kind and magic.
>
> A one byte bpf_hdr_opt_off is added to tcp_skb_cb by using an existing
> 4 byte hole.  It is only used in rx.  It stores the offset to the
> bpf experimental option and will be made available to the BPF prog
> in a later patch.  This offset is also stored in the saved_syn.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/net/request_sock.h | 1 +
>  include/net/tcp.h          | 3 +++
>  net/ipv4/tcp_input.c       | 6 ++++++
>  net/ipv4/tcp_ipv4.c        | 1 +
>  net/ipv6/tcp_ipv6.c        | 1 +
>  5 files changed, 12 insertions(+)
>
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index d77237ec9fb4..55297286c066 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -43,6 +43,7 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
>
>  struct saved_syn {
>         u32 network_hdrlen;
> +       u32 bpf_hdr_opt_off;
>         u8 data[];
>  };
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index eab1c7d0facb..07a9dfe35242 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -191,6 +191,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>   */
>  #define TCPOPT_FASTOPEN_MAGIC  0xF989
>  #define TCPOPT_SMC_MAGIC       0xE2D4C3D9
> +#define TCPOPT_BPF_MAGIC       0xEB9F
>
>  /*
>   *     TCP option lengths
> @@ -204,6 +205,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  #define TCPOLEN_FASTOPEN_BASE  2
>  #define TCPOLEN_EXP_FASTOPEN_BASE  4
>  #define TCPOLEN_EXP_SMC_BASE   6
> +#define TCPOLEN_EXP_BPF_BASE   4
>
>  /* But this is what stacks really send out. */
>  #define TCPOLEN_TSTAMP_ALIGNED         12
> @@ -857,6 +859,7 @@ struct tcp_skb_cb {
>                         has_rxtstamp:1, /* SKB has a RX timestamp       */
>                         unused:5;
>         __u32           ack_seq;        /* Sequence number ACK'd        */
> +       __u8            bpf_hdr_opt_off;/* offset to bpf hdr option. rx only. */
>         union {
>                 struct {
>                         /* There is space for up to 24 bytes */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index eb0e32b2def9..640408a80b3d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3924,6 +3924,10 @@ void tcp_parse_options(const struct net *net,
>                                         tcp_parse_fastopen_option(opsize -
>                                                 TCPOLEN_EXP_FASTOPEN_BASE,
>                                                 ptr + 2, th->syn, foc, true);
> +                               else if (opsize >= TCPOLEN_EXP_BPF_BASE &&
> +                                        get_unaligned_be16(ptr) ==
> +                                        TCPOPT_BPF_MAGIC)
> +                                       TCP_SKB_CB(skb)->bpf_hdr_opt_off = (ptr - 2) - (unsigned char *)th;
>                                 else
>                                         smc_parse_options(th, opt_rx, ptr,
>                                                           opsize);
> @@ -6562,6 +6566,8 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
>                 saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
>                 if (saved_syn) {
>                         saved_syn->network_hdrlen = skb_network_header_len(skb);
> +                       saved_syn->bpf_hdr_opt_off =
> +                               TCP_SKB_CB(skb)->bpf_hdr_opt_off;
>                         memcpy(saved_syn->data, skb_network_header(skb), len);
>                         req->saved_syn = saved_syn;
>                 }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ea0df9fd7618..a3535b7fe002 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1864,6 +1864,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
>         TCP_SKB_CB(skb)->sacked  = 0;
>         TCP_SKB_CB(skb)->has_rxtstamp =
>                         skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
> +       TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
>  }
>
>  /*
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index f67d45ff00b4..8356d0562279 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1545,6 +1545,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
>         TCP_SKB_CB(skb)->sacked = 0;
>         TCP_SKB_CB(skb)->has_rxtstamp =
>                         skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
> +       TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
>  }
>
>  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option
  2020-06-26 17:55 ` [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option Martin KaFai Lau
  2020-06-27 16:44   ` Eric Dumazet
@ 2020-06-27 17:17   ` Eric Dumazet
  2020-06-28 23:44     ` Martin KaFai Lau
  2020-06-29  0:45     ` Martin KaFai Lau
  1 sibling, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-27 17:17 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Fri, Jun 26, 2020 at 10:55 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch adds logic to parse experimental kind 254 with 16 bit magic
> 0xeB9F.  A later patch will allow a bpf prog to write and parse data
> under this experimental kind and magic.
>
> A one byte bpf_hdr_opt_off is added to tcp_skb_cb by using an existing
> 4 byte hole.  It is only used in rx.  It stores the offset to the
> bpf experimental option and will be made available to the BPF prog
> in a later patch.  This offset is also stored in the saved_syn.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/net/request_sock.h | 1 +
>  include/net/tcp.h          | 3 +++
>  net/ipv4/tcp_input.c       | 6 ++++++
>  net/ipv4/tcp_ipv4.c        | 1 +
>  net/ipv6/tcp_ipv6.c        | 1 +
>  5 files changed, 12 insertions(+)
>
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index d77237ec9fb4..55297286c066 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -43,6 +43,7 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
>
>  struct saved_syn {
>         u32 network_hdrlen;
> +       u32 bpf_hdr_opt_off;
>         u8 data[];
>  };
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index eab1c7d0facb..07a9dfe35242 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -191,6 +191,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>   */
>  #define TCPOPT_FASTOPEN_MAGIC  0xF989
>  #define TCPOPT_SMC_MAGIC       0xE2D4C3D9
> +#define TCPOPT_BPF_MAGIC       0xEB9F
>
>  /*
>   *     TCP option lengths
> @@ -204,6 +205,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
>  #define TCPOLEN_FASTOPEN_BASE  2
>  #define TCPOLEN_EXP_FASTOPEN_BASE  4
>  #define TCPOLEN_EXP_SMC_BASE   6
> +#define TCPOLEN_EXP_BPF_BASE   4
>
>  /* But this is what stacks really send out. */
>  #define TCPOLEN_TSTAMP_ALIGNED         12
> @@ -857,6 +859,7 @@ struct tcp_skb_cb {
>                         has_rxtstamp:1, /* SKB has a RX timestamp       */
>                         unused:5;
>         __u32           ack_seq;        /* Sequence number ACK'd        */
> +       __u8            bpf_hdr_opt_off;/* offset to bpf hdr option. rx only. */
>         union {
>                 struct {
>                         /* There is space for up to 24 bytes */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index eb0e32b2def9..640408a80b3d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3924,6 +3924,10 @@ void tcp_parse_options(const struct net *net,
>                                         tcp_parse_fastopen_option(opsize -
>                                                 TCPOLEN_EXP_FASTOPEN_BASE,
>                                                 ptr + 2, th->syn, foc, true);
> +                               else if (opsize >= TCPOLEN_EXP_BPF_BASE &&
> +                                        get_unaligned_be16(ptr) ==
> +                                        TCPOPT_BPF_MAGIC)
> +                                       TCP_SKB_CB(skb)->bpf_hdr_opt_off = (ptr - 2) - (unsigned char *)th;
>                                 else
>                                         smc_parse_options(th, opt_rx, ptr,
>                                                           opsize);
> @@ -6562,6 +6566,8 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
>                 saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
>                 if (saved_syn) {
>                         saved_syn->network_hdrlen = skb_network_header_len(skb);
> +                       saved_syn->bpf_hdr_opt_off =
> +                               TCP_SKB_CB(skb)->bpf_hdr_opt_off;
>                         memcpy(saved_syn->data, skb_network_header(skb), len);
>                         req->saved_syn = saved_syn;
>                 }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ea0df9fd7618..a3535b7fe002 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1864,6 +1864,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
>         TCP_SKB_CB(skb)->sacked  = 0;
>         TCP_SKB_CB(skb)->has_rxtstamp =
>                         skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
> +       TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
>  }
>
>  /*
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index f67d45ff00b4..8356d0562279 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1545,6 +1545,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
>         TCP_SKB_CB(skb)->sacked = 0;
>         TCP_SKB_CB(skb)->has_rxtstamp =
>                         skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
> +       TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
>  }
>
>  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
> --
> 2.24.1
>

(Sorry for the prior empty reply, I accidentally clicked the wrong area)

It seems strange that we want to add code to the TCP stack only to cover
a limited use case (kind 254 and the 0xEB9F magic).

For something like the work Petar Penkov did (to be able to generate
SYNCOOKIES from XDP), we do not go through tcp_parse_options(), and a
BPF program would have to implement its own parsing (without having an
SKB at hand), probably calling a helper function, with no
TCP_SKB_CB(skb)->bpf_hdr_opt_off.

This patch hard-codes a specific option and will prevent anyone using
private option(s) from using this infrastructure in the future, while
still paying the extra overhead.

TCP_SKB_CB(skb) is tight; I would prefer keeping the space in it for
standard TCP stack features.

If an optional BPF program needs to re-parse the TCP options to find a
specific option, maybe the extra cost is noise (especially if this is
only done for SYN & SYNACK packets)?

Thanks

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 09/10] tcp: bpf: Add TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN to bpf_setsockopt
  2020-06-26 17:55 ` [PATCH bpf-next 09/10] tcp: bpf: Add TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN to bpf_setsockopt Martin KaFai Lau
@ 2020-06-27 17:30   ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-27 17:30 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Fri, Jun 26, 2020 at 10:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This change is mostly adapted from an internal patch, converting it
> from a sysctl config to the bpf_setsockopt setup.
>
> The bpf_prog can set the max delayed ack by using
> bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delayed ack can be
> communicated to its peer through a bpf header option.  The receiving
> peer can then use this max delayed ack to set a potentially lower RTO
> by using bpf_setsockopt(TCP_BPF_RTO_MIN).  A later patch will use it
> like this in a test as an example.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

This could be split in two patches, but no big deal.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn
  2020-06-26 17:55 ` [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
@ 2020-06-27 17:41   ` Eric Dumazet
  2020-06-30 23:24     ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-06-27 17:41 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Fri, Jun 26, 2020 at 10:55 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The total length of the saved syn packet is currently stored in
> the first 4 bytes (u32) and the actual packet data is stored after that.
>
> A later patch will also want to store an offset (bpf_hdr_opt_off) to
> a TCP header option which the bpf program will be interested in parsing.
> Instead of anonymously storing this offset into the second 4 bytes,
> this patch creates a struct for the existing saved_syn.
> It can give a readable name to the stored lengths instead of implicitly
> using the first few u32(s) to do that.
>
> The new TCP bpf header offset (bpf_hdr_opt_off) added in a later patch is
> an offset from the tcp header instead of from the network header.
> It will make the bpf programming side easier.  Thus, this patch stores
> the network header length instead of the total length of the syn
> header.  The total length can be obtained by the
> "network header len + tcp_hdrlen".  The latter patch can
> then also gets the offset to the TCP bpf header option by
> "network header len + bpf_hdr_opt_off".
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/tcp.h        | 11 ++++++++++-
>  include/net/request_sock.h |  7 ++++++-
>  net/core/filter.c          |  4 ++--
>  net/ipv4/tcp.c             |  9 +++++----
>  net/ipv4/tcp_input.c       | 12 ++++++------
>  5 files changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 3bdec31ce8f4..9d50132d95e6 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -404,7 +404,7 @@ struct tcp_sock {
>          * socket. Used to retransmit SYNACKs etc.
>          */
>         struct request_sock __rcu *fastopen_rsk;
> -       u32     *saved_syn;
> +       struct saved_syn *saved_syn;
>  };
>
>  enum tsq_enum {
> @@ -482,6 +482,15 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
>         tp->saved_syn = NULL;
>  }
>
> +static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn)
> +{
> +       const struct tcphdr *th;
> +
> +       th = (void *)saved_syn->data + saved_syn->network_hdrlen;
> +
> +       return saved_syn->network_hdrlen + __tcp_hdrlen(th);
> +}


Ah... We have a patch extending TCP_SAVE_SYN to save all headers, so
keeping the length in a proper field would be better than going back
to the TCP header to get __tcp_hdrlen(th).

I am not sure why trying to save 4 bytes in this saved_syn would matter ;)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c9fcfa4ec43f3f0d75763e2bc6773e15bd38d68f..8ecdc5f87788439c7a08d3b72f9567e6369e7c4e
100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -258,7 +258,7 @@ struct tcp_sock {
                fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
                fastopen_no_cookie:1, /* Allow send/recv SYN+data
without a cookie */
                is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-               unused:2;
+               save_syn:2;     /* Save headers of SYN packet */
        u8      nonagle     : 4,/* Disable Nagle algorithm?             */
                thin_lto    : 1,/* Use linear timeouts for thin streams */
                recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
@@ -270,7 +270,7 @@ struct tcp_sock {
                syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
                syn_fastopen_ch:1, /* Active TFO re-enabling probe */
                syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
-               save_syn:1,     /* Save headers of SYN packet */
+               unused_save_syn:1,      /* Moved above */
                is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
                syn_smc:1;      /* SYN includes SMC */
        u32     tlp_high_seq;   /* snd_nxt at the time of TLP retransmit. */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fa62baf509c8075cb7da30ee0f65059ac35c1c60..7e108de07fb4e45a994d3d75331489ad82f9deb7
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3097,7 +3097,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
                break;

        case TCP_SAVE_SYN:
-               if (val < 0 || val > 1)
+               /* 0: disable, 1: enable, 2: start from ether_header */
+               if (val < 0 || val > 2)
                        err = -EINVAL;
                else
                        tp->save_syn = val;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7ce5bad2308134954133f612ec129cf56946d9a1..5513f8aaae9f6c0303fac4d2c590ead1a6076502
100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6708,11 +6708,19 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
        if (tcp_sk(sk)->save_syn) {
                u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
                u32 *copy;
+               void *base;
+
+               if (tcp_sk(sk)->save_syn == 2) {  /* Save full header. */
+                       len += skb->network_header - skb->mac_header;
+                       base = skb_mac_header(skb);
+               } else {
+                       base = skb_network_header(skb);
+               }

                copy = kmalloc(len + sizeof(u32), GFP_ATOMIC);
                if (copy) {
                        copy[0] = len;
-                       memcpy(&copy[1], skb_network_header(skb), len);
+                       memcpy(&copy[1], base, len);
                        req->saved_syn = copy;
                }
        }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test
  2020-06-27  0:23     ` Martin KaFai Lau
@ 2020-06-27 20:31       ` Andrii Nakryiko
  2020-06-29 18:00         ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2020-06-27 20:31 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team, Lawrence Brakmo, Neal Cardwell, Networking,
	Yuchung Cheng

On Fri, Jun 26, 2020 at 5:23 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Fri, Jun 26, 2020 at 03:45:04PM -0700, Andrii Nakryiko wrote:
> > On Fri, Jun 26, 2020 at 10:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > It is common for networking tests to create their own netns and make
> > > their own settings under this new netns (e.g. changing a tcp sysctl).
> > > If the test forgets to restore the original netns, it would affect the
> > > result of other tests.
> > >
> > > This patch saves the original netns at the beginning and then restores it
> > > after every test.  Since the restore "setns()" is not expensive, it does it
> > > on all tests without tracking if a test has created a new netns or not.
> > >
> > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > ---
> > >  tools/testing/selftests/bpf/test_progs.c | 21 +++++++++++++++++++++
> > >  tools/testing/selftests/bpf/test_progs.h |  2 ++
> > >  2 files changed, 23 insertions(+)
> > >
> > > diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> > > index 54fa5fa688ce..b521ce366381 100644
> > > --- a/tools/testing/selftests/bpf/test_progs.c
> > > +++ b/tools/testing/selftests/bpf/test_progs.c
> > > @@ -121,6 +121,24 @@ static void reset_affinity() {
> > >         }
> > >  }
> > >
> > > +static void save_netns(void)
> > > +{
> > > +       env.saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> > > +       if (env.saved_netns_fd == -1) {
> > > +               perror("open(/proc/self/ns/net)");
> > > +               exit(-1);
> > > +       }
> > > +}
> > > +
> > > +static void restore_netns(void)
> > > +{
> > > +       if (setns(env.saved_netns_fd, CLONE_NEWNET) == -1) {
> > > +               stdio_restore();
> > > +               perror("setns(CLONE_NEWNS)");
> > > +               exit(-1);
> > > +       }
> > > +}
> > > +
> > >  void test__end_subtest()
> > >  {
> > >         struct prog_test_def *test = env.test;
> > > @@ -643,6 +661,7 @@ int main(int argc, char **argv)
> > >                 return -1;
> > >         }
> > >
> > > +       save_netns();
> >
> > you should probably do this also after each sub-test in test__end_subtest()?
> You mean restore_netns()?

oops, yeah :)

>
> It is a tough call.
> Some tests may only want to create a netns at the beginning for all the
> subtests to use (e.g. sk_assign.c).  restore_netns() after each subtest
> may catch the tester by surprise when the netns is not under its full
> control while its own test is running.

Wouldn't it be better to update such self-tests to setns on each
sub-test properly? It should be a simple code re-use exercise, unless
I'm missing some other implications of having to do it before each
sub-test?

The idea behind a sub-test is (at least it has been so far) that it's
independent of other sub-tests and tests, and it's only co-located
with other sub-tests for the purpose of code reuse and logical
grouping. Which is why we reset CPU affinity, for instance.

>
> I think an individual test should already manage the netns properly
> within its subtests if it wants a correct test result.  It can unshare
> at the beginning of each subtest to get a clean state (e.g. in patch 8).
> test_progs.c only ensures a config made by an earlier test does
> not affect the following tests.

It's true that it gives more flexibility for test setup, but if we go
that way, we should do it consistently for CPU affinity resetting and
whatever else we do per-subtest. I wonder what your answers would be
to the above questions. We can go either way; let's just be
consistent.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option
  2020-06-26 17:55 ` [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option Martin KaFai Lau
@ 2020-06-28 18:24   ` Alexei Starovoitov
  2020-06-29  0:34     ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Alexei Starovoitov @ 2020-06-28 18:24 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	kernel-team, Lawrence Brakmo, Neal Cardwell, netdev,
	Yuchung Cheng

On Fri, Jun 26, 2020 at 10:55:26AM -0700, Martin KaFai Lau wrote:
> 
> Parsing BPF Header Option
> ─────────────────────────
> 
> As mentioned earlier, the received SYN/SYNACK/ACK during the 3WHS
> will be available to some specific CB (e.g. the *_ESTABLISHED_CB)
> 
> For established connection, if the kernel finds a bpf header
> option (i.e. option with kind:254 and magic:0xeB9F) and the
> the "PARSE_HDR_OPT_CB_FLAG" flag is set,  the
> bpf prog will be called in the "BPF_SOCK_OPS_PARSE_HDR_OPT_CB" op.
> The received skb will be available through sock_ops->skb_data
> and the bpf header option offset will also be specified
> in sock_ops->skb_bpf_hdr_opt_off.

TCP noob question:
- can a tcp header have two or more options with the same kind and magic?
I scanned draft-ietf-tcpm-experimental-options-00.txt and it seems
it does not prohibit collisions.
So it should be ok?
Why I'm asking... I think the existing bpf_sock_ops style of running
multiple bpf progs is going to be awkward to use.
Picking the max of bpf_reserve_hdr_opt() from many calls and letting
parent bpf progs override headers written by children feels a bit hackish.
I feel the users will thank us if we design the progs to be more
isolated and independent somehow.
I was thinking maybe each bpf prog could bpf_reserve_hdr_opt()
and bpf_store_hdr_opt() into its own option?
Then, on the option-writing side, the tcp header would have two or more
options with the same kind and magic.
Obviously that creates a headache during parsing. Which bpf prog
should be called for each option?

I suspect the tcp draft actually prefers all options to have a unique kind+magic.
Can we add an attribute at prog load time that will request a particular magic?
Then only that _one_ program will be called for the given kind+magic.
We can still have multiple progs attached to a cgroup (likely the root cgroup),
and different progs will take care of parsing and writing their own option.
The cgroup attaching side can make sure that multiple progs have different magics.

Saving multiple bpf_hdr_opt_off in patch 2 for different magics becomes
problematic. So clearly the implementation complexity shoots through the roof
with the above proposal, but it feels more flexible and more user friendly?
Not a strong opinion. The feature is great as-is.
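
For reference, the RFC 6994 shared-use layout discussed above is
roughly (an illustrative sketch, not from the patch):

	kind (254) | length | 16-bit ExID/magic (e.g. 0xeB9F) | data ...

Multiple kind-254 instances with different ExIDs could in principle
coexist in one header, within the 40 bytes of TCP option space.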

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option
  2020-06-27 17:17   ` Eric Dumazet
@ 2020-06-28 23:44     ` Martin KaFai Lau
  2020-06-29  0:45     ` Martin KaFai Lau
  1 sibling, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-28 23:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Sat, Jun 27, 2020 at 10:17:26AM -0700, Eric Dumazet wrote:
> On Fri, Jun 26, 2020 at 10:55 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > This patch adds logic to parse experimental kind 254 with 16 bit magic
> > 0xeB9F.  A later patch will allow a bpf prog to write and parse data
> > under this experimental kind and magic.
> >
> > A one-byte bpf_hdr_opt_off is added to tcp_skb_cb by using an existing
> > 4-byte hole.  It is only used in rx.  It stores the offset to the
> > bpf experimental option and will be made available to the BPF prog
> > in a later patch.  This offset is also stored in the saved_syn.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/net/request_sock.h | 1 +
> >  include/net/tcp.h          | 3 +++
> >  net/ipv4/tcp_input.c       | 6 ++++++
> >  net/ipv4/tcp_ipv4.c        | 1 +
> >  net/ipv6/tcp_ipv6.c        | 1 +
> >  5 files changed, 12 insertions(+)
> >
> > diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> > index d77237ec9fb4..55297286c066 100644
> > --- a/include/net/request_sock.h
> > +++ b/include/net/request_sock.h
> > @@ -43,6 +43,7 @@ int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
> >
> >  struct saved_syn {
> >         u32 network_hdrlen;
> > +       u32 bpf_hdr_opt_off;
> >         u8 data[];
> >  };
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index eab1c7d0facb..07a9dfe35242 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -191,6 +191,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
> >   */
> >  #define TCPOPT_FASTOPEN_MAGIC  0xF989
> >  #define TCPOPT_SMC_MAGIC       0xE2D4C3D9
> > +#define TCPOPT_BPF_MAGIC       0xEB9F
> >
> >  /*
> >   *     TCP option lengths
> > @@ -204,6 +205,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
> >  #define TCPOLEN_FASTOPEN_BASE  2
> >  #define TCPOLEN_EXP_FASTOPEN_BASE  4
> >  #define TCPOLEN_EXP_SMC_BASE   6
> > +#define TCPOLEN_EXP_BPF_BASE   4
> >
> >  /* But this is what stacks really send out. */
> >  #define TCPOLEN_TSTAMP_ALIGNED         12
> > @@ -857,6 +859,7 @@ struct tcp_skb_cb {
> >                         has_rxtstamp:1, /* SKB has a RX timestamp       */
> >                         unused:5;
> >         __u32           ack_seq;        /* Sequence number ACK'd        */
> > +       __u8            bpf_hdr_opt_off;/* offset to bpf hdr option. rx only. */
> >         union {
> >                 struct {
> >                         /* There is space for up to 24 bytes */
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index eb0e32b2def9..640408a80b3d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3924,6 +3924,10 @@ void tcp_parse_options(const struct net *net,
> >                                         tcp_parse_fastopen_option(opsize -
> >                                                 TCPOLEN_EXP_FASTOPEN_BASE,
> >                                                 ptr + 2, th->syn, foc, true);
> > +                               else if (opsize >= TCPOLEN_EXP_BPF_BASE &&
> > +                                        get_unaligned_be16(ptr) ==
> > +                                        TCPOPT_BPF_MAGIC)
> > +                                       TCP_SKB_CB(skb)->bpf_hdr_opt_off = (ptr - 2) - (unsigned char *)th;
> >                                 else
> >                                         smc_parse_options(th, opt_rx, ptr,
> >                                                           opsize);
> > @@ -6562,6 +6566,8 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
> >                 saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
> >                 if (saved_syn) {
> >                         saved_syn->network_hdrlen = skb_network_header_len(skb);
> > +                       saved_syn->bpf_hdr_opt_off =
> > +                               TCP_SKB_CB(skb)->bpf_hdr_opt_off;
> >                         memcpy(saved_syn->data, skb_network_header(skb), len);
> >                         req->saved_syn = saved_syn;
> >                 }
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index ea0df9fd7618..a3535b7fe002 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1864,6 +1864,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
> >         TCP_SKB_CB(skb)->sacked  = 0;
> >         TCP_SKB_CB(skb)->has_rxtstamp =
> >                         skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
> > +       TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
> >  }
> >
> >  /*
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index f67d45ff00b4..8356d0562279 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -1545,6 +1545,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
> >         TCP_SKB_CB(skb)->sacked = 0;
> >         TCP_SKB_CB(skb)->has_rxtstamp =
> >                         skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
> > +       TCP_SKB_CB(skb)->bpf_hdr_opt_off = 0;
> >  }
> >
> >  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
> > --
> > 2.24.1
> >
> 
> (Sorry for the prior empty reply, I accidentally clicked the wrong area)
> 
> It seems strange that we want to add code in the TCP stack only to cover a
> limited use case (kind 254 and 0xEB9F magic)
> 
> For something like the work Petar Penkov did (to be able to generate
> SYNCOOKIES from XDP), we do not go through tcp_parse_options() and a
> BPF program would have to implement its own parsing (without having an
> SKB at hand), probably calling a helper function, with no
> TCP_SKB_CB(skb)->bpf_hdr_opt_off.
> 
> This patch is hard-coding a specific option and will prevent anyone
> using private option(s) from using this infrastructure in the future,
> while they still pay the extra overhead.
> 
> TCP_SKB_CB(skb) is tight, I would prefer keeping the space in it for
> standard TCP stack features.
> 
> If an optional BPF program needs to re-parse the TCP options to find a
> specific option, maybe the extra cost is noise (especially if this is
> only for SYN & SYNACK packets)?
Thanks for the feedback.

Re: syn & synack only

The bpf tcp hdr option infrastructure is not limited to syn
and synack.  It is also available to data/ack/fin pkts, although
most of the use cases may be limited to syn and synack,
e.g. the later example tests parsing the 0xeB9F option in FIN.

After a connection is established, the bpf prog may choose to keep
listening for (kind 254 and 0xEB9F magic).  bpf_hdr_opt_off is also used
to decide if the tcphdr has the 0xeB9F option before calling the bpf prog
to handle it.

Re: the space in TCP_SKB_CB(skb).  I think I can avoid tapping into it.

bpf_hdr_opt_off is only needed up to the point of calling the bpf prog, i.e.
after the bpf prog returns, bpf_hdr_opt_off is no longer needed in TCP_SKB_CB.
Like "struct tcp_fastopen_cookie *foc", a "u8 *bpf_hdr_opt_off" can be
added to tcp_parse_options() instead of saving it in TCP_SKB_CB(skb),
then passed all the way to the bpf prog and also saved to "saved_syn".
Does that address the concern about the space in TCP_SKB_CB(skb)?
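
For concreteness, the alternative would be a prototype roughly like this
(hypothetical, following the existing foc parameter; not what was posted):

void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
		       struct tcp_options_received *opt_rx, int estab,
		       struct tcp_fastopen_cookie *foc,
		       u8 *bpf_hdr_opt_off);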

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option
  2020-06-28 18:24   ` Alexei Starovoitov
@ 2020-06-29  0:34     ` Martin KaFai Lau
  2020-07-02  5:31       ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-29  0:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	kernel-team, Lawrence Brakmo, Neal Cardwell, netdev,
	Yuchung Cheng

On Sun, Jun 28, 2020 at 11:24:27AM -0700, Alexei Starovoitov wrote:
> On Fri, Jun 26, 2020 at 10:55:26AM -0700, Martin KaFai Lau wrote:
> > 
> > Parsing BPF Header Option
> > ─────────────────────────
> > 
> > As mentioned earlier, the received SYN/SYNACK/ACK during the 3WHS
> > will be available to some specific CB (e.g. the *_ESTABLISHED_CB)
> > 
> > For an established connection, if the kernel finds a bpf header
> > option (i.e. option with kind:254 and magic:0xeB9F) and
> > the "PARSE_HDR_OPT_CB_FLAG" flag is set, the
> > bpf prog will be called in the "BPF_SOCK_OPS_PARSE_HDR_OPT_CB" op.
> > The received skb will be available through sock_ops->skb_data
> > and the bpf header option offset will also be specified
> > in sock_ops->skb_bpf_hdr_opt_off.
> 
> TCP noob question:
> - can a tcp header have two or more options with the same kind and magic?
> I scanned draft-ietf-tcpm-experimental-options-00.txt and it seems
> it's not prohibiting collisions.
> So should be ok?
I also think it is ok.  Regardless of kind, the kernel's tcp_parse_options()
seems to be ok with duplicates too.

> Why I'm asking... I think the existing bpf_sock_ops style of running
> multiple bpf progs is gonna be awkward to use.
> Picking the max of bpf_reserve_hdr_opt() from many calls and letting
> parent bpf progs override children's written headers feels a bit hackish.
> I feel the users will thank us if we design the progs to be more
> isolated and independent somehow.
> I was thinking maybe each bpf prog will bpf_reserve_hdr_opt()
> and bpf_store_hdr_opt() into their own option?
> Then on the option writing side the tcp header will have two or more
> options with the same kind and magic.
> Obviously it creates a headache during parsing. Which bpf prog
> should be called for each option?
> 
> I suspect the tcp draft actually prefers all options to have unique kind+magic.
> Can we add an attribute at prog load time that will request a particular magic?
> Then only that _one_ program will be called for the given kind+magic.
> We can still have multiple progs attached to a cgroup (likely root cgroup)
> and different progs will take care of parsing and writing their own option.
> cgroup attaching side can make sure that multi progs have different magics.
Interesting idea.

If the magic can be specified at load time,
maybe extend this to the "length" requirement too.  At load time,
both magic and length would be specified.  The total length can
be calculated at attach time.  That would avoid making
an extra call to the bpf prog just to learn the length.

If we don't limit the magic, I think we should discuss if we need to limit the
kind to 254 too.  How about allowing the user to write any option kind?  That
can save the 2-byte magic from the limited TCP option space.  At load
time, we can definitely reject the kinds that the kernel is already
writing, e.g. timestamp, sack, etc.

When a tcp pkt is received at an established sk, this patch decides
if there is a 0xeB9F option before calling into the bpf prog.  That needs
to be adjusted as well.  It could be changed to: call into the bpf prog
if there is an option "kind" that the kernel cannot handle and
PARSE_HDR_CB_FLAGS is set.

> Saving multiple bpf_hdr_opt_off in patch 2 for different magics becomes
> problematic. So clearly the implementation complexity shoots through the roof
> with the above proposal, but it feels more flexible and more user friendly?
> Not a strong opinion. The feature is great as-is.
If we go with the above route where multiple cgroup-bpf progs have multiple
kinds (or magics), each of them has to parse the tcphdr to figure
out where its own kind (or magic) is.  The intention of providing
bpf_hdr_opt_off in this patch is mostly for bpf-prog convenience.
I think having it work nicer with multiple bpf progs is the proper trade-off.
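
For reference, the per-prog parsing this implies is a standard TCP option
walk.  A rough sketch in plain C (find_opt and its conventions are made up
for illustration; a bpf version would need verifier-friendly bounds checks
on every access):

#include <linux/types.h>

/* Return the offset of the first option matching kind+magic, or -1.
 * "th" points at the TCP header, "end" at th + doff * 4.
 */
static int find_opt(const __u8 *th, const __u8 *end, __u8 kind, __u16 magic)
{
	const __u8 *p = th + 20;	/* skip the fixed TCP header */

	while (p + 2 <= end) {
		if (p[0] == 0)		/* TCPOPT_EOL */
			break;
		if (p[0] == 1) {	/* TCPOPT_NOP */
			p++;
			continue;
		}
		if (p[1] < 2 || p + p[1] > end)
			break;		/* malformed option length */
		if (p[0] == kind && p[1] >= 4 &&
		    ((p[2] << 8) | p[3]) == magic)
			return p - th;	/* found: offset from th */
		p += p[1];
	}
	return -1;
}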

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option
  2020-06-27 17:17   ` Eric Dumazet
  2020-06-28 23:44     ` Martin KaFai Lau
@ 2020-06-29  0:45     ` Martin KaFai Lau
  1 sibling, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-29  0:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Sat, Jun 27, 2020 at 10:17:26AM -0700, Eric Dumazet wrote:
[ ... ]

> It seems strange that we want to add code in the TCP stack only to cover a
> limited use case (kind 254 and 0xEB9F magic)
> 
> For something like the work Petar Penkov did (to be able to generate
> SYNCOOKIES from XDP), we do not go through tcp_parse_options() and a
> BPF program would have to implement its own parsing (without having an
> SKB at hand), probably calling a helper function, with no
> TCP_SKB_CB(skb)->bpf_hdr_opt_off.
> 
> This patch is hard-coding a specific option and will prevent anyone
> using private option(s) from using this infrastructure in the future,
> while they still pay the extra overhead.
There is a discussion in patch 4 about not limiting this patch set
to option kind 254.  That will affect the usefulness of bpf_hdr_opt_off.

> 
> TCP_SKB_CB(skb) is tight, I would prefer keeping the space in it for
> standard TCP stack features.
> 
> If an optional BPF program needs to re-parse the TCP options to find a
> specific option, maybe the extra cost is noise (especially if this is
> only for SYN & SYNACK packets)?
> 
> Thanks

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test
  2020-06-27 20:31       ` Andrii Nakryiko
@ 2020-06-29 18:00         ` Martin KaFai Lau
  2020-06-29 18:13           ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-29 18:00 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team, Lawrence Brakmo, Neal Cardwell, Networking,
	Yuchung Cheng

On Sat, Jun 27, 2020 at 01:31:42PM -0700, Andrii Nakryiko wrote:
> On Fri, Jun 26, 2020 at 5:23 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Fri, Jun 26, 2020 at 03:45:04PM -0700, Andrii Nakryiko wrote:
> > > On Fri, Jun 26, 2020 at 10:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > It is common for a networking test to create its own netns and make its
> > > > own settings under this new netns (e.g. changing a tcp sysctl).  If the test
> > > > forgets to restore the original netns, it would affect the
> > > > results of other tests.
> > > >
> > > > This patch saves the original netns at the beginning and then restores it
> > > > after every test.  Since the restoring "setns()" is not expensive, it is done
> > > > after all tests without tracking if a test has created a new netns or not.
> > > >
> > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > ---
> > > >  tools/testing/selftests/bpf/test_progs.c | 21 +++++++++++++++++++++
> > > >  tools/testing/selftests/bpf/test_progs.h |  2 ++
> > > >  2 files changed, 23 insertions(+)
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> > > > index 54fa5fa688ce..b521ce366381 100644
> > > > --- a/tools/testing/selftests/bpf/test_progs.c
> > > > +++ b/tools/testing/selftests/bpf/test_progs.c
> > > > @@ -121,6 +121,24 @@ static void reset_affinity() {
> > > >         }
> > > >  }
> > > >
> > > > +static void save_netns(void)
> > > > +{
> > > > +       env.saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> > > > +       if (env.saved_netns_fd == -1) {
> > > > +               perror("open(/proc/self/ns/net)");
> > > > +               exit(-1);
> > > > +       }
> > > > +}
> > > > +
> > > > +static void restore_netns(void)
> > > > +{
> > > > +       if (setns(env.saved_netns_fd, CLONE_NEWNET) == -1) {
> > > > +               stdio_restore();
> > > > +               perror("setns(CLONE_NEWNS)");
> > > > +               exit(-1);
> > > > +       }
> > > > +}
> > > > +
> > > >  void test__end_subtest()
> > > >  {
> > > >         struct prog_test_def *test = env.test;
> > > > @@ -643,6 +661,7 @@ int main(int argc, char **argv)
> > > >                 return -1;
> > > >         }
> > > >
> > > > +       save_netns();
> > >
> > > you should probably do this also after each sub-test in test__end_subtest()?
> > You mean restore_netns()?
> 
> oops, yeah :)
> 
> >
> > It is a tough call.
> > Some tests may only want to create a netns at the beginning for all the subtests
> > to use (e.g. sk_assign.c).  restore_netns() after each subtest may catch
> > the tester by surprise, since the netns is not under its full control while its
> > own test is running.
> 
> Wouldn't it be better to update such self-tests to setns on each
> sub-test properly? It should be a simple code re-use exercise, unless
> I'm missing some other implications of having to do it before each
> sub-test?
It should be simple, I think.  Haven't looked into details of each test.
However, I won't count re-running the same piece of code in a for-loop
as a re-use exercise ;)

In my vm, a quick try at forcing sk_assign.c to reconfigure netns in each
subtest in the for loop increased the runtime from 1s to 8s.
I guess it is not a big deal for test_progs.

> 
> The idea behind a sub-test is (at least it was so far) that it's
> independent from other sub-tests and tests, and it's only co-located
> with other sub-tests for the purpose of code reuse and logical
> grouping. Which is why we reset CPU affinity, for instance.
> 
> >
> > I think an individual test should have managed the netns properly within its
> > subtests already if it wants a correct test result.  It can unshare at the
> > beginning of each subtest to get a clean state (e.g. in patch 8).
> > test_progs.c only ensures a config made by an earlier test does
> > not affect the following tests.
> 
> It's true that it gives more flexibility for test setup, but if we go
> that way, we should do it consistently for CPU affinity resetting and
> whatever else we do per-subtest. I wonder what your answers would be
> for the above questions. We can go either way, let's just be
> consistent.
Right, I also don't feel strongly about which way to go for netns.
I noticed reset_affinity(), but cgroup cleanup is also per test.
I think netns is closer to cgroup in the sense that a bpf prog runs under it,
so this patch picked the current way.

If it is decided to stay with reset_affinity's way, I can make the netns change
in the other tests (there are two if I grepped properly).

It seems no existing subtest requires reset_affinity.
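
For reference, the per-test pattern being described (the test unshares its
own netns and test_progs restores the original afterwards) is roughly the
following sketch:

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

/* Sketch only: enter a fresh netns with default sysctls.  test_progs'
 * restore_netns() switches back to the original netns after the test.
 */
static int enter_clean_netns(void)
{
	if (unshare(CLONE_NEWNET))
		return -1;
	/* a new netns starts with lo down */
	if (system("ip link set dev lo up"))
		return -1;
	return 0;
}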

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test
  2020-06-29 18:00         ` Martin KaFai Lau
@ 2020-06-29 18:13           ` Andrii Nakryiko
  2020-06-29 18:24             ` Martin KaFai Lau
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2020-06-29 18:13 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team, Lawrence Brakmo, Neal Cardwell, Networking,
	Yuchung Cheng

On Mon, Jun 29, 2020 at 11:00 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Sat, Jun 27, 2020 at 01:31:42PM -0700, Andrii Nakryiko wrote:
> > On Fri, Jun 26, 2020 at 5:23 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Fri, Jun 26, 2020 at 03:45:04PM -0700, Andrii Nakryiko wrote:
> > > > On Fri, Jun 26, 2020 at 10:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > It is common for a networking test to create its own netns and make its
> > > > > own settings under this new netns (e.g. changing a tcp sysctl).  If the test
> > > > > forgets to restore the original netns, it would affect the
> > > > > results of other tests.
> > > > >
> > > > > This patch saves the original netns at the beginning and then restores it
> > > > > after every test.  Since the restoring "setns()" is not expensive, it is done
> > > > > after all tests without tracking if a test has created a new netns or not.
> > > > >
> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > ---
> > > > >  tools/testing/selftests/bpf/test_progs.c | 21 +++++++++++++++++++++
> > > > >  tools/testing/selftests/bpf/test_progs.h |  2 ++
> > > > >  2 files changed, 23 insertions(+)
> > > > >
> > > > > diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> > > > > index 54fa5fa688ce..b521ce366381 100644
> > > > > --- a/tools/testing/selftests/bpf/test_progs.c
> > > > > +++ b/tools/testing/selftests/bpf/test_progs.c
> > > > > @@ -121,6 +121,24 @@ static void reset_affinity() {
> > > > >         }
> > > > >  }
> > > > >
> > > > > +static void save_netns(void)
> > > > > +{
> > > > > +       env.saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> > > > > +       if (env.saved_netns_fd == -1) {
> > > > > +               perror("open(/proc/self/ns/net)");
> > > > > +               exit(-1);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +static void restore_netns(void)
> > > > > +{
> > > > > +       if (setns(env.saved_netns_fd, CLONE_NEWNET) == -1) {
> > > > > +               stdio_restore();
> > > > > +               perror("setns(CLONE_NEWNS)");
> > > > > +               exit(-1);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > >  void test__end_subtest()
> > > > >  {
> > > > >         struct prog_test_def *test = env.test;
> > > > > @@ -643,6 +661,7 @@ int main(int argc, char **argv)
> > > > >                 return -1;
> > > > >         }
> > > > >
> > > > > +       save_netns();
> > > >
> > > > you should probably do this also after each sub-test in test__end_subtest()?
> > > You mean restore_netns()?
> >
> > oops, yeah :)
> >
> > >
> > > It is a tough call.
> > > Some tests may only want to create a netns at the beginning for all the subtests
> > > to use (e.g. sk_assign.c).  restore_netns() after each subtest may catch
> > > the tester by surprise, since the netns is not under its full control while its
> > > own test is running.
> >
> > Wouldn't it be better to update such self-tests to setns on each
> > sub-test properly? It should be a simple code re-use exercise, unless
> > I'm missing some other implications of having to do it before each
> > sub-test?
> It should be simple, I think.  Haven't looked into details of each test.
> However, I won't count re-running the same piece of code in a for-loop
> as a re-use exercise ;)
>
> In my vm, a quick try at forcing sk_assign.c to reconfigure netns in each
> subtest in the for loop increased the runtime from 1s to 8s.
> I guess it is not a big deal for test_progs.

Oh, no, thank you very much, no one needs an extra 7 seconds of
test_progs run. Can you please remove reset_affinity() from sub-test
cleanup then, and consistently do cleanups only between tests?

>
> >
> > The idea behind a sub-test is (at least it was so far) that it's
> > independent from other sub-tests and tests, and it's only co-located
> > with other sub-tests for the purpose of code reuse and logical
> > grouping. Which is why we reset CPU affinity, for instance.
> >
> > >
> > > I think an individual test should have managed the netns properly within its
> > > subtests already if it wants a correct test result.  It can unshare at the
> > > beginning of each subtest to get a clean state (e.g. in patch 8).
> > > test_progs.c only ensures a config made by an earlier test does
> > > not affect the following tests.
> >
> > It's true that it gives more flexibility for test setup, but if we go
> > that way, we should do it consistently for CPU affinity resetting and
> > whatever else we do per-subtest. I wonder what your answers would be
> > for the above questions. We can go either way, let's just be
> > consistent.
> Right, I also don't feel strongly about which way to go for netns.
> I noticed reset_affinity(), but cgroup cleanup is also per test.
> I think netns is closer to cgroup in the sense that a bpf prog runs under it,
> so this patch picked the current way.
>
> If it is decided to stay with reset_affinity's way, I can make the netns change
> in the other tests (there are two if I grepped properly).
>
> It seems no existing subtest requires reset_affinity.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test
  2020-06-29 18:13           ` Andrii Nakryiko
@ 2020-06-29 18:24             ` Martin KaFai Lau
  0 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-29 18:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team, Lawrence Brakmo, Neal Cardwell, Networking,
	Yuchung Cheng

On Mon, Jun 29, 2020 at 11:13:07AM -0700, Andrii Nakryiko wrote:
> On Mon, Jun 29, 2020 at 11:00 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Sat, Jun 27, 2020 at 01:31:42PM -0700, Andrii Nakryiko wrote:
> > > On Fri, Jun 26, 2020 at 5:23 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Fri, Jun 26, 2020 at 03:45:04PM -0700, Andrii Nakryiko wrote:
> > > > > On Fri, Jun 26, 2020 at 10:56 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > It is common for a networking test to create its own netns and make its
> > > > > > own settings under this new netns (e.g. changing a tcp sysctl).  If the test
> > > > > > forgets to restore the original netns, it would affect the
> > > > > > results of other tests.
> > > > > >
> > > > > > This patch saves the original netns at the beginning and then restores it
> > > > > > after every test.  Since the restoring "setns()" is not expensive, it is done
> > > > > > after all tests without tracking if a test has created a new netns or not.
> > > > > >
> > > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > > ---
> > > > > >  tools/testing/selftests/bpf/test_progs.c | 21 +++++++++++++++++++++
> > > > > >  tools/testing/selftests/bpf/test_progs.h |  2 ++
> > > > > >  2 files changed, 23 insertions(+)
> > > > > >
> > > > > > diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> > > > > > index 54fa5fa688ce..b521ce366381 100644
> > > > > > --- a/tools/testing/selftests/bpf/test_progs.c
> > > > > > +++ b/tools/testing/selftests/bpf/test_progs.c
> > > > > > @@ -121,6 +121,24 @@ static void reset_affinity() {
> > > > > >         }
> > > > > >  }
> > > > > >
> > > > > > +static void save_netns(void)
> > > > > > +{
> > > > > > +       env.saved_netns_fd = open("/proc/self/ns/net", O_RDONLY);
> > > > > > +       if (env.saved_netns_fd == -1) {
> > > > > > +               perror("open(/proc/self/ns/net)");
> > > > > > +               exit(-1);
> > > > > > +       }
> > > > > > +}
> > > > > > +
> > > > > > +static void restore_netns(void)
> > > > > > +{
> > > > > > +       if (setns(env.saved_netns_fd, CLONE_NEWNET) == -1) {
> > > > > > +               stdio_restore();
> > > > > > +               perror("setns(CLONE_NEWNS)");
> > > > > > +               exit(-1);
> > > > > > +       }
> > > > > > +}
> > > > > > +
> > > > > >  void test__end_subtest()
> > > > > >  {
> > > > > >         struct prog_test_def *test = env.test;
> > > > > > @@ -643,6 +661,7 @@ int main(int argc, char **argv)
> > > > > >                 return -1;
> > > > > >         }
> > > > > >
> > > > > > +       save_netns();
> > > > >
> > > > > you should probably do this also after each sub-test in test__end_subtest()?
> > > > You mean restore_netns()?
> > >
> > > oops, yeah :)
> > >
> > > >
> > > > It is a tough call.
> > > > Some tests may only want to create a netns at the beginning for all the subtests
> > > > to use (e.g. sk_assign.c).  restore_netns() after each subtest may catch
> > > > the tester by surprise, since the netns is not under its full control while its
> > > > own test is running.
> > >
> > > Wouldn't it be better to update such self-tests to setns on each
> > > sub-test properly? It should be a simple code re-use exercise, unless
> > > I'm missing some other implications of having to do it before each
> > > sub-test?
> > It should be simple, I think.  Haven't looked into details of each test.
> > However, I won't count re-running the same piece of code in a for-loop
> > as a re-use exercise ;)
> >
> > In my vm, a quick try at forcing sk_assign.c to reconfigure netns in each
> > subtest in the for loop increased the runtime from 1s to 8s.
> > I guess it is not a big deal for test_progs.
> 
> Oh, no, thank you very much, no one needs an extra 7 seconds of
> test_progs run. Can you please remove reset_affinity() from sub-test
> cleanup then, and consistently do cleanups only between tests?
Sure.

reset_affinity() has already been called after each test, so it should be
fine as is.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn
  2020-06-27 17:41   ` Eric Dumazet
@ 2020-06-30 23:24     ` Martin KaFai Lau
  2020-06-30 23:35       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Martin KaFai Lau @ 2020-06-30 23:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Sat, Jun 27, 2020 at 10:41:32AM -0700, Eric Dumazet wrote:
> On Fri, Jun 26, 2020 at 10:55 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > The total length of the saved syn packet is currently stored in
> > the first 4 bytes (u32) and the actual packet data is stored after that.
> >
> > A later patch will also want to store an offset (bpf_hdr_opt_off) to
> > a TCP header option which the bpf program will be interested in parsing.
> > Instead of anonymously storing this offset into the second 4 bytes,
> > this patch creates a struct for the existing saved_syn.
> > It can give a readable name to the stored lengths instead of implicitly
> > using the first few u32(s) to do that.
> >
> > The new TCP bpf header offset (bpf_hdr_opt_off) added in a later patch is
> > an offset from the tcp header instead of from the network header.
> > It will make the bpf programming side easier.  Thus, this patch stores
> > the network header length instead of the total length of the syn
> > header.  The total length can be obtained by the
> > "network header len + tcp_hdrlen".  The latter patch can
> > then also gets the offset to the TCP bpf header option by
> > "network header len + bpf_hdr_opt_off".
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/linux/tcp.h        | 11 ++++++++++-
> >  include/net/request_sock.h |  7 ++++++-
> >  net/core/filter.c          |  4 ++--
> >  net/ipv4/tcp.c             |  9 +++++----
> >  net/ipv4/tcp_input.c       | 12 ++++++------
> >  5 files changed, 29 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index 3bdec31ce8f4..9d50132d95e6 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -404,7 +404,7 @@ struct tcp_sock {
> >          * socket. Used to retransmit SYNACKs etc.
> >          */
> >         struct request_sock __rcu *fastopen_rsk;
> > -       u32     *saved_syn;
> > +       struct saved_syn *saved_syn;
> >  };
> >
> >  enum tsq_enum {
> > @@ -482,6 +482,15 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
> >         tp->saved_syn = NULL;
> >  }
> >
> > +static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn)
> > +{
> > +       const struct tcphdr *th;
> > +
> > +       th = (void *)saved_syn->data + saved_syn->network_hdrlen;
> > +
> > +       return saved_syn->network_hdrlen + __tcp_hdrlen(th);
> > +}
> 
> 
> Ah... We have a patch extending TCP_SAVE_SYN to save all headers, so
> keeping the length in a proper field would be better than going back
> to the TCP header to get __tcp_hdrlen(th)
> 
> I am not sure why trying to save 4 bytes in this saved_syn would matter ;)
I will use another 4 bytes to store __tcp_hdrlen().

> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index c9fcfa4ec43f3f0d75763e2bc6773e15bd38d68f..8ecdc5f87788439c7a08d3b72f9567e6369e7c4e
> 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -258,7 +258,7 @@ struct tcp_sock {
>                 fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
>                 fastopen_no_cookie:1, /* Allow send/recv SYN+data
> without a cookie */
>                 is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
> -               unused:2;
> +               save_syn:2;     /* Save headers of SYN packet */
Interesting idea.  I don't have an immediate use case for the mac header of
the SYN but I think it may be useful going forward.

Although bpf_hdr_opt_off may no longer be needed in v2,
it will still be convenient for the bpf prog to be able to get the TCP header
only instead of reparsing from the mac/ip[46] headers, and it also saves some
stack space in the bpf prog.  Thus, storing the length of each hdr would still
be nice so that the bpf helper can offset directly to the start of the
required header.

Do you prefer to incorporate this "save_syn:2" idea in this set
so that mac hdrlen can be stored in the "struct saved_syn"?

This "unused:2" bits have already been used by "fastopen_client_fail:2".
May be get 2 bits from repair_queue?


>         u8      nonagle     : 4,/* Disable Nagle algorithm?             */
>                 thin_lto    : 1,/* Use linear timeouts for thin streams */
>                 recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
> @@ -270,7 +270,7 @@ struct tcp_sock {
>                 syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
>                 syn_fastopen_ch:1, /* Active TFO re-enabling probe */
>                 syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
> -               save_syn:1,     /* Save headers of SYN packet */
> +               unused_save_syn:1,      /* Moved above */
>                 is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
>                 syn_smc:1;      /* SYN includes SMC */
>         u32     tlp_high_seq;   /* snd_nxt at the time of TLP retransmit. */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn
  2020-06-30 23:24     ` Martin KaFai Lau
@ 2020-06-30 23:35       ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-06-30 23:35 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Tue, Jun 30, 2020 at 4:24 PM Martin KaFai Lau <kafai@fb.com> wrote:

> Interesting idea.  I don't have an immediate use case for the mac header of
> the SYN but I think it may be useful going forward.
>
> Although bpf_hdr_opt_off may no longer be needed in v2,
> it will still be convenient for the bpf prog to be able to get the TCP header
> only instead of reparsing from the mac/ip[46] headers, and it also saves some
> stack space in the bpf prog.  Thus, storing the length of each hdr would still
> be nice so that the bpf helper can offset directly to the start of the
> required header.
>
> Do you prefer to incorporate this "save_syn:2" idea in this set
> so that mac hdrlen can be stored in the "struct saved_syn"?

Sure, please go ahead.

>
> This "unused:2" bits have already been used by "fastopen_client_fail:2".

Oh that is possible, I have sent the patch as it was when we merged it
years ago.

> Maybe get 2 bits from repair_queue?

Hmm... this repair_queue is used in the fast path, I would rather keep it
as a u8 for better code generation.
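
Taken together, the struct discussed in this sub-thread would presumably
end up looking something like the sketch below (field names and comments
are assumptions, not what was posted):

struct saved_syn {
	u32 mac_hdrlen;		/* 0 unless save_syn also asks for L2 */
	u32 network_hdrlen;
	u32 tcp_hdrlen;		/* avoids re-deriving __tcp_hdrlen(th) */
	u8  data[];
};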

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option
  2020-06-29  0:34     ` Martin KaFai Lau
@ 2020-07-02  5:31       ` Martin KaFai Lau
  0 siblings, 0 replies; 28+ messages in thread
From: Martin KaFai Lau @ 2020-07-02  5:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	kernel-team, Lawrence Brakmo, Neal Cardwell, netdev,
	Yuchung Cheng

On Sun, Jun 28, 2020 at 05:34:48PM -0700, Martin KaFai Lau wrote:
> On Sun, Jun 28, 2020 at 11:24:27AM -0700, Alexei Starovoitov wrote:
> > On Fri, Jun 26, 2020 at 10:55:26AM -0700, Martin KaFai Lau wrote:
> > > 
> > > Parsing BPF Header Option
> > > ─────────────────────────
> > > 
> > > As mentioned earlier, the received SYN/SYNACK/ACK during the 3WHS
> > > will be available to some specific CB (e.g. the *_ESTABLISHED_CB)
> > > 
> > > For an established connection, if the kernel finds a bpf header
> > > option (i.e. option with kind:254 and magic:0xeB9F) and
> > > the "PARSE_HDR_OPT_CB_FLAG" flag is set, the
> > > bpf prog will be called in the "BPF_SOCK_OPS_PARSE_HDR_OPT_CB" op.
> > > The received skb will be available through sock_ops->skb_data
> > > and the bpf header option offset will also be specified
> > > in sock_ops->skb_bpf_hdr_opt_off.
> > 
> > TCP noob question:
> > - can a tcp header have two or more options with the same kind and magic?
> > I scanned draft-ietf-tcpm-experimental-options-00.txt and it seems
> > it's not prohibiting collisions.
> > So should be ok?
> I also think it is ok.  Regardless of kind, the kernel's tcp_parse_options()
> seems to be ok with duplicates too.
> 
> > Why I'm asking... I think the existing bpf_sock_ops style of running
> > multiple bpf progs is gonna be awkward to use.
> > Picking the max of bpf_reserve_hdr_opt() from many calls and letting
> > parent bpf progs override children's written headers feels a bit hackish.
> > I feel the users will thank us if we design the progs to be more
> > isolated and independent somehow.
> > I was thinking maybe each bpf prog will bpf_reserve_hdr_opt()
> > and bpf_store_hdr_opt() into their own option?
> > Then on the option writing side the tcp header will have two or more
> > options with the same kind and magic.
> > Obviously it creates a headache during parsing. Which bpf prog
> > should be called for each option?
> > 
> > I suspect the tcp draft actually prefers all options to have unique kind+magic.
> > Can we add an attribute at prog load time that will request a particular magic?
> > Then only that _one_ program will be called for the given kind+magic.
> > We can still have multiple progs attached to a cgroup (likely root cgroup)
> > and different progs will take care of parsing and writing their own option.
> > cgroup attaching side can make sure that multi progs have different magics.
> Interesting idea.
> 
> If the magic can be specified at load time,
> maybe extend this to the "length" requirement too.  At load time,
> both magic and length would be specified.  The total length can
> be calculated at attach time.  That would avoid making
> an extra call to the bpf prog just to learn the length.
> 
> If we don't limit the magic, I think we should discuss if we need to limit the
> kind to 254 too.  How about allowing the user to write any option kind?  That
> can save the 2-byte magic from the limited TCP option space.  At load
> time, we can definitely reject the kinds that the kernel is already
> writing, e.g. timestamp, sack, etc.
I have thought more about allowing only one kind per bpf prog at load time.

I think it is not ideal for some common cases in the 3WHS.  For example,
when rolling/testing out a newer option to replace the old one,
the bpf prog at the server may want to see if the client supports option-A
or option-B from the SYN and then reply in the SYNACK accordingly with
either option-A or option-B.  It will be easier to allow one bpf prog to
make the decision instead of having two bpf progs (one for option-A and
one for option-B) that may need to co-ordinate with each other.  The
option length will not be static either.

The prog load attribute could be extended to take >1 kinds, or maybe
some arraymap convention could be used to do this.  However,
I am not sure that is worth it considering most of the use cases are only
in the 3WHS, and checking for kind duplication at runtime may not be too
bad considering the TCP option space is only 40 bytes and each option
has to be 4-byte aligned.

I am thinking of allowing the bpf prog to write multiple option kinds
and checking for kind uniqueness at runtime.  That will include
checking the options already written by the kernel.  In a SYN, there are
usually 4 options: mss, sackOK, TS, and wscale.

The bpf_store_hdr_opt API will be changed to:

long bpf_store_hdr_opt(struct bpf_sock_ops *skops, u8 kind, const void *from, u32 len, u64 flags)

It writes one _complete_ option at a time and "u8 kind" is required.
No offset is needed because it does not allow going back to rewrite
something that has already been written.
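
For illustration, the write side with this API might look like the sketch
below.  Only the bpf_store_hdr_opt() signature is as proposed above; the
reserve helper's arguments, the two op names, and the payload are
assumptions for the sketch.

SEC("sockops")
int write_hdr_opt(struct bpf_sock_ops *skops)
{
	/* 2-byte payload; assume the helper prepends kind and length */
	__u8 data[2] = { 0x01, 0x02 };

	switch (skops->op) {
	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:	/* assumed op name */
		/* reserve kind(1) + len(1) + payload(2) = 4 bytes */
		bpf_reserve_hdr_opt(skops, 4, 0);
		break;
	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:	/* assumed op name */
		/* one complete option at a time; kind is explicit */
		bpf_store_hdr_opt(skops, 254, data, sizeof(data), 0);
		break;
	}
	return 1;
}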

Thoughts?

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-07-02  5:32 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-26 17:55 [PATCH bpf-next 00/10] BPF TCP header options Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 01/10] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
2020-06-27 17:41   ` Eric Dumazet
2020-06-30 23:24     ` Martin KaFai Lau
2020-06-30 23:35       ` Eric Dumazet
2020-06-26 17:55 ` [PATCH bpf-next 02/10] tcp: bpf: Parse BPF experimental header option Martin KaFai Lau
2020-06-27 16:44   ` Eric Dumazet
2020-06-27 17:17   ` Eric Dumazet
2020-06-28 23:44     ` Martin KaFai Lau
2020-06-29  0:45     ` Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 03/10] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 04/10] bpf: tcp: Allow bpf prog to write and parse BPF TCP header option Martin KaFai Lau
2020-06-28 18:24   ` Alexei Starovoitov
2020-06-29  0:34     ` Martin KaFai Lau
2020-07-02  5:31       ` Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 05/10] bpf: selftests: A few improvements to network_helpers.c Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 06/10] bpf: selftests: Add fastopen_connect to network_helpers Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 07/10] bpf: selftests: Restore netns after each test Martin KaFai Lau
2020-06-26 22:45   ` Andrii Nakryiko
2020-06-27  0:23     ` Martin KaFai Lau
2020-06-27 20:31       ` Andrii Nakryiko
2020-06-29 18:00         ` Martin KaFai Lau
2020-06-29 18:13           ` Andrii Nakryiko
2020-06-29 18:24             ` Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 08/10] bpf: selftests: tcp header options Martin KaFai Lau
2020-06-26 17:55 ` [PATCH bpf-next 09/10] tcp: bpf: Add TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN to bpf_setsockopt Martin KaFai Lau
2020-06-27 17:30   ` Eric Dumazet
2020-06-26 17:56 ` [PATCH bpf-next 10/10] bpf: selftest: Add test for TCP_BPF_DELACK_MAX and TCP_BPF_RTO_MIN Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).