BPF Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 bpf-next 0/9] BPF TCP header options
@ 2020-07-30 20:56 Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance.  Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal traffic only.
    
This patch set introduces the necessary BPF logic and API to
allow bpf program to write and parse header options.

There are also some changes to TCP and they are mostly to provide
the needed sk and skb info to the bpf program to make decision.

Patch 6 is the main patch and has more details on the API and design.

The set includes an example which sends the max delay ack in
the BPF TCP header option and the receiving side can
then adjust its RTO accordingly.

v3:
- Add kdoc for tcp_make_synack (Jakub Kicinski)
- Add BPF_WRITE_HDR_TCP_CURRENT_MSS and BPF_WRITE_HDR_TCP_SYNACK_COOKIE
  in bpf.h to give a clearer meaning to sock_ops->args[0] when
  writing header option.
- Rename BPF_SOCK_OPS_PARSE_UNKWN_HDR_OPT_CB_FLAG
  to     BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG

v2:
- Instead of limiting the bpf prog to write experimental
  option (kind:254, magic:0xeB9F), this revision allows the bpf prog to
  write any TCP header option through the bpf_store_hdr_opt() helper.
  That will allow different bpf-progs to write its own
  option and the helper will guarantee there is no duplication.

- Add bpf_load_hdr_opt() helper to search a particular option by kind.
  Some of the get_syn logic is refactored to bpf_sock_ops_get_syn().

- Since bpf prog is no longer limited to option (254, 0xeB9F),
  the TCP_SKB_CB(skb)->bpf_hdr_opt_off is no longer needed.
  Instead, when there is any option kernel cannot recognize,
  the bpf prog will be called if the
  BPF_SOCK_OPS_PARSE_UNKWN_HDR_OPT_CB_FLAG is set.
  [ The "unknown_opt" is learned in tcp_parse_options() in patch 4. ]

- Add BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.
  If this flag is set, the bpf-prog will be called
  on all tcp packet received at an established sk.
  It will be useful to ensure a previously written header option is
  received by the peer.
  e.g. The latter test is using this on the active-side during syncookie.

- The test_tcp_hdr_options.c is adjusted accordingly
  to test writing both experimental and regular TCP header option.

- The test_misc_tcp_hdr_options.c is added to mainly
  test different cases on the new helpers.
  
- Break up the TCP_BPF_RTO_MIN and TCP_BPF_DELACK_MAX into
  two patches.

- Directly store the tcp_hdrlen in "struct saved_syn" instead of
  going back to the tcp header to obtain it by "th->doff * 4"

- Add a new optval(==2) for setsockopt(TCP_SAVE_SYN) such
  that it will also store the mac header (patch 9).

Martin KaFai Lau (9):
  tcp: Use a struct to represent a saved_syn
  tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt
  tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt
  tcp: Add unknown_opt arg to tcp_parse_options
  bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8
  bpf: tcp: Allow bpf prog to write and parse TCP header option
  bpf: selftests: Add fastopen_connect to network_helpers
  bpf: selftests: tcp header options
  tcp: bpf: Optionally store mac header in TCP_SAVE_SYN

 drivers/infiniband/hw/cxgb4/cm.c              |   2 +-
 include/linux/bpf-cgroup.h                    |  25 +
 include/linux/filter.h                        |   8 +-
 include/linux/tcp.h                           |  18 +-
 include/net/inet_connection_sock.h            |   2 +
 include/net/request_sock.h                    |   9 +-
 include/net/tcp.h                             |  58 +-
 include/uapi/linux/bpf.h                      | 234 ++++++-
 net/core/filter.c                             | 416 ++++++++++-
 net/ipv4/syncookies.c                         |   2 +-
 net/ipv4/tcp.c                                |  16 +-
 net/ipv4/tcp_fastopen.c                       |   2 +-
 net/ipv4/tcp_input.c                          | 151 +++-
 net/ipv4/tcp_ipv4.c                           |   3 +-
 net/ipv4/tcp_minisocks.c                      |   5 +-
 net/ipv4/tcp_output.c                         | 196 +++++-
 net/ipv6/syncookies.c                         |   2 +-
 net/ipv6/tcp_ipv6.c                           |   3 +-
 tools/include/uapi/linux/bpf.h                | 234 ++++++-
 tools/testing/selftests/bpf/network_helpers.c |  37 +
 tools/testing/selftests/bpf/network_helpers.h |   2 +
 .../bpf/prog_tests/tcp_hdr_options.c          | 629 +++++++++++++++++
 .../bpf/progs/test_misc_tcp_hdr_options.c     | 338 +++++++++
 .../bpf/progs/test_tcp_hdr_options.c          | 657 ++++++++++++++++++
 .../selftests/bpf/test_tcp_hdr_options.h      | 150 ++++
 25 files changed, 3126 insertions(+), 73 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/test_tcp_hdr_options.h

-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-31 15:57   ` Eric Dumazet
  2020-07-30 20:57 ` [PATCH v3 bpf-next 2/9] tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt Martin KaFai Lau
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.

A latter patch will add a bpf helper that allows to get the tcp header
alone from the saved syn without the network header.  It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it.  This requires to separately store the network hdrlen.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt.  Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn".  By using a new struct, it can
give a readable name to each individual header length.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/tcp.h        |  7 ++++++-
 include/net/request_sock.h |  8 +++++++-
 net/core/filter.c          |  4 ++--
 net/ipv4/tcp.c             |  9 +++++----
 net/ipv4/tcp_input.c       | 15 ++++++++-------
 5 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 527d668a5275..5528912dc468 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -406,7 +406,7 @@ struct tcp_sock {
 	 * socket. Used to retransmit SYNACKs etc.
 	 */
 	struct request_sock __rcu *fastopen_rsk;
-	u32	*saved_syn;
+	struct saved_syn *saved_syn;
 };
 
 enum tsq_enum {
@@ -484,6 +484,11 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
 	tp->saved_syn = NULL;
 }
 
+static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn)
+{
+	return saved_syn->network_hdrlen + saved_syn->tcp_hdrlen;
+}
+
 struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk);
 
 static inline u16 tcp_mss_clamp(const struct tcp_sock *tp, u16 mss)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index cf8b33213bbc..b1b101814ecb 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -41,6 +41,12 @@ struct request_sock_ops {
 
 int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
 
+struct saved_syn {
+	u32 network_hdrlen;
+	u32 tcp_hdrlen;
+	u8 data[];
+};
+
 /* struct request_sock - mini sock to represent a connection request
  */
 struct request_sock {
@@ -60,7 +66,7 @@ struct request_sock {
 	struct timer_list		rsk_timer;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
-	u32				*saved_syn;
+	struct saved_syn		*saved_syn;
 	u32				secid;
 	u32				peer_secid;
 };
diff --git a/net/core/filter.c b/net/core/filter.c
index 7124f0fe6974..250b5552a148 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4550,9 +4550,9 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname,
 			tp = tcp_sk(sk);
 
 			if (optlen <= 0 || !tp->saved_syn ||
-			    optlen > tp->saved_syn[0])
+			    optlen > tcp_saved_syn_len(tp->saved_syn))
 				goto err_clear;
-			memcpy(optval, tp->saved_syn + 1, optlen);
+			memcpy(optval, tp->saved_syn->data, optlen);
 			break;
 		default:
 			goto err_clear;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 27de9380ed14..8a774b5094e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3791,20 +3791,21 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 
 		lock_sock(sk);
 		if (tp->saved_syn) {
-			if (len < tp->saved_syn[0]) {
-				if (put_user(tp->saved_syn[0], optlen)) {
+			if (len < tcp_saved_syn_len(tp->saved_syn)) {
+				if (put_user(tcp_saved_syn_len(tp->saved_syn),
+					     optlen)) {
 					release_sock(sk);
 					return -EFAULT;
 				}
 				release_sock(sk);
 				return -EINVAL;
 			}
-			len = tp->saved_syn[0];
+			len = tcp_saved_syn_len(tp->saved_syn);
 			if (put_user(len, optlen)) {
 				release_sock(sk);
 				return -EFAULT;
 			}
-			if (copy_to_user(optval, tp->saved_syn + 1, len)) {
+			if (copy_to_user(optval, tp->saved_syn->data, len)) {
 				release_sock(sk);
 				return -EFAULT;
 			}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a018bafd7bdf..6c38ca9de17e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6598,13 +6598,14 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 {
 	if (tcp_sk(sk)->save_syn) {
 		u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
-		u32 *copy;
-
-		copy = kmalloc(len + sizeof(u32), GFP_ATOMIC);
-		if (copy) {
-			copy[0] = len;
-			memcpy(&copy[1], skb_network_header(skb), len);
-			req->saved_syn = copy;
+		struct saved_syn *saved_syn;
+
+		saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
+		if (saved_syn) {
+			saved_syn->network_hdrlen = skb_network_header_len(skb);
+			saved_syn->tcp_hdrlen = tcp_hdrlen(skb);
+			memcpy(saved_syn->data, skb_network_header(skb), len);
+			req->saved_syn = saved_syn;
 		}
 	}
 }
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 2/9] tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 3/9] tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt Martin KaFai Lau
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.

The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
to its peer through bpf header option.  The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.

Another latter selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/inet_connection_sock.h | 1 +
 include/uapi/linux/bpf.h           | 1 +
 net/core/filter.c                  | 8 ++++++++
 net/ipv4/tcp.c                     | 2 ++
 net/ipv4/tcp_output.c              | 2 ++
 tools/include/uapi/linux/bpf.h     | 1 +
 6 files changed, 15 insertions(+)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 1e209ce7d1bd..8b6d89ac91cc 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -86,6 +86,7 @@ struct inet_connection_sock {
  	struct timer_list	  icsk_retransmit_timer;
  	struct timer_list	  icsk_delack_timer;
 	__u32			  icsk_rto;
+	__u32                     icsk_delack_max;
 	__u32			  icsk_pmtu_cookie;
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index eb5e0c38eb2c..993060d9ecf2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4244,6 +4244,7 @@ enum {
 enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
+	TCP_BPF_DELACK_MAX	= 1003, /* Max delay ack in usecs */
 };
 
 struct bpf_perf_event_value {
diff --git a/net/core/filter.c b/net/core/filter.c
index 250b5552a148..969c6b6b98d0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4459,6 +4459,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 			struct tcp_sock *tp = tcp_sk(sk);
+			unsigned long timeout;
 
 			if (optlen != sizeof(int))
 				return -EINVAL;
@@ -4480,6 +4481,13 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 					tp->snd_ssthresh = val;
 				}
 				break;
+			case TCP_BPF_DELACK_MAX:
+				timeout = usecs_to_jiffies(val);
+				if (timeout > TCP_DELACK_MAX ||
+				    timeout < TCP_TIMEOUT_MIN)
+					return -EINVAL;
+				inet_csk(sk)->icsk_delack_max = timeout;
+				break;
 			case TCP_SAVE_SYN:
 				if (val < 0 || val > 1)
 					ret = -EINVAL;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8a774b5094e9..10db079cb9e3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,7 @@ void tcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&tp->tsorted_sent_queue);
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 	minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
 
@@ -2685,6 +2686,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_backoff = 0;
 	icsk->icsk_probes_out = 0;
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd = TCP_INIT_CWND;
 	tp->snd_cwnd_cnt = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d8f16f6a9b02..984114087263 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3741,6 +3741,8 @@ void tcp_send_delayed_ack(struct sock *sk)
 		ato = min(ato, max_ato);
 	}
 
+	ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max);
+
 	/* Stay within the limit we were given */
 	timeout = jiffies + ato;
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index eb5e0c38eb2c..993060d9ecf2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4244,6 +4244,7 @@ enum {
 enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
+	TCP_BPF_DELACK_MAX	= 1003, /* Max delay ack in usecs */
 };
 
 struct bpf_perf_event_value {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 3/9] tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 2/9] tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 4/9] tcp: Add unknown_opt arg to tcp_parse_options Martin KaFai Lau
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection.  It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).

A latter selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/inet_connection_sock.h | 1 +
 include/net/tcp.h                  | 2 +-
 include/uapi/linux/bpf.h           | 1 +
 net/core/filter.c                  | 7 +++++++
 net/ipv4/tcp.c                     | 2 ++
 tools/include/uapi/linux/bpf.h     | 1 +
 6 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 8b6d89ac91cc..d54aa3cc71df 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -86,6 +86,7 @@ struct inet_connection_sock {
  	struct timer_list	  icsk_retransmit_timer;
  	struct timer_list	  icsk_delack_timer;
 	__u32			  icsk_rto;
+	__u32                     icsk_rto_min;
 	__u32                     icsk_delack_max;
 	__u32			  icsk_pmtu_cookie;
 	const struct tcp_congestion_ops *icsk_ca_ops;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e0c35d56091f..895e7aabf136 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -697,7 +697,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 static inline u32 tcp_rto_min(struct sock *sk)
 {
 	const struct dst_entry *dst = __sk_dst_get(sk);
-	u32 rto_min = TCP_RTO_MIN;
+	u32 rto_min = inet_csk(sk)->icsk_rto_min;
 
 	if (dst && dst_metric_locked(dst, RTAX_RTO_MIN))
 		rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 993060d9ecf2..d77b7df71784 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4245,6 +4245,7 @@ enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
 	TCP_BPF_DELACK_MAX	= 1003, /* Max delay ack in usecs */
+	TCP_BPF_RTO_MIN		= 1004, /* Min delay ack in usecs */
 };
 
 struct bpf_perf_event_value {
diff --git a/net/core/filter.c b/net/core/filter.c
index 969c6b6b98d0..0a1bf520c55d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4488,6 +4488,13 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 					return -EINVAL;
 				inet_csk(sk)->icsk_delack_max = timeout;
 				break;
+			case TCP_BPF_RTO_MIN:
+				timeout = usecs_to_jiffies(val);
+				if (timeout > TCP_RTO_MIN ||
+				    timeout < TCP_TIMEOUT_MIN)
+					return -EINVAL;
+				inet_csk(sk)->icsk_rto_min = timeout;
+				break;
 			case TCP_SAVE_SYN:
 				if (val < 0 || val > 1)
 					ret = -EINVAL;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 10db079cb9e3..98606f94b01b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -418,6 +418,7 @@ void tcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&tp->tsorted_sent_queue);
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
 	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
 	minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);
@@ -2686,6 +2687,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_backoff = 0;
 	icsk->icsk_probes_out = 0;
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
+	icsk->icsk_rto_min = TCP_RTO_MIN;
 	icsk->icsk_delack_max = TCP_DELACK_MAX;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd = TCP_INIT_CWND;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 993060d9ecf2..d77b7df71784 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4245,6 +4245,7 @@ enum {
 	TCP_BPF_IW		= 1001,	/* Set TCP initial congestion window */
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
 	TCP_BPF_DELACK_MAX	= 1003, /* Max delay ack in usecs */
+	TCP_BPF_RTO_MIN		= 1004, /* Min delay ack in usecs */
 };
 
 struct bpf_perf_event_value {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 4/9] tcp: Add unknown_opt arg to tcp_parse_options
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2020-07-30 20:57 ` [PATCH v3 bpf-next 3/9] tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-31 16:12   ` Eric Dumazet
  2020-07-30 20:57 ` [PATCH v3 bpf-next 5/9] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 Martin KaFai Lau
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

In the latter patch, the bpf prog only wants to be called to handle
a header option if that particular header option cannot be handled by
the kernel.  This unknown option could be written by the peer's bpf-prog.
It could also be a new standard option that the running kernel does not
support it while a bpf-prog can handle it.

In a latter patch, the bpf prog will be called from tcp_validate_incoming()
if there is unknown option and a flag is set in tp->bpf_sock_ops_cb_flags.

Instead of using skb->cb[] in an earlier attempt, this patch
adds an optional arg "bool *unknown_opt" to tcp_parse_options().
The bool will be set to true if it has encountered an option
that the kernel does not recognize.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/infiniband/hw/cxgb4/cm.c |  2 +-
 include/net/tcp.h                |  3 ++-
 net/ipv4/syncookies.c            |  2 +-
 net/ipv4/tcp_input.c             | 40 +++++++++++++++++++++-----------
 net/ipv4/tcp_minisocks.c         |  4 ++--
 net/ipv6/syncookies.c            |  2 +-
 6 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 30e08bcc9afb..dedca6576bb9 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3949,7 +3949,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
 	 */
 	memset(&tmp_opt, 0, sizeof(tmp_opt));
 	tcp_clear_options(&tmp_opt);
-	tcp_parse_options(&init_net, skb, &tmp_opt, 0, NULL);
+	tcp_parse_options(&init_net, skb, &tmp_opt, 0, NULL, NULL);
 
 	req = __skb_push(skb, sizeof(*req));
 	memset(req, 0, sizeof(*req));
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 895e7aabf136..d49d8f1c961a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -413,7 +413,8 @@ int tcp_mmap(struct file *file, struct socket *sock,
 #endif
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
-		       int estab, struct tcp_fastopen_cookie *foc);
+		       int estab, struct tcp_fastopen_cookie *foc,
+		       bool *unknown_opt);
 const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
 
 /*
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 9a4f6b16c9bc..fd39aed4fcd3 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -313,7 +313,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(sock_net(sk), skb, &tcp_opt, 0, NULL);
+	tcp_parse_options(sock_net(sk), skb, &tcp_opt, 0, NULL, NULL);
 
 	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
 		tsoff = secure_tcp_ts_off(sock_net(sk),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6c38ca9de17e..d9c878001be2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3799,7 +3799,7 @@ static void tcp_parse_fastopen_option(int len, const unsigned char *cookie,
 	foc->exp = exp_opt;
 }
 
-static void smc_parse_options(const struct tcphdr *th,
+static bool smc_parse_options(const struct tcphdr *th,
 			      struct tcp_options_received *opt_rx,
 			      const unsigned char *ptr,
 			      int opsize)
@@ -3808,10 +3808,13 @@ static void smc_parse_options(const struct tcphdr *th,
 	if (static_branch_unlikely(&tcp_have_smc)) {
 		if (th->syn && !(opsize & 1) &&
 		    opsize >= TCPOLEN_EXP_SMC_BASE &&
-		    get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC)
+		    get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC) {
 			opt_rx->smc_ok = 1;
+			return true;
+		}
 	}
 #endif
+	return false;
 }
 
 /* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped
@@ -3864,7 +3867,8 @@ static u16 tcp_parse_mss_option(const struct tcphdr *th, u16 user_mss)
 void tcp_parse_options(const struct net *net,
 		       const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx, int estab,
-		       struct tcp_fastopen_cookie *foc)
+		       struct tcp_fastopen_cookie *foc,
+		       bool *unknown_opt)
 {
 	const unsigned char *ptr;
 	const struct tcphdr *th = tcp_hdr(skb);
@@ -3962,15 +3966,23 @@ void tcp_parse_options(const struct net *net,
 				 */
 				if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE &&
 				    get_unaligned_be16(ptr) ==
-				    TCPOPT_FASTOPEN_MAGIC)
+				    TCPOPT_FASTOPEN_MAGIC) {
 					tcp_parse_fastopen_option(opsize -
 						TCPOLEN_EXP_FASTOPEN_BASE,
 						ptr + 2, th->syn, foc, true);
-				else
-					smc_parse_options(th, opt_rx, ptr,
-							  opsize);
+					break;
+				}
+
+				if (smc_parse_options(th, opt_rx, ptr, opsize))
+					break;
+
+				if (unknown_opt)
+					*unknown_opt = true;
 				break;
 
+			default:
+				if (unknown_opt)
+					*unknown_opt = true;
 			}
 			ptr += opsize-2;
 			length -= opsize;
@@ -4003,7 +4015,8 @@ static bool tcp_parse_aligned_timestamp(struct tcp_sock *tp, const struct tcphdr
  */
 static bool tcp_fast_parse_options(const struct net *net,
 				   const struct sk_buff *skb,
-				   const struct tcphdr *th, struct tcp_sock *tp)
+				   const struct tcphdr *th, struct tcp_sock *tp,
+				   bool *unknown_opt)
 {
 	/* In the spirit of fast parsing, compare doff directly to constant
 	 * values.  Because equality is used, short doff can be ignored here.
@@ -4017,7 +4030,7 @@ static bool tcp_fast_parse_options(const struct net *net,
 			return true;
 	}
 
-	tcp_parse_options(net, skb, &tp->rx_opt, 1, NULL);
+	tcp_parse_options(net, skb, &tp->rx_opt, 1, NULL, unknown_opt);
 	if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
 		tp->rx_opt.rcv_tsecr -= tp->tsoffset;
 
@@ -5492,9 +5505,10 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool rst_seq_match = false;
+	bool unknown_opt = false;
 
 	/* RFC1323: H1. Apply PAWS check first. */
-	if (tcp_fast_parse_options(sock_net(sk), skb, th, tp) &&
+	if (tcp_fast_parse_options(sock_net(sk), skb, th, tp, &unknown_opt) &&
 	    tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
 		if (!th->rst) {
@@ -5866,7 +5880,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
 		/* Get original SYNACK MSS value if user MSS sets mss_clamp */
 		tcp_clear_options(&opt);
 		opt.user_mss = opt.mss_clamp = 0;
-		tcp_parse_options(sock_net(sk), synack, &opt, 0, NULL);
+		tcp_parse_options(sock_net(sk), synack, &opt, 0, NULL, NULL);
 		mss = opt.mss_clamp;
 	}
 
@@ -5951,7 +5965,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	int saved_clamp = tp->rx_opt.mss_clamp;
 	bool fastopen_fail;
 
-	tcp_parse_options(sock_net(sk), skb, &tp->rx_opt, 0, &foc);
+	tcp_parse_options(sock_net(sk), skb, &tp->rx_opt, 0, &foc, NULL);
 	if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
 		tp->rx_opt.rcv_tsecr -= tp->tsoffset;
 
@@ -6685,7 +6699,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tmp_opt.mss_clamp = af_ops->mss_clamp;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
 	tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0,
-			  want_cookie ? NULL : &foc);
+			  want_cookie ? NULL : &foc, NULL);
 
 	if (want_cookie && !tmp_opt.saw_tstamp)
 		tcp_clear_options(&tmp_opt);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 495dda2449fe..61f9194802c4 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -98,7 +98,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-		tcp_parse_options(twsk_net(tw), skb, &tmp_opt, 0, NULL);
+		tcp_parse_options(twsk_net(tw), skb, &tmp_opt, 0, NULL, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			if (tmp_opt.rcv_tsecr)
@@ -580,7 +580,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 
 	tmp_opt.saw_tstamp = 0;
 	if (th->doff > (sizeof(struct tcphdr)>>2)) {
-		tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, NULL);
+		tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, NULL, NULL);
 
 		if (tmp_opt.saw_tstamp) {
 			tmp_opt.ts_recent = req->ts_recent;
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 13235a012388..f22961a73c2b 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -157,7 +157,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 
 	/* check for timestamp cookie support */
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
-	tcp_parse_options(sock_net(sk), skb, &tcp_opt, 0, NULL);
+	tcp_parse_options(sock_net(sk), skb, &tcp_opt, 0, NULL, NULL);
 
 	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
 		tsoff = secure_tcpv6_ts_off(sock_net(sk),
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 5/9] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2020-07-30 20:57 ` [PATCH v3 bpf-next 4/9] tcp: Add unknown_opt arg to tcp_parse_options Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 6/9] bpf: tcp: Allow bpf prog to write and parse TCP header option Martin KaFai Lau
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

A latter patch needs to add a few pointers and a few u8 to
sock_ops_kern.  Hence, this patch saves some spaces by moving
some of the existing members from u32 to u8 so that the latter
patch can still fit everything in a cacheline.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h |  4 ++--
 net/core/filter.c      | 15 ++++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 0a355b005bf4..c427dfa5f908 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1236,13 +1236,13 @@ struct bpf_sock_addr_kern {
 
 struct bpf_sock_ops_kern {
 	struct	sock *sk;
-	u32	op;
 	union {
 		u32 args[4];
 		u32 reply;
 		u32 replylong[4];
 	};
-	u32	is_fullsock;
+	u8	op;
+	u8	is_fullsock;
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
 					 * the BPF program. New fields that
diff --git a/net/core/filter.c b/net/core/filter.c
index 0a1bf520c55d..4adfc90b221e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8406,17 +8406,22 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 		return insn - insn_buf;
 
 	switch (si->off) {
-	case offsetof(struct bpf_sock_ops, op) ...
+	case offsetof(struct bpf_sock_ops, op):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       op),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, op));
+		break;
+
+	case offsetof(struct bpf_sock_ops, replylong[0]) ...
 	     offsetof(struct bpf_sock_ops, replylong[3]):
-		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, op) !=
-			     sizeof_field(struct bpf_sock_ops_kern, op));
 		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, reply) !=
 			     sizeof_field(struct bpf_sock_ops_kern, reply));
 		BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, replylong) !=
 			     sizeof_field(struct bpf_sock_ops_kern, replylong));
 		off = si->off;
-		off -= offsetof(struct bpf_sock_ops, op);
-		off += offsetof(struct bpf_sock_ops_kern, op);
+		off -= offsetof(struct bpf_sock_ops, replylong[0]);
+		off += offsetof(struct bpf_sock_ops_kern, replylong[0]);
 		if (type == BPF_WRITE)
 			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
 					      off);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 6/9] bpf: tcp: Allow bpf prog to write and parse TCP header option
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2020-07-30 20:57 ` [PATCH v3 bpf-next 5/9] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-31 16:06   ` Eric Dumazet
  2020-07-30 20:57 ` [PATCH v3 bpf-next 7/9] bpf: selftests: Add fastopen_connect to network_helpers Martin KaFai Lau
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance.  Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most of
the TCP packet except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt().  The helper will ensure there is no duplicated
option in the header.

By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog.  Different bpf-prog can write its
own option kind.  It could also allow the bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The header parsing and writing callback can be turned on
by enabling a few newly added callback flags:

BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG:
	Call bpf when kernel has received a header option that
	the kernel cannot handle.  It is useful when the peer doesn't
	send bpf-options very often.

	The bpf-prog can inspect the received header by sock_ops->skb_data
	which covers the whole header (including the fixed fields like
	ports, flags...etc) or
	use the new bpf_load_hdr_opt() to search for a particular TCP
	header option.

BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG:
	Call bpf for all received TCP header.

	It could be used at the client/active side (i.e. connect() side)
	when the server told it that the server was in syncookie
	mode and required the active side to resend the bpf-written
	options.  The active side can keep writing the bpf-options until
	it received a valid packet from the server side to confirm
	the earlier packet (and options) has been received.  The latter
	example patch is using it like this at the active side when the
	server is in syncookie mode.

	The bpf prog will usually turn this off in the common cases.

When the above PARSE CB flags are turned on, the bpf-prog will
be called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB.
These PARSE CB flags will only ask the kernel to call the bpf-prog when
the tcp packet is received at an already-established sk.
It does not include the SYN-SYNACK-ACK during the 3WHS where the connection
has not been established.  The parsing of the SYN-SYNACK-ACK will be
discussed in the "3 Way HandShake" section.

BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG:
	Call bpf when the kernel is writing header options for the
	outgoing packet.

	The bpf will first be called to reserve the space in a skb during
	sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.  The kernel native options
	will get the spaces first and the bpf can only reserve the remaining
	spaces left.  The bpf prog can reserve through the new
	bpf_reserve_hdr_opt() helper.

	The bpf-prog will then be called to write the option during
	sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
	Again, the kernel will write its native options first before
	calling the bpf-prog.  The bpf-prog will use the new
	bpf_store_hdr_opt() to write the option.  The bpf_store_hdr_opt()
	will ensure the writing option has not already been written by
	the kernel or by the earlier bpf-progs.  This will avoid
	sending duplicated options in the header.

	The bpf-prog can also learn what options have been written
	by the kernel or other bpf-progs by reading sock_ops->skb_data
	or by calling the new bpf_load_hdr_opt() helper.

The default is off for all of the above new CB flags, i.e. the bpf prog will
not be called to parse or write bpf hdr option.

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options.

bpf_load_hdr_opt() helps to search a particular option "kind"
in the TCP header.

When parsing header (sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB),
they are the received skb.

When reserving header space (sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB),
they will not be useful because the header has not been written yet.

When writing header (sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
they are the outgoing skb that the header option will be written into.
The bpf-prog can use them to see what header has been written by the kernel
or the earlier bpf-progs.

When concluding a 3WHS, it is the received skb that completes
the 3WHS:
In ACTIVE_ESTABLISHED_CB, it is the received SYNACK
In PASSIVE_ESTABLISHED_CB, it is usually the received ACK (or the received
SYN-DATA in fastopen).

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog.  The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb.

The bpf-prog can get the whole SYN TCP header by
"bpf_getsockopt(TCP_BPF_SYN)" or search for a particular header option by
"bpf_load_hdr_opt(BPF_LOAD_HDR_OPT_TCP_SYN)".

The bpf prog will also know if it is in syncookie mode
(sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE) or not.

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN).  The example in a latter patch also
does this.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock will add weight
  to the minisock and it is not necessary better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The latter patch will use the max delay ack specified in the SYN
packet as an example.  The received ACK (that concludes the 3WHS)
will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB
through the sock_ops->skb_data and bpf_load_hdr_opt() and it
could be useful in syncookie scenario.  More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.

In the new getsockopt(TCP_BPF_SYN*), the kernel will know
where it can get the SYN's packet from:
  - (a) the just received syn (available when the bpf prog is writing SYNACK)
        and it is the only way to get SYN in syncookie mode.
  or
  - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
        existing CB).

The bpf prog does not need to know where the SYN pkt can be obtained from.
The "TCP_BPF_SYN*" getsockopt will hide this details.

Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to search header option from the SYN packet.

* Fastopen

Fastopen should work the same as the regular non fastopen case.
This is a test in a latter patch.

* Syncookie

For syncookie, the latter example patch asks the active
side's bpf prog to resend the header options in ACK.  The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.

Established Connection
──────────────────────
The bpf prog will be called as long as the parse/write
*_HDR_OPT_CB_FLAG is enabled in bpf_sock_ops_cb_flags.
That will allow the bpf prog to parse/write header options
in the data, pure-ack, and fin packet.

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf-cgroup.h     |  25 +++
 include/linux/filter.h         |   4 +
 include/net/tcp.h              |  53 ++++-
 include/uapi/linux/bpf.h       | 231 ++++++++++++++++++++-
 net/core/filter.c              | 365 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp_fastopen.c        |   2 +-
 net/ipv4/tcp_input.c           |  86 +++++++-
 net/ipv4/tcp_ipv4.c            |   3 +-
 net/ipv4/tcp_minisocks.c       |   1 +
 net/ipv4/tcp_output.c          | 194 ++++++++++++++++--
 net/ipv6/tcp_ipv6.c            |   3 +-
 tools/include/uapi/linux/bpf.h | 231 ++++++++++++++++++++-
 12 files changed, 1171 insertions(+), 27 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 64f367044e25..2f98d2fce62e 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -279,6 +279,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
 #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr)			\
 	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
 
+/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a
+ * fullsock and its parent fullsock cannot be traced by
+ * sk_to_full_sk().
+ *
+ * e.g. sock_ops->sk is a request_sock and it is under syncookie mode.
+ * Its listener-sk is not attached to the rsk_listener.
+ * In this case, the caller holds the listener-sk (unlocked),
+ * set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with
+ * the listener-sk such that the cgroup-bpf-progs of the
+ * listener-sk will be run.
+ *
+ * Regardless of syncookie mode or not,
+ * calling bpf_setsockopt on listener-sk will not make sense anyway,
+ * so passing 'sock_ops->sk == req_sk' to the bpf prog is appropriate here.
+ */
+#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk)			\
+({									\
+	int __ret = 0;							\
+	if (cgroup_bpf_enabled)						\
+		__ret = __cgroup_bpf_run_filter_sock_ops(sk,		\
+							 sock_ops,	\
+							 BPF_CGROUP_SOCK_OPS); \
+	__ret;								\
+})
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
 ({									       \
 	int __ret = 0;							       \
diff --git a/include/linux/filter.h b/include/linux/filter.h
index c427dfa5f908..995625950cc1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1241,8 +1241,12 @@ struct bpf_sock_ops_kern {
 		u32 reply;
 		u32 replylong[4];
 	};
+	struct sk_buff	*syn_skb;
+	struct sk_buff	*skb;
+	void	*skb_data_end;
 	u8	op;
 	u8	is_fullsock;
+	u8	remaining_opt_len;
 	u64	temp;			/* temp and everything after is not
 					 * initialized to 0 before calling
 					 * the BPF program. New fields that
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d49d8f1c961a..0eb43fee7e29 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -394,7 +394,7 @@ void tcp_metrics_init(void);
 bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
-void tcp_init_transfer(struct sock *sk, int bpf_op);
+void tcp_init_transfer(struct sock *sk, bool passive, struct sk_buff *skb);
 __poll_t tcp_poll(struct file *file, struct socket *sock,
 		      struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
@@ -455,6 +455,7 @@ enum tcp_synack_type {
 };
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
+				struct sk_buff *syn_skb,
 				struct tcp_fastopen_cookie *foc,
 				enum tcp_synack_type synack_type);
 int tcp_disconnect(struct sock *sk, int flags);
@@ -2031,6 +2032,7 @@ struct tcp_request_sock_ops {
 	u32 (*init_ts_off)(const struct net *net, const struct sk_buff *skb);
 	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
 			   struct flowi *fl, struct request_sock *req,
+			   struct sk_buff *syn_skb,
 			   struct tcp_fastopen_cookie *foc,
 			   enum tcp_synack_type synack_type);
 };
@@ -2230,6 +2232,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		      struct msghdr *msg, int len, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
 
+#ifdef CONFIG_CGROUP_BPF
+/* Copy the listen sk's HDR_OPT_CB flags to its child.
+ *
+ * During 3-Way-HandShake, the synack is usually sent from
+ * the listen sk with the HDR_OPT_CB flags set so that
+ * bpf-prog will be called to write the BPF hdr option.
+ *
+ * In fastopen, the child sk is used to send synack instead
+ * of the listen sk.  Thus, inheriting the HDR_OPT_CB flags
+ * from the listen sk gives the bpf-prog a chance to write
+ * BPF hdr option in the synack pkt during fastopen.
+ *
+ * Both fastopen and non-fastopen child will inherit the
+ * HDR_OPT_CB flags to keep the bpf-prog having a consistent
+ * behavior when deciding to clear this cb flags (or not)
+ * during the PASSIVE_ESTABLISHED_CB.
+ *
+ * In the future, other cb flags could be inherited here also.
+ */
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+	tcp_sk(child)->bpf_sock_ops_cb_flags =
+		tcp_sk(sk)->bpf_sock_ops_cb_flags &
+		(BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG |
+		 BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
+		 BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      unsigned int end_offset)
+{
+	skops->skb = skb;
+	skops->skb_data_end = skb->data + end_offset;
+}
+#else
+static inline void bpf_skops_init_child(const struct sock *sk,
+					struct sock *child)
+{
+}
+
+static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops,
+				      struct sk_buff *skb,
+				      unsigned int end_offset)
+{
+}
+#endif
+
 /* Call BPF_SOCK_OPS program that returns an int. If the return value
  * is < 0, then the BPF op failed (for example if the loaded BPF
  * program does not support the chosen operation or there is no BPF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d77b7df71784..c6162ee573e9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3389,6 +3389,114 @@ union bpf_attr {
  *		A non-negative value equal to or less than *size* on success,
  *		or a negative error in case of failure.
  *
+ * long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags)
+ *	Description
+ *		Load Header option.  Support loading TCP header option
+ *		of a particular kind in BPF_PROG_TYPE_SOCK_OPS.
+ *
+ *		The first byte of the *searchby_res* specifies the
+ *		kind that it wants to search.
+ *
+ *		If the searching kind is an experimental kind
+ *		(i.e. 253 or 254 according to RFC6994).  It also
+ *		needs to specify the "magic" which is either
+ *		2 bytes or 4 bytes.  It then also needs to
+ *		specify the size of the magic by using
+ *		the 2nd byte which is "kind-length" of a TCP
+ *		header option and the "kind-length" also
+ *		includes the first 2 bytes "kind" and "kind-length"
+ *		itself as a normal TCP header option also does.
+ *
+ *		For example, to search experimental kind 254 with
+ *		2 byte magic 0xeB9F, the searchby_res should be
+ *		[ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ].
+ *
+ *		To search for the standard window scale option (3),
+ *		the searchby_res should be [ 3, 0, 0, .... 0 ].
+ *		Note, kind-length must be 0 for regular option.
+ *
+ *		Searching for No-Op (0) and End-of-Option-List (1) are
+ *		not supported.
+ *
+ *		*len* must be at least 2 bytes which is the minimal size
+ *		of a header option.
+ *
+ *		Supported flags:
+ *		* **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the
+ *		  saved_syn packet or the just-received syn packet.
+ *
+ *	Return
+ *		>0 when found, the header option is copied to *searchby_res*.
+ *		The return value is the total length copied.
+ *
+ *		**-EINVAL** If param is invalid
+ *
+ *		**-ENOMSG** The option is not found
+ *
+ *		**-ENOENT** No syn packet available when
+ *			    **BPF_LOAD_HDR_OPT_TCP_SYN** is used
+ *
+ *		**-ENOSPC** Not enough space.  Only *len* number of
+ *			    bytes are copied.
+ *
+ *		**-EFAULT** Cannot parse the existing header options
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags)
+ *	Description
+ *		Store BPF header option.  The data will be copied
+ *		from buffer *from* with length *len* to the TCP header.
+ *
+ *		The buffer *from* should have the whole option that
+ *		includes the kind, kind-length, and the actual
+ *		option data.  The *len* must be at least kind-length
+ *		long.  The kind-length does not have to be 4 byte
+ *		aligned.  The kernel will take care of the padding
+ *		and setting the 4 bytes aligned value to th->doff.
+ *
+ *		This helper will check for duplicated option
+ *		by searching the same option in the outgoing skb.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** If param is invalid
+ *
+ *		**-ENOSPC** Not enough space.  Nothing has been written
+ *
+ *		**-EEXIST** The writing option has already existed
+ *
+ *		**-EFAULT** Cannot parse the existing header options
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
+ *	Description
+ *		Reserve *len* bytes for the bpf header option.  The
+ *		space will be used by bpf_store_hdr_opt() later in
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+ *
+ *		If bpf_reserve_hdr_opt() is called multiple times,
+ *		the total number of bytes will be reserved.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_HDR_OPT_LEN_CB.
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-ENOSPC** Not enough remaining space in the header.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3533,6 +3641,9 @@ union bpf_attr {
 	FN(skc_to_tcp_request_sock),	\
 	FN(skc_to_udp6_sock),		\
 	FN(get_task_stack),		\
+	FN(load_hdr_opt),		\
+	FN(store_hdr_opt),		\
+	FN(reserve_hdr_opt),
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4152,6 +4263,10 @@ struct bpf_sock_ops {
 	__u64 bytes_received;
 	__u64 bytes_acked;
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, skb_data);
+	__bpf_md_ptr(void *, skb_data_end);
+	__u32 skb_len;
+	__u32 skb_tcp_flags;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -4160,8 +4275,11 @@ enum {
 	BPF_SOCK_OPS_RETRANS_CB_FLAG	= (1<<1),
 	BPF_SOCK_OPS_STATE_CB_FLAG	= (1<<2),
 	BPF_SOCK_OPS_RTT_CB_FLAG	= (1<<3),
+	BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG	= (1<<4),
+	BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5),
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xF,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
 };
 
 /* List of known BPF sock_ops operators.
@@ -4217,6 +4335,63 @@ enum {
 					 */
 	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
 					 */
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB,	/* Parse the header option.
+					 * It will be called to handle
+					 * the packets received at
+					 * an already established
+					 * connection.
+					 *
+					 * sock_ops->skb_data:
+					 * Referring to the received skb.
+					 * It covers the TCP header only.
+					 *
+					 * bpf_load_hdr_opt() can also
+					 * be used to search for a
+					 * particular option.
+					 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CB,	/* Reserve space for writing the
+					 * header option later in
+					 * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+					 * Arg1: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 * Not available because no header has
+					 * been	written yet.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 * The tcp_flags of the
+					 * outgoing skb. (e.g. SYN, ACK, FIN).
+					 *
+					 * bpf_reserve_hdr_opt() should
+					 * be used to reserve space.
+					 */
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB,	/* Write the header options
+					 * Arg1: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 * Referring to the outgoing skb.
+					 * It covers the TCP header
+					 * that has already been written
+					 * by the kernel and the
+					 * earlier bpf-progs.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 * The tcp_flags of the outgoing
+					 * skb. (e.g. SYN, ACK, FIN).
+					 *
+					 * bpf_store_hdr_opt() should
+					 * be used to write the
+					 * option.
+					 *
+					 * bpf_load_hdr_opt() can also
+					 * be used to search for a
+					 * particular option that
+					 * has already been written
+					 * by the kernel or the
+					 * earlier bpf-progs.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -4246,6 +4421,60 @@ enum {
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
 	TCP_BPF_DELACK_MAX	= 1003, /* Max delay ack in usecs */
 	TCP_BPF_RTO_MIN		= 1004, /* Min delay ack in usecs */
+	/* Copy the SYN pkt to optval
+	 *
+	 * BPF_PROG_TYPE_SOCK_OPS only.  It is similar to the
+	 * bpf_getsockopt(TCP_SAVED_SYN) but it does not limit
+	 * to only getting from the saved_syn.  It can either get the
+	 * syn packet from:
+	 *
+	 * 1. the just-received SYN packet (only available when writing the
+	 *    SYNACK).  It will be useful when it is not necessary to
+	 *    save the SYN packet for latter use.  It is also the only way
+	 *    to get the SYN during syncookie mode because the syn
+	 *    packet cannot be saved during syncookie.
+	 *
+	 * OR
+	 *
+	 * 2. the earlier saved syn which was done by
+	 *    bpf_setsockopt(TCP_SAVE_SYN).
+	 *
+	 * The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the
+	 * SYN packet is obtained.
+	 *
+	 * If the bpf-prog does not need the IP[46] header,  the
+	 * bpf-prog can avoid parsing the IP header by using
+	 * TCP_BPF_SYN.  Otherwise, the bpf-prog can get both
+	 * IP[46] and TCP header by using TCP_BPF_SYN_IP.
+	 *
+	 *      >0: Total number of bytes copied
+	 * -ENOSPC: Not enough space in optval. Only optlen number of
+	 *          bytes is copied.
+	 * -ENOENT: The SYN skb is not available now and the earlier SYN pkt
+	 *	    is not saved by setsockopt(TCP_SAVE_SYN).
+	 */
+	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
+	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
+};
+
+enum {
+	BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0),
+};
+
+/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and
+ * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+ */
+enum {
+	BPF_WRITE_HDR_TCP_CURRENT_MSS = 1,	/* Kernel is finding the
+						 * total option spaces
+						 * required for an established
+						 * sk in order to calculate the
+						 * the MSS.  No skb is actually
+						 * sent.
+						 */
+	BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2,	/* Kernel is in syncookie mode
+						 * when sending a SYN.
+						 */
 };
 
 struct bpf_perf_event_value {
diff --git a/net/core/filter.c b/net/core/filter.c
index 4adfc90b221e..607f87901122 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4669,9 +4669,82 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
+				int optname, const u8 **start)
+{
+	struct sk_buff *syn_skb = bpf_sock->syn_skb;
+	const u8 *hdr_start;
+	int ret;
+
+	if (syn_skb) {
+		/* sk is a request_sock here */
+
+		if (optname == TCP_BPF_SYN) {
+			hdr_start = syn_skb->data;
+			ret = tcp_hdrlen(syn_skb);
+		} else {
+			/* optname == TCP_BPF_SYN_IP */
+			hdr_start = skb_network_header(syn_skb);
+			ret = skb_network_header_len(syn_skb) +
+				tcp_hdrlen(syn_skb);
+		}
+	} else {
+		struct sock *sk = bpf_sock->sk;
+		struct saved_syn *saved_syn;
+
+		if (sk->sk_state == TCP_NEW_SYN_RECV)
+			/* synack retransmit. bpf_sock->syn_skb will
+			 * not be available.  It has to resort to
+			 * saved_syn (if it is saved).
+			 */
+			saved_syn = inet_reqsk(sk)->saved_syn;
+		else
+			saved_syn = tcp_sk(sk)->saved_syn;
+
+		if (!saved_syn)
+			return -ENOENT;
+
+		if (optname == TCP_BPF_SYN) {
+			hdr_start = saved_syn->data +
+				saved_syn->network_hdrlen;
+			ret = saved_syn->tcp_hdrlen;
+		} else {
+			/* optname == TCP_BPF_SYN_IP */
+			hdr_start = saved_syn->data;
+			ret = saved_syn->network_hdrlen +
+				saved_syn->tcp_hdrlen;
+		}
+	}
+
+	*start = hdr_start;
+	return ret;
+}
+
 BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
+	if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
+	    optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_IP) {
+		int ret, copy_len = 0;
+		const u8 *start;
+
+		ret = bpf_sock_ops_get_syn(bpf_sock, optname, &start);
+		if (ret > 0) {
+			copy_len = ret;
+			if (optlen < copy_len) {
+				copy_len = optlen;
+				ret = -ENOSPC;
+			}
+
+			memcpy(optval, start, copy_len);
+		}
+
+		/* Zero out unused buffer at the end */
+		memset(optval + copy_len, 0, optlen - copy_len);
+
+		return ret;
+	}
+
 	return _bpf_getsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
 
@@ -6165,6 +6238,232 @@ static const struct bpf_func_proto bpf_sk_assign_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend,
+				    u8 search_kind, const u8 *magic,
+				    u8 magic_len, bool *eol)
+{
+	u8 kind, kind_len;
+
+	*eol = false;
+
+	while (op < opend) {
+		kind = op[0];
+
+		if (kind == TCPOPT_EOL) {
+			*eol = true;
+			return ERR_PTR(-ENOMSG);
+		} else if (kind == TCPOPT_NOP) {
+			op++;
+			continue;
+		}
+
+		if (opend - op < 2 || opend - op < op[1] || op[1] < 2)
+			/* Something is wrong in the received header.
+			 * Follow the TCP stack's tcp_parse_options()
+			 * and just bail here.
+			 */
+			return ERR_PTR(-EFAULT);
+
+		kind_len = op[1];
+		if (search_kind == kind) {
+			if (!magic_len)
+				return op;
+
+			if (magic_len > kind_len - 2)
+				return ERR_PTR(-ENOMSG);
+
+			if (!memcmp(&op[2], magic, magic_len))
+				return op;
+		}
+
+		op += kind_len;
+	}
+
+	return ERR_PTR(-ENOMSG);
+}
+
+BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
+	   void *, search_res, u32, len, u64, flags)
+{
+	bool eol, load_syn = flags & BPF_LOAD_HDR_OPT_TCP_SYN;
+	const u8 *op, *opend, *magic, *search = search_res;
+	u8 search_kind, search_len, copy_len, magic_len;
+	int ret;
+
+	/* 2 byte is the minimal option len except TCPOPT_NOP and
+	 * TCPOPT_EOL which are useless for the bpf prog to learn
+	 * and this helper disallow loading them also.
+	 */
+	if (len < 2 || flags & ~BPF_LOAD_HDR_OPT_TCP_SYN)
+		return -EINVAL;
+
+	search_kind = search[0];
+	search_len = search[1];
+
+	if (search_len > len || search_kind == TCPOPT_NOP ||
+	    search_kind == TCPOPT_EOL)
+		return -EINVAL;
+
+	if (search_kind == TCPOPT_EXP || search_kind == 253) {
+		/* 16 or 32 bit magic.  +2 for kind and kind length */
+		if (search_len != 4 && search_len != 6)
+			return -EINVAL;
+		magic = &search[2];
+		magic_len = search_len - 2;
+	} else {
+		if (search_len)
+			return -EINVAL;
+		magic = NULL;
+		magic_len = 0;
+	}
+
+	if (load_syn) {
+		ret = bpf_sock_ops_get_syn(bpf_sock, TCP_BPF_SYN, &op);
+		if (ret < 0)
+			return ret;
+
+		opend = op + ret;
+		op += sizeof(struct tcphdr);
+	} else {
+		if (!bpf_sock->skb ||
+		    bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB)
+			/* This bpf_sock->op cannot call this helper */
+			return -EPERM;
+
+		opend = bpf_sock->skb_data_end;
+		op = bpf_sock->skb->data + sizeof(struct tcphdr);
+	}
+
+	op = bpf_search_tcp_opt(op, opend, search_kind, magic, magic_len,
+				&eol);
+	if (IS_ERR(op))
+		return PTR_ERR(op);
+
+	copy_len = op[1];
+	ret = copy_len;
+	if (copy_len > len) {
+		ret = -ENOSPC;
+		copy_len = len;
+	}
+
+	memcpy(search_res, op, copy_len);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_load_hdr_opt_proto = {
+	.func		= bpf_sock_ops_load_hdr_opt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_4(bpf_sock_ops_store_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
+	   const void *, from, u32, len, u64, flags)
+{
+	u8 new_kind, new_kind_len, magic_len = 0, *opend;
+	const u8 *op, *new_op, *magic = NULL;
+	struct sk_buff *skb;
+	bool eol;
+
+	if (bpf_sock->op != BPF_SOCK_OPS_WRITE_HDR_OPT_CB)
+		return -EPERM;
+
+	if (len < 2 || flags)
+		return -EINVAL;
+
+	new_op = from;
+	new_kind = new_op[0];
+	new_kind_len = new_op[1];
+
+	if (new_kind_len > len || new_kind == TCPOPT_NOP ||
+	    new_kind == TCPOPT_EOL)
+		return -EINVAL;
+
+	if (new_kind_len > bpf_sock->remaining_opt_len)
+		return -ENOSPC;
+
+	/* 253 is another experimental kind */
+	if (new_kind == TCPOPT_EXP || new_kind == 253)  {
+		if (new_kind_len < 4)
+			return -EINVAL;
+		/* Match for the 2 byte magic also.
+		 * RFC 6994: the magic could be 2 or 4 bytes.
+		 * Hence, matching by 2 byte only is on the
+		 * conservative side but it is the right
+		 * thing to do for the 'search-for-duplication'
+		 * purpose.
+		 */
+		magic = &new_op[2];
+		magic_len = 2;
+	}
+
+	/* Check for duplication */
+	skb = bpf_sock->skb;
+	op = skb->data + sizeof(struct tcphdr);
+	opend = bpf_sock->skb_data_end;
+
+	op = bpf_search_tcp_opt(op, opend, new_kind, magic, magic_len,
+				&eol);
+	if (!IS_ERR(op))
+		return -EEXIST;
+
+	if (PTR_ERR(op) != -ENOMSG)
+		return PTR_ERR(op);
+
+	if (eol)
+		/* The option has been ended.  Treat it as no more
+		 * header option can be written.
+		 */
+		return -ENOSPC;
+
+	/* No duplication found.  Store the header option. */
+	memcpy(opend, from, new_kind_len);
+
+	bpf_sock->remaining_opt_len -= new_kind_len;
+	bpf_sock->skb_data_end += new_kind_len;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_store_hdr_opt_proto = {
+	.func		= bpf_sock_ops_store_hdr_opt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock,
+	   u32, len, u64, flags)
+{
+	if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB)
+		return -EPERM;
+
+	if (flags || len < 2)
+		return -EINVAL;
+
+	if (len > bpf_sock->remaining_opt_len)
+		return -ENOSPC;
+
+	bpf_sock->remaining_opt_len -= len;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = {
+	.func		= bpf_sock_ops_reserve_hdr_opt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -6193,6 +6492,9 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_lwt_seg6_store_bytes ||
 	    func == bpf_lwt_seg6_adjust_srh ||
 	    func == bpf_lwt_seg6_action ||
+#endif
+#ifdef CONFIG_INET
+	    func == bpf_sock_ops_store_hdr_opt ||
 #endif
 	    func == bpf_lwt_in_push_encap ||
 	    func == bpf_lwt_xmit_push_encap)
@@ -6565,6 +6867,12 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_sk_storage_delete:
 		return &bpf_sk_storage_delete_proto;
 #ifdef CONFIG_INET
+	case BPF_FUNC_load_hdr_opt:
+		return &bpf_sock_ops_load_hdr_opt_proto;
+	case BPF_FUNC_store_hdr_opt:
+		return &bpf_sock_ops_store_hdr_opt_proto;
+	case BPF_FUNC_reserve_hdr_opt:
+		return &bpf_sock_ops_reserve_hdr_opt_proto;
 	case BPF_FUNC_tcp_sock:
 		return &bpf_tcp_sock_proto;
 #endif /* CONFIG_INET */
@@ -7364,6 +7672,20 @@ static bool sock_ops_is_valid_access(int off, int size,
 				return false;
 			info->reg_type = PTR_TO_SOCKET_OR_NULL;
 			break;
+		case offsetof(struct bpf_sock_ops, skb_data):
+			if (size != sizeof(__u64))
+				return false;
+			info->reg_type = PTR_TO_PACKET;
+			break;
+		case offsetof(struct bpf_sock_ops, skb_data_end):
+			if (size != sizeof(__u64))
+				return false;
+			info->reg_type = PTR_TO_PACKET_END;
+			break;
+		case offsetof(struct bpf_sock_ops, skb_tcp_flags):
+			bpf_ctx_record_field_size(info, size_default);
+			return bpf_ctx_narrow_access_ok(off, size,
+							size_default);
 		default:
 			if (size != size_default)
 				return false;
@@ -8652,6 +8974,49 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      si->dst_reg, si->src_reg,
 				      offsetof(struct bpf_sock_ops_kern, sk));
 		break;
+	case offsetof(struct bpf_sock_ops, skb_data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb_data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb_data_end));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct sk_buff, data));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_len):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct sk_buff, len));
+		break;
+	case offsetof(struct bpf_sock_ops, skb_tcp_flags):
+		off = offsetof(struct sk_buff, cb);
+		off += offsetof(struct tcp_skb_cb, tcp_flags);
+		*target_size = sizeof_field(struct tcp_skb_cb, tcp_flags);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern,
+						       skb),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern,
+					       skb));
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_skb_cb,
+						       tcp_flags),
+				      si->dst_reg, si->dst_reg, off);
+		break;
 	}
 	return insn - insn_buf;
 }
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 19ad9586c720..dd70b04da52a 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -272,7 +272,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	refcount_set(&req->rsk_refcnt, 2);
 
 	/* Now finish processing the fastopen child socket. */
-	tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+	tcp_init_transfer(child, true, skb);
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d9c878001be2..43520dcaafe3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -138,6 +138,77 @@ void clean_acked_data_flush(void)
 EXPORT_SYMBOL_GPL(clean_acked_data_flush);
 #endif
 
+#ifdef CONFIG_CGROUP_BPF
+static void bpf_skops_parse_bpf_hdr(struct sock *sk, struct sk_buff *skb,
+				    bool unknown_opt)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	if (likely(!((unknown_opt &&
+		      BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					     BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG)) ||
+		     BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					    BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG))))
+		return;
+
+	/* (unknown_opt && BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG) ||
+	 * BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
+	 */
+
+	/* The skb will be handled in the
+	 * BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB or
+	 * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB.
+	 */
+	switch (sk->sk_state) {
+	case TCP_SYN_RECV:
+	case TCP_LISTEN:
+	case TCP_SYN_SENT:
+		return;
+	}
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+
+static void bpf_skops_established(struct sock *sk, bool passive,
+				  struct sk_buff *skb)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+	if (passive)
+		sock_ops.op = BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB;
+	else
+		sock_ops.op = BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+	sock_ops.is_fullsock = 1;
+	sock_ops.sk = sk;
+	/* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
+
+	BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+}
+#else
+static void bpf_skops_parse_bpf_hdr(struct sock *sk, struct sk_buff *skb,
+				    bool unknown_opt)
+{
+}
+
+static void bpf_skops_established(struct sock *sk, bool passive,
+				  struct sk_buff *skb)
+{
+}
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -5602,6 +5673,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 	}
 
+	bpf_skops_parse_bpf_hdr(sk, skb, unknown_opt);
+
 	return true;
 
 discard:
@@ -5810,7 +5883,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
-void tcp_init_transfer(struct sock *sk, int bpf_op)
+void tcp_init_transfer(struct sock *sk, bool passive, struct sk_buff *skb)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -5831,7 +5904,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
 		tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk));
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	tcp_call_bpf(sk, bpf_op, 0, NULL);
+	bpf_skops_established(sk, passive, skb);
 	tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
@@ -5850,7 +5923,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 		sk_mark_napi_id(sk, skb);
 	}
 
-	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
+	tcp_init_transfer(sk, false, skb);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data
 	 * packet.
@@ -6322,7 +6395,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else {
 			tcp_try_undo_spurious_syn(sk);
 			tp->retrans_stamp = 0;
-			tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
+			tcp_init_transfer(sk, true, skb);
 			WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 		}
 		smp_mb();
@@ -6768,7 +6841,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
 	}
 	if (fastopen_sk) {
-		af_ops->send_synack(fastopen_sk, dst, &fl, req,
+		af_ops->send_synack(fastopen_sk, dst, &fl, req, skb,
 				    &foc, TCP_SYNACK_FASTOPEN);
 		/* Add the child socket directly into the accept queue */
 		if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
@@ -6785,7 +6858,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		if (!want_cookie)
 			inet_csk_reqsk_queue_hash_add(sk, req,
 				tcp_timeout_init((struct sock *)req));
-		af_ops->send_synack(sk, dst, &fl, req, &foc,
+
+		af_ops->send_synack(sk, dst, &fl, req, skb, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :
 						   TCP_SYNACK_COOKIE);
 		if (want_cookie) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5084333b5ab6..f2f8577e2427 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -964,6 +964,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
 static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
+			      struct sk_buff *syn_skb,
 			      struct tcp_fastopen_cookie *foc,
 			      enum tcp_synack_type synack_type)
 {
@@ -976,7 +977,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
 	if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
 		return -1;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, syn_skb, foc, synack_type);
 
 	if (skb) {
 		__tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 61f9194802c4..7819243c97f0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->fastopen_req = NULL;
 	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
 
+	bpf_skops_init_child(sk, newsk);
 	tcp_bpf_clone(sk, newsk);
 
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 984114087263..fa0d1d94271e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,6 +438,7 @@ struct tcp_out_options {
 	u8 ws;			/* window scale, 0 to disable */
 	u8 num_sack_blocks;	/* number of SACK blocks to include */
 	u8 hash_size;		/* bytes in hash_location */
+	u8 bpf_opt_len;		/* length of BPF hdr option */
 	__u8 *hash_location;	/* temporary pointer, overloaded */
 	__u32 tsval, tsecr;	/* need to include OPTION_TS */
 	struct tcp_fastopen_cookie *fastopen_cookie;	/* Fast open cookie */
@@ -452,6 +453,144 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts)
 #endif
 }
 
+#ifdef CONFIG_CGROUP_BPF
+static int bpf_skops_write_hdr_opt_arg0(struct sk_buff *skb,
+					enum tcp_synack_type synack_type)
+{
+	if (unlikely(!skb))
+		return BPF_WRITE_HDR_TCP_CURRENT_MSS;
+
+	if (unlikely(synack_type == TCP_SYNACK_COOKIE))
+		return BPF_WRITE_HDR_TCP_SYNACK_COOKIE;
+
+	return 0;
+}
+
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  enum tcp_synack_type synack_type,
+				  struct tcp_out_options *opts,
+				  unsigned int *remaining)
+{
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
+					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
+	    !*remaining)
+		return;
+
+	/* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */
+
+	/* init sock_ops */
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB;
+
+	if (req) {
+		/* The listen "sk" cannot be passed here because
+		 * it is not locked.  It would not make too much
+		 * sense to do bpf_setsockopt(listen_sk) based
+		 * on individual connection request also.
+		 *
+		 * Thus, "req" is passed here and the cgroup-bpf-progs
+		 * of the listen "sk" will be run.
+		 *
+		 * "req" is also used here for fastopen even the "sk" here is
+		 * a fullsock "child" sk.  It is to keep the behavior
+		 * consistent between fastopen and non-fastopen on
+		 * the bpf programming side.
+		 */
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
+	sock_ops.remaining_opt_len = *remaining;
+	/* tcp_current_mss() does not pass a skb */
+	if (skb)
+		bpf_skops_init_skb(&sock_ops, skb, 0);
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	if (err || sock_ops.remaining_opt_len == *remaining)
+		return;
+
+	opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len;
+	/* round up to 4 bytes */
+	opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3;
+
+	*remaining -= opts->bpf_opt_len;
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    enum tcp_synack_type synack_type,
+				    struct tcp_out_options *opts)
+{
+	u8 first_opt_off, nr_written, max_opt_len = opts->bpf_opt_len;
+	struct bpf_sock_ops_kern sock_ops;
+	int err;
+
+	if (likely(!max_opt_len))
+		return;
+
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
+
+	sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB;
+
+	if (req) {
+		sock_ops.sk = (struct sock *)req;
+		sock_ops.syn_skb = syn_skb;
+	} else {
+		sock_owned_by_me(sk);
+
+		sock_ops.is_fullsock = 1;
+		sock_ops.sk = sk;
+	}
+
+	sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type);
+	sock_ops.remaining_opt_len = max_opt_len;
+	first_opt_off = tcp_hdrlen(skb) - max_opt_len;
+	bpf_skops_init_skb(&sock_ops, skb, first_opt_off);
+
+	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
+
+	if (err)
+		nr_written = 0;
+	else
+		nr_written = max_opt_len - sock_ops.remaining_opt_len;
+
+	if (nr_written < max_opt_len)
+		memset(skb->data + first_opt_off + nr_written, TCPOPT_NOP,
+		       max_opt_len - nr_written);
+}
+#else
+static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
+				  struct request_sock *req,
+				  struct sk_buff *syn_skb,
+				  enum tcp_synack_type synack_type,
+				  struct tcp_out_options *opts,
+				  unsigned int *remaining)
+{
+}
+
+static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
+				    struct request_sock *req,
+				    struct sk_buff *syn_skb,
+				    enum tcp_synack_type synack_type,
+				    struct tcp_out_options *opts)
+{
+}
+#endif
+
 /* Write previously computed TCP options to the packet.
  *
  * Beware: Something in the Internet is very sensitive to the ordering of
@@ -691,6 +830,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
 		}
 	}
 
+	bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
@@ -701,7 +842,8 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 				       struct tcp_out_options *opts,
 				       const struct tcp_md5sig_key *md5,
 				       struct tcp_fastopen_cookie *foc,
-				       enum tcp_synack_type synack_type)
+				       enum tcp_synack_type synack_type,
+				       struct sk_buff *syn_skb)
 {
 	struct inet_request_sock *ireq = inet_rsk(req);
 	unsigned int remaining = MAX_TCP_OPTION_SPACE;
@@ -758,15 +900,19 @@ static unsigned int tcp_synack_options(const struct sock *sk,
 
 	smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining);
 
+	bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb,
+			      synack_type, opts, &remaining);
+
 	return MAX_TCP_OPTION_SPACE - remaining;
 }
 
 /* Compute TCP options for ESTABLISHED sockets. This is not the
  * final wire format yet.
  */
-static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb,
-					struct tcp_out_options *opts,
-					struct tcp_md5sig_key **md5)
+static unsigned int tcp_established_options(struct sock *sk,
+					    struct sk_buff *skb,
+					    struct tcp_out_options *opts,
+					    struct tcp_md5sig_key **md5)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	unsigned int size = 0;
@@ -826,6 +972,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
 	}
 
+	if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
+					    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
+		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+
+		bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
+
+		size = MAX_TCP_OPTION_SPACE - remaining;
+	}
+
 	return size;
 }
 
@@ -1213,6 +1368,9 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	}
 #endif
 
+	/* BPF prog is the last one writing header option */
+	bpf_skops_write_hdr_opt(sk, skb, NULL, NULL, 0, &opts);
+
 	INDIRECT_CALL_INET(icsk->icsk_af_ops->send_check,
 			   tcp_v6_send_check, tcp_v4_send_check,
 			   sk, skb);
@@ -3336,18 +3494,18 @@ int tcp_send_synack(struct sock *sk)
 }
 
 /**
- * tcp_make_synack - Prepare a SYN-ACK.
- * sk: listener socket
- * dst: dst entry attached to the SYNACK
- * req: request_sock pointer
- * foc: cookie for tcp fast open
- * synack_type: Type of synback to prepare
- *
- * Allocate one skb and build a SYNACK packet.
- * @dst is consumed : Caller should not use it again.
+ * tcp_make_synack - Allocate one skb and build a SYNACK packet.
+ * @sk: listener socket
+ * @dst: dst entry attached to the SYNACK. It is consumed and caller
+ *       should not use it again.
+ * @req: request_sock pointer
+ * @syn_skb: SYN packet just received.  It could be NULL for rtx case.
+ * @foc: cookie for tcp fast open
+ * @synack_type: Type of synack to prepare
  */
 struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 				struct request_sock *req,
+				struct sk_buff *syn_skb,
 				struct tcp_fastopen_cookie *foc,
 				enum tcp_synack_type synack_type)
 {
@@ -3408,8 +3566,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req));
 #endif
 	skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);
+	/* bpf program will be interested in the tcp_flags */
+	TCP_SKB_CB(skb)->tcp_flags = TCPHDR_SYN | TCPHDR_ACK;
 	tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5,
-					     foc, synack_type) + sizeof(*th);
+					     foc, synack_type,
+					     syn_skb) + sizeof(*th);
 
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
@@ -3431,6 +3592,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	th->window = htons(min(req->rsk_rcv_wnd, 65535U));
 	tcp_options_write((__be32 *)(th + 1), NULL, &opts);
 	th->doff = (tcp_header_size >> 2);
+	bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb,
+				synack_type, &opts);
 	__TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -3936,7 +4099,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
 	int res;
 
 	tcp_rsk(req)->txhash = net_tx_rndhash();
-	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL);
+	res = af_ops->send_synack(sk, NULL, &fl, req, NULL, NULL,
+				  TCP_SYNACK_NORMAL);
 	if (!res) {
 		__TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 305870a72352..003bb39a2fec 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -500,6 +500,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 			      struct flowi *fl,
 			      struct request_sock *req,
+			      struct sk_buff *syn_skb,
 			      struct tcp_fastopen_cookie *foc,
 			      enum tcp_synack_type synack_type)
 {
@@ -515,7 +516,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst,
 					       IPPROTO_TCP)) == NULL)
 		goto done;
 
-	skb = tcp_make_synack(sk, dst, req, foc, synack_type);
+	skb = tcp_make_synack(sk, dst, req, syn_skb, foc, synack_type);
 
 	if (skb) {
 		__tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d77b7df71784..c6162ee573e9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3389,6 +3389,114 @@ union bpf_attr {
  *		A non-negative value equal to or less than *size* on success,
  *		or a negative error in case of failure.
  *
+ * long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags)
+ *	Description
+ *		Load Header option.  Support loading TCP header option
+ *		of a particular kind in BPF_PROG_TYPE_SOCK_OPS.
+ *
+ *		The first byte of the *searchby_res* specifies the
+ *		kind that it wants to search.
+ *
+ *		If the searching kind is an experimental kind
+ *		(i.e. 253 or 254 according to RFC6994).  It also
+ *		needs to specify the "magic" which is either
+ *		2 bytes or 4 bytes.  It then also needs to
+ *		specify the size of the magic by using
+ *		the 2nd byte which is "kind-length" of a TCP
+ *		header option and the "kind-length" also
+ *		includes the first 2 bytes "kind" and "kind-length"
+ *		itself as a normal TCP header option also does.
+ *
+ *		For example, to search experimental kind 254 with
+ *		2 byte magic 0xeB9F, the searchby_res should be
+ *		[ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ].
+ *
+ *		To search for the standard window scale option (3),
+ *		the searchby_res should be [ 3, 0, 0, .... 0 ].
+ *		Note, kind-length must be 0 for regular option.
+ *
+ *		Searching for No-Op (0) and End-of-Option-List (1) are
+ *		not supported.
+ *
+ *		*len* must be at least 2 bytes which is the minimal size
+ *		of a header option.
+ *
+ *		Supported flags:
+ *		* **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the
+ *		  saved_syn packet or the just-received syn packet.
+ *
+ *	Return
+ *		>0 when found, the header option is copied to *searchby_res*.
+ *		The return value is the total length copied.
+ *
+ *		**-EINVAL** If param is invalid
+ *
+ *		**-ENOMSG** The option is not found
+ *
+ *		**-ENOENT** No syn packet available when
+ *			    **BPF_LOAD_HDR_OPT_TCP_SYN** is used
+ *
+ *		**-ENOSPC** Not enough space.  Only *len* number of
+ *			    bytes are copied.
+ *
+ *		**-EFAULT** Cannot parse the existing header options
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags)
+ *	Description
+ *		Store BPF header option.  The data will be copied
+ *		from buffer *from* with length *len* to the TCP header.
+ *
+ *		The buffer *from* should have the whole option that
+ *		includes the kind, kind-length, and the actual
+ *		option data.  The *len* must be at least kind-length
+ *		long.  The kind-length does not have to be 4 byte
+ *		aligned.  The kernel will take care of the padding
+ *		and setting the 4 bytes aligned value to th->doff.
+ *
+ *		This helper will check for duplicated option
+ *		by searching the same option in the outgoing skb.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** If param is invalid
+ *
+ *		**-ENOSPC** Not enough space.  Nothing has been written
+ *
+ *		**-EEXIST** The writing option has already existed
+ *
+ *		**-EFAULT** Cannot parse the existing header options
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
+ *
+ * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
+ *	Description
+ *		Reserve *len* bytes for the bpf header option.  The
+ *		space will be used by bpf_store_hdr_opt() later in
+ *		BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+ *
+ *		If bpf_reserve_hdr_opt() is called multiple times,
+ *		the total number of bytes will be reserved.
+ *
+ *		This helper can only be called during
+ *		BPF_SOCK_OPS_HDR_OPT_LEN_CB.
+ *
+ *	Return
+ *		0 on success, or negative error in case of failure:
+ *
+ *		**-EINVAL** if param is invalid
+ *
+ *		**-ENOSPC** Not enough remaining space in the header.
+ *
+ *		**-EPERM** This helper cannot be used under the
+ *			   current sock_ops->op.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3533,6 +3641,9 @@ union bpf_attr {
 	FN(skc_to_tcp_request_sock),	\
 	FN(skc_to_udp6_sock),		\
 	FN(get_task_stack),		\
+	FN(load_hdr_opt),		\
+	FN(store_hdr_opt),		\
+	FN(reserve_hdr_opt),
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4152,6 +4263,10 @@ struct bpf_sock_ops {
 	__u64 bytes_received;
 	__u64 bytes_acked;
 	__bpf_md_ptr(struct bpf_sock *, sk);
+	__bpf_md_ptr(void *, skb_data);
+	__bpf_md_ptr(void *, skb_data_end);
+	__u32 skb_len;
+	__u32 skb_tcp_flags;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
@@ -4160,8 +4275,11 @@ enum {
 	BPF_SOCK_OPS_RETRANS_CB_FLAG	= (1<<1),
 	BPF_SOCK_OPS_STATE_CB_FLAG	= (1<<2),
 	BPF_SOCK_OPS_RTT_CB_FLAG	= (1<<3),
+	BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG	= (1<<4),
+	BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5),
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xF,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
 };
 
 /* List of known BPF sock_ops operators.
@@ -4217,6 +4335,63 @@ enum {
 					 */
 	BPF_SOCK_OPS_RTT_CB,		/* Called on every RTT.
 					 */
+	BPF_SOCK_OPS_PARSE_HDR_OPT_CB,	/* Parse the header option.
+					 * It will be called to handle
+					 * the packets received at
+					 * an already established
+					 * connection.
+					 *
+					 * sock_ops->skb_data:
+					 * Referring to the received skb.
+					 * It covers the TCP header only.
+					 *
+					 * bpf_load_hdr_opt() can also
+					 * be used to search for a
+					 * particular option.
+					 */
+	BPF_SOCK_OPS_HDR_OPT_LEN_CB,	/* Reserve space for writing the
+					 * header option later in
+					 * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+					 * Arg1: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 * Not available because no header has
+					 * been	written yet.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 * The tcp_flags of the
+					 * outgoing skb. (e.g. SYN, ACK, FIN).
+					 *
+					 * bpf_reserve_hdr_opt() should
+					 * be used to reserve space.
+					 */
+	BPF_SOCK_OPS_WRITE_HDR_OPT_CB,	/* Write the header options
+					 * Arg1: bool want_cookie. (in
+					 *       writing SYNACK only)
+					 *
+					 * sock_ops->skb_data:
+					 * Referring to the outgoing skb.
+					 * It covers the TCP header
+					 * that has already been written
+					 * by the kernel and the
+					 * earlier bpf-progs.
+					 *
+					 * sock_ops->skb_tcp_flags:
+					 * The tcp_flags of the outgoing
+					 * skb. (e.g. SYN, ACK, FIN).
+					 *
+					 * bpf_store_hdr_opt() should
+					 * be used to write the
+					 * option.
+					 *
+					 * bpf_load_hdr_opt() can also
+					 * be used to search for a
+					 * particular option that
+					 * has already been written
+					 * by the kernel or the
+					 * earlier bpf-progs.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@@ -4246,6 +4421,60 @@ enum {
 	TCP_BPF_SNDCWND_CLAMP	= 1002,	/* Set sndcwnd_clamp */
 	TCP_BPF_DELACK_MAX	= 1003, /* Max delay ack in usecs */
 	TCP_BPF_RTO_MIN		= 1004, /* Min delay ack in usecs */
+	/* Copy the SYN pkt to optval
+	 *
+	 * BPF_PROG_TYPE_SOCK_OPS only.  It is similar to the
+	 * bpf_getsockopt(TCP_SAVED_SYN) but it does not limit
+	 * to only getting from the saved_syn.  It can either get the
+	 * syn packet from:
+	 *
+	 * 1. the just-received SYN packet (only available when writing the
+	 *    SYNACK).  It will be useful when it is not necessary to
+	 *    save the SYN packet for latter use.  It is also the only way
+	 *    to get the SYN during syncookie mode because the syn
+	 *    packet cannot be saved during syncookie.
+	 *
+	 * OR
+	 *
+	 * 2. the earlier saved syn which was done by
+	 *    bpf_setsockopt(TCP_SAVE_SYN).
+	 *
+	 * The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the
+	 * SYN packet is obtained.
+	 *
+	 * If the bpf-prog does not need the IP[46] header,  the
+	 * bpf-prog can avoid parsing the IP header by using
+	 * TCP_BPF_SYN.  Otherwise, the bpf-prog can get both
+	 * IP[46] and TCP header by using TCP_BPF_SYN_IP.
+	 *
+	 *      >0: Total number of bytes copied
+	 * -ENOSPC: Not enough space in optval. Only optlen number of
+	 *          bytes is copied.
+	 * -ENOENT: The SYN skb is not available now and the earlier SYN pkt
+	 *	    is not saved by setsockopt(TCP_SAVE_SYN).
+	 */
+	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
+	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
+};
+
+enum {
+	BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0),
+};
+
+/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and
+ * BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
+ */
+enum {
+	BPF_WRITE_HDR_TCP_CURRENT_MSS = 1,	/* Kernel is finding the
+						 * total option spaces
+						 * required for an established
+						 * sk in order to calculate the
+						 * the MSS.  No skb is actually
+						 * sent.
+						 */
+	BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2,	/* Kernel is in syncookie mode
+						 * when sending a SYN.
+						 */
 };
 
 struct bpf_perf_event_value {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 7/9] bpf: selftests: Add fastopen_connect to network_helpers
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2020-07-30 20:57 ` [PATCH v3 bpf-next 6/9] bpf: tcp: Allow bpf prog to write and parse TCP header option Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 8/9] bpf: selftests: tcp header options Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 9/9] tcp: bpf: Optionally store mac header in TCP_SAVE_SYN Martin KaFai Lau
  8 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch adds a fastopen_connect() helper which will
be used in a latter test.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/network_helpers.c | 37 +++++++++++++++++++
 tools/testing/selftests/bpf/network_helpers.h |  2 +
 2 files changed, 39 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index f56655690f9b..12ee40284da0 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -104,6 +104,43 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
 	return -1;
 }
 
+int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
+		     int timeout_ms)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen = sizeof(addr);
+	struct sockaddr_in *addr_in;
+	int fd, ret;
+
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &addrlen)) {
+		log_err("Failed to get server addr");
+		return -1;
+	}
+
+	addr_in = (struct sockaddr_in *)&addr;
+	fd = socket(addr_in->sin_family, SOCK_STREAM, 0);
+	if (fd < 0) {
+		log_err("Failed to create client socket");
+		return -1;
+	}
+
+	if (settimeo(fd, timeout_ms))
+		goto error_close;
+
+	ret = sendto(fd, data, data_len, MSG_FASTOPEN, (struct sockaddr *)&addr,
+		     addrlen);
+	if (ret != data_len) {
+		log_err("sendto(data, %u) != %d\n", data_len, ret);
+		goto error_close;
+	}
+
+	return fd;
+
+error_close:
+	save_errno_close(fd);
+	return -1;
+}
+
 static int connect_fd_to_addr(int fd,
 			      const struct sockaddr_storage *addr,
 			      socklen_t addrlen)
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index c3728f6667e4..7205f8afdba1 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -37,6 +37,8 @@ int start_server(int family, int type, const char *addr, __u16 port,
 		 int timeout_ms);
 int connect_to_fd(int server_fd, int timeout_ms);
 int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
+int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
+		     int timeout_ms);
 int make_sockaddr(int family, const char *addr_str, __u16 port,
 		  struct sockaddr_storage *addr, socklen_t *len);
 
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 8/9] bpf: selftests: tcp header options
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2020-07-30 20:57 ` [PATCH v3 bpf-next 7/9] bpf: selftests: Add fastopen_connect to network_helpers Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-30 20:57 ` [PATCH v3 bpf-next 9/9] tcp: bpf: Optionally store mac header in TCP_SAVE_SYN Martin KaFai Lau
  8 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch adds tests for the new bpf tcp header option feature.

test_tcp_hdr_options.c:
- It tests header option writing and parsing in 3WHS, normal regular
  connection establishment, fastopen, and syncookie.
- In syncookie, the passive side's bpf prog is asking the active side
  to resend its bpf header option by specifying a RESEND bit in the
  outgoing SYNACK. handle_active_estab() and write_nodata_opt() has
  some details.
- handle_passive_estab() has comments on fastopen.
- It also has test for header writing and parsing in FIN packet.
- Most of the tests is writing an experimental option 254 with magic 0xeB9F.
- The no_exprm_estab() also tests writing a regular TCP option
  without any magic.

test_misc_tcp_options.c:
- It is an one directional test.  Active side writes option and
  passive side parses option.  The focus is to exercise
  the new helpers and API.
- Testing the new helper, bpf_load_hdr_opt() and bpf_store_hdr_opt().
- Testing the bpf_getsockopt(TCP_BPF_SYN).
- Negative tests for the above helpers.
- Testing the sock_ops->skb_data.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../bpf/prog_tests/tcp_hdr_options.c          | 629 +++++++++++++++++
 .../bpf/progs/test_misc_tcp_hdr_options.c     | 338 +++++++++
 .../bpf/progs/test_tcp_hdr_options.c          | 657 ++++++++++++++++++
 .../selftests/bpf/test_tcp_hdr_options.h      | 150 ++++
 4 files changed, 1774 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
 create mode 100644 tools/testing/selftests/bpf/test_tcp_hdr_options.h

diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
new file mode 100644
index 000000000000..82b90aaa918c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c
@@ -0,0 +1,629 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/socket.h>
+#include <linux/compiler.h>
+
+#include "test_progs.h"
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+#include "test_tcp_hdr_options.h"
+#include "test_tcp_hdr_options.skel.h"
+#include "test_misc_tcp_hdr_options.skel.h"
+
+#define LO_ADDR6 "::eB9F"
+#define CG_NAME "/tcpbpf-hdr-opt-test"
+
+struct bpf_test_option exp_passive_estab_in;
+struct bpf_test_option exp_active_estab_in;
+struct bpf_test_option exp_passive_fin_in;
+struct bpf_test_option exp_active_fin_in;
+struct hdr_stg exp_passive_hdr_stg;
+struct hdr_stg exp_active_hdr_stg = { .active = true, };
+
+static struct test_misc_tcp_hdr_options *misc_skel;
+static struct test_tcp_hdr_options *skel;
+static int lport_linum_map_fd;
+static int hdr_stg_map_fd;
+static __u32 duration;
+static int cg_fd;
+
+struct sk_fds {
+	int srv_fd;
+	int passive_fd;
+	int active_fd;
+	int passive_lport;
+	int active_lport;
+};
+
+static int add_lo_addr(int family, const char *addr_str)
+{
+	char ip_addr_cmd[256];
+	int cmdlen;
+
+	if (family == AF_INET6)
+		cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd),
+				  "ip -6 addr add %s/128 dev lo scope host",
+				  addr_str);
+	else
+		cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd),
+				  "ip addr add %s/32 dev lo",
+				  addr_str);
+
+	if (CHECK(cmdlen >= sizeof(ip_addr_cmd), "compile ip cmd",
+		  "failed to add host addr %s to lo. ip cmdlen is too long\n",
+		  addr_str))
+		return -1;
+
+	if (CHECK(system(ip_addr_cmd), "run ip cmd",
+		  "failed to add host addr %s to lo\n", addr_str))
+		return -1;
+
+	return 0;
+}
+
+static int create_netns(void)
+{
+	if (CHECK(unshare(CLONE_NEWNET), "create netns",
+		  "unshare(CLONE_NEWNET): %s (%d)",
+		  strerror(errno), errno))
+		return -1;
+
+	if (CHECK(system("ip link set dev lo up"), "run ip cmd",
+		  "failed to bring lo link up\n"))
+		return -1;
+
+	if (add_lo_addr(AF_INET6, LO_ADDR6))
+		return -1;
+
+	return 0;
+}
+
+static int write_sysctl(const char *sysctl, const char *value)
+{
+	int fd, err, len;
+
+	fd = open(sysctl, O_WRONLY);
+	if (CHECK(fd == -1, "open sysctl", "open(%s): %s (%d)\n",
+		  sysctl, strerror(errno), errno))
+		return -1;
+
+	len = strlen(value);
+	err = write(fd, value, len);
+	close(fd);
+	if (CHECK(err != len, "write sysctl",
+		  "write(%s, %s): err:%d %s (%d)\n",
+		  sysctl, value, err, strerror(errno), errno))
+		return -1;
+
+	return 0;
+}
+
+static void print_hdr_stg(const struct hdr_stg *hdr_stg, const char *prefix)
+{
+	fprintf(stderr, "%s{active:%u, resend_syn:%u, syncookie:%u, fastopen:%u}\n",
+		prefix ? : "", hdr_stg->active, hdr_stg->resend_syn,
+		hdr_stg->syncookie, hdr_stg->fastopen);
+}
+
+static void print_option(const struct bpf_test_option *opt, const char *prefix)
+{
+	fprintf(stderr, "%s{flags:0x%x, max_delack_ms:%u, rand:0x%x}\n",
+		prefix ? : "", opt->flags, opt->max_delack_ms, opt->rand);
+}
+
+static void sk_fds_close(struct sk_fds *sk_fds)
+{
+	close(sk_fds->srv_fd);
+	close(sk_fds->passive_fd);
+	close(sk_fds->active_fd);
+}
+
+static int sk_fds_shutdown(struct sk_fds *sk_fds)
+{
+	int ret, abyte;
+
+	shutdown(sk_fds->active_fd, SHUT_WR);
+	ret = read(sk_fds->passive_fd, &abyte, sizeof(abyte));
+	if (CHECK(ret != 0, "read-after-shutdown(passive_fd):",
+		  "ret:%d %s (%d)\n",
+		  ret, strerror(errno), errno))
+		return -1;
+
+	shutdown(sk_fds->passive_fd, SHUT_WR);
+	ret = read(sk_fds->active_fd, &abyte, sizeof(abyte));
+	if (CHECK(ret != 0, "read-after-shutdown(active_fd):",
+		  "ret:%d %s (%d)\n",
+		  ret, strerror(errno), errno))
+		return -1;
+
+	return 0;
+}
+
+static int sk_fds_connect(struct sk_fds *sk_fds, bool fast_open)
+{
+	const char fast[] = "FAST!!!";
+	struct sockaddr_storage addr;
+	struct sockaddr_in *addr_in;
+	socklen_t len;
+
+	sk_fds->srv_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0);
+	if (CHECK(sk_fds->srv_fd == -1, "start_server", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error;
+
+	if (fast_open)
+		sk_fds->active_fd = fastopen_connect(sk_fds->srv_fd, fast,
+						     sizeof(fast), 0);
+	else
+		sk_fds->active_fd = connect_to_fd(sk_fds->srv_fd, 0);
+
+	if (CHECK_FAIL(sk_fds->active_fd == -1)) {
+		close(sk_fds->srv_fd);
+		goto error;
+	}
+
+	addr_in = (struct sockaddr_in *)&addr;
+	len = sizeof(addr);
+	if (CHECK(getsockname(sk_fds->srv_fd, (struct sockaddr *)&addr,
+			      &len), "getsockname(srv_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+	sk_fds->passive_lport = ntohs(addr_in->sin_port);
+
+	len = sizeof(addr);
+	if (CHECK(getsockname(sk_fds->active_fd, (struct sockaddr *)&addr,
+			      &len), "getsockname(active_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+	sk_fds->active_lport = ntohs(addr_in->sin_port);
+
+	sk_fds->passive_fd = accept(sk_fds->srv_fd, NULL, 0);
+	if (CHECK(sk_fds->passive_fd == -1, "accept(srv_fd)", "%s (%d)\n",
+		  strerror(errno), errno))
+		goto error_close;
+
+	if (fast_open) {
+		char bytes_in[sizeof(fast)];
+		int ret;
+
+		ret = read(sk_fds->passive_fd, bytes_in, sizeof(bytes_in));
+		if (CHECK(ret != sizeof(fast), "read fastopen syn data",
+			  "expected=%lu actual=%d\n", sizeof(fast), ret)) {
+			close(sk_fds->passive_fd);
+			goto error_close;
+		}
+	}
+
+	return 0;
+
+error_close:
+	close(sk_fds->active_fd);
+	close(sk_fds->srv_fd);
+
+error:
+	memset(sk_fds, -1, sizeof(*sk_fds));
+	return -1;
+}
+
+static int check_hdr_opt(const struct bpf_test_option *exp,
+			 const struct bpf_test_option *act,
+			 const char *hdr_desc)
+{
+	if (CHECK(memcmp(exp, act, sizeof(*exp)),
+		  "expected-vs-actual", "unexpected %s\n", hdr_desc)) {
+		print_option(exp, "expected: ");
+		print_option(act, "  actual: ");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int check_hdr_stg(const struct hdr_stg *exp, int fd,
+			 const char *stg_desc)
+{
+	struct hdr_stg act;
+
+	if (CHECK(bpf_map_lookup_elem(hdr_stg_map_fd, &fd, &act),
+		  "map_lookup(hdr_stg_map_fd)", "%s %s (%d)\n",
+		  stg_desc, strerror(errno), errno))
+		return -1;
+
+	if (CHECK(memcmp(exp, &act, sizeof(*exp)),
+		  "expected-vs-actual", "unexpected %s\n", stg_desc)) {
+		print_hdr_stg(exp, "expected: ");
+		print_hdr_stg(&act, "  actual: ");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int check_error_linum(const struct sk_fds *sk_fds)
+{
+	unsigned int nr_errors = 0;
+	struct linum_err linum_err;
+	int lport;
+
+	lport = sk_fds->passive_lport;
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:passive(%d), linum:%u err:%d\n",
+			lport, linum_err.linum, linum_err.err);
+		nr_errors++;
+	}
+
+	lport = sk_fds->active_lport;
+	if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) {
+		fprintf(stderr,
+			"bpf prog error out at lport:active(%d), linum:%u err:%d\n",
+			lport, linum_err.linum, linum_err.err);
+		nr_errors++;
+	}
+
+	return nr_errors;
+}
+
+static void check_hdr_and_close_fds(struct sk_fds *sk_fds)
+{
+	if (sk_fds_shutdown(sk_fds))
+		goto check_linum;
+
+	if (check_hdr_stg(&exp_passive_hdr_stg, sk_fds->passive_fd,
+			  "passive_hdr_stg"))
+		goto check_linum;
+
+	if (check_hdr_stg(&exp_active_hdr_stg, sk_fds->active_fd,
+			  "active_hdr_stg"))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_passive_estab_in, &skel->bss->passive_estab_in,
+			  "passive_estab_in"))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_active_estab_in, &skel->bss->active_estab_in,
+			  "active_estab_in"))
+		goto check_linum;
+
+	if (check_hdr_opt(&exp_passive_fin_in, &skel->bss->passive_fin_in,
+			  "passive_fin_in"))
+		goto check_linum;
+
+	check_hdr_opt(&exp_active_fin_in, &skel->bss->active_fin_in,
+		      "active_fin_in");
+
+check_linum:
+	CHECK_FAIL(check_error_linum(sk_fds));
+	sk_fds_close(sk_fds);
+}
+
+static void prepare_out(void)
+{
+	skel->bss->active_syn_out = exp_passive_estab_in;
+	skel->bss->passive_synack_out = exp_active_estab_in;
+
+	skel->bss->active_fin_out = exp_passive_fin_in;
+	skel->bss->passive_fin_out = exp_active_fin_in;
+}
+
+static void reset_test(void)
+{
+	size_t optsize = sizeof(struct bpf_test_option);
+	int lport, err;
+
+	memset(&skel->bss->passive_synack_out, 0, optsize);
+	memset(&skel->bss->passive_fin_out, 0, optsize);
+
+	memset(&skel->bss->passive_estab_in, 0, optsize);
+	memset(&skel->bss->passive_fin_in, 0, optsize);
+
+	memset(&skel->bss->active_syn_out, 0, optsize);
+	memset(&skel->bss->active_fin_out, 0, optsize);
+
+	memset(&skel->bss->active_estab_in, 0, optsize);
+	memset(&skel->bss->active_fin_in, 0, optsize);
+
+	skel->data->test_kind = TCPOPT_EXP;
+	skel->data->test_magic = 0xeB9F;
+
+	memset(&exp_passive_estab_in, 0, optsize);
+	memset(&exp_active_estab_in, 0, optsize);
+	memset(&exp_passive_fin_in, 0, optsize);
+	memset(&exp_active_fin_in, 0, optsize);
+
+	memset(&exp_passive_hdr_stg, 0, sizeof(exp_passive_hdr_stg));
+	memset(&exp_active_hdr_stg, 0, sizeof(exp_active_hdr_stg));
+	exp_active_hdr_stg.active = true;
+
+	err = bpf_map_get_next_key(lport_linum_map_fd, NULL, &lport);
+	while (!err) {
+		bpf_map_delete_elem(lport_linum_map_fd, &lport);
+		err = bpf_map_get_next_key(lport_linum_map_fd, &lport, &lport);
+	}
+}
+
+static void fastopen_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
+	lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
+
+	exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
+	exp_passive_estab_in.rand = 0xfa;
+	exp_passive_estab_in.max_delack_ms = 11;
+
+	exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
+	exp_active_estab_in.rand = 0xce;
+	exp_active_estab_in.max_delack_ms = 22;
+
+	exp_passive_hdr_stg.fastopen = true;
+
+	prepare_out();
+
+	/* Allow fastopen without fastopen cookie */
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_fastopen", "1543"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, true)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_hdr_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void syncookie_estab(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
+	lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
+
+	exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
+	exp_passive_estab_in.rand = 0xfa;
+	exp_passive_estab_in.max_delack_ms = 11;
+
+	exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS |
+					OPTION_F_RESEND;
+	exp_active_estab_in.rand = 0xce;
+	exp_active_estab_in.max_delack_ms = 22;
+
+	exp_passive_hdr_stg.syncookie = true;
+	exp_active_hdr_stg.resend_syn = true,
+
+	prepare_out();
+
+	/* Clear the RESEND to ensure the bpf prog can learn
+	 * want_cookie and set the RESEND by itself.
+	 */
+	skel->bss->passive_synack_out.flags &= ~OPTION_F_RESEND;
+
+	/* Enforce syncookie mode */
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "2"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_hdr_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void fin(void)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
+	lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
+
+	exp_passive_fin_in.flags = OPTION_F_RAND;
+	exp_passive_fin_in.rand = 0xfa;
+
+	exp_active_fin_in.flags = OPTION_F_RAND;
+	exp_active_fin_in.rand = 0xce;
+
+	prepare_out();
+
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_hdr_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void __simple_estab(bool exprm)
+{
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+
+	hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map);
+	lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map);
+
+	exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
+	exp_passive_estab_in.rand = 0xfa;
+	exp_passive_estab_in.max_delack_ms = 11;
+
+	exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS;
+	exp_active_estab_in.rand = 0xce;
+	exp_active_estab_in.max_delack_ms = 22;
+
+	prepare_out();
+
+	if (!exprm) {
+		skel->data->test_kind = 0xB9;
+		skel->data->test_magic = 0;
+	}
+
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
+		return;
+
+	link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	check_hdr_and_close_fds(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+static void no_exprm_estab(void)
+{
+	__simple_estab(false);
+}
+
+static void simple_estab(void)
+{
+	__simple_estab(true);
+}
+
+static void misc(void)
+{
+	const char send_msg[] = "MISC!!!";
+	char recv_msg[sizeof(send_msg)];
+	const unsigned int nr_data = 2;
+	struct bpf_link *link;
+	struct sk_fds sk_fds;
+	int i, ret;
+
+	lport_linum_map_fd = bpf_map__fd(misc_skel->maps.lport_linum_map);
+
+	if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1"))
+		return;
+
+	link = bpf_program__attach_cgroup(misc_skel->progs.misc_estab, cg_fd);
+	if (CHECK(IS_ERR(link), "attach_cgroup(misc_estab)", "err: %ld\n",
+		  PTR_ERR(link)))
+		return;
+
+	if (sk_fds_connect(&sk_fds, false)) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	for (i = 0; i < nr_data; i++) {
+		/* MSG_EOR to ensure skb will not be combined */
+		ret = send(sk_fds.active_fd, send_msg, sizeof(send_msg),
+			   MSG_EOR);
+		if (CHECK(ret != sizeof(send_msg), "send(msg)", "ret:%d\n",
+			  ret))
+			goto check_linum;
+
+		ret = read(sk_fds.passive_fd, recv_msg, sizeof(recv_msg));
+		if (CHECK(ret != sizeof(send_msg), "recv(msg)", "ret:%d\n",
+			  ret))
+			goto check_linum;
+	}
+
+	if (sk_fds_shutdown(&sk_fds))
+		goto check_linum;
+
+	CHECK(misc_skel->bss->nr_syn != 1, "unexpected nr_syn",
+	      "expected (1) != actual (%u)\n",
+		misc_skel->bss->nr_syn);
+
+	CHECK(misc_skel->bss->nr_data != nr_data, "unexpected nr_data",
+	      "expected (%u) != actual (%u)\n",
+	      nr_data, misc_skel->bss->nr_data);
+
+	/* The last ACK may have been delayed, so it is either 1 or 2. */
+	CHECK(misc_skel->bss->nr_pure_ack != 1 &&
+	      misc_skel->bss->nr_pure_ack != 2,
+	      "unexpected nr_pure_ack",
+	      "expected (1 or 2) != actual (%u)\n",
+		misc_skel->bss->nr_pure_ack);
+
+	CHECK(misc_skel->bss->nr_fin != 1, "unexpected nr_fin",
+	      "expected (1) != actual (%u)\n",
+	      misc_skel->bss->nr_fin);
+
+check_linum:
+	CHECK_FAIL(check_error_linum(&sk_fds));
+	sk_fds_close(&sk_fds);
+	bpf_link__destroy(link);
+}
+
+struct test {
+	const char *desc;
+	void (*run)(void);
+};
+
+#define DEF_TEST(name) { #name, name }
+static struct test tests[] = {
+	DEF_TEST(simple_estab),
+	DEF_TEST(no_exprm_estab),
+	DEF_TEST(syncookie_estab),
+	DEF_TEST(fastopen_estab),
+	DEF_TEST(fin),
+	DEF_TEST(misc),
+};
+
+void test_tcp_hdr_options(void)
+{
+	int i;
+
+	skel = test_tcp_hdr_options__open_and_load();
+	if (CHECK(!skel, "open and load skel", "failed"))
+		return;
+
+	misc_skel = test_misc_tcp_hdr_options__open_and_load();
+	if (CHECK(!misc_skel, "open and load misc test skel", "failed"))
+		goto skel_destroy;
+
+	cg_fd = test__join_cgroup(CG_NAME);
+	if (CHECK_FAIL(cg_fd < 0))
+		goto skel_destroy;
+
+	for (i = 0; i < ARRAY_SIZE(tests); i++) {
+		if (!test__start_subtest(tests[i].desc))
+			continue;
+
+		if (create_netns())
+			break;
+
+		tests[i].run();
+
+		reset_test();
+	}
+
+	close(cg_fd);
+skel_destroy:
+	test_misc_tcp_hdr_options__destroy(misc_skel);
+	test_tcp_hdr_options__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
new file mode 100644
index 000000000000..6c33fb8a34d4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c
@@ -0,0 +1,338 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include <stddef.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#define BPF_PROG_TEST_TCP_HDR_OPTIONS
+#include "test_tcp_hdr_options.h"
+
+__u16 last_addr16_n = __bpf_htons(0xeB9F);
+__u16 active_lport_n = 0;
+__u16 active_lport_h = 0;
+__u16 passive_lport_n = 0;
+__u16 passive_lport_h = 0;
+
+/* options received at passive side */
+unsigned int nr_pure_ack = 0;
+unsigned int nr_data = 0;
+unsigned int nr_syn = 0;
+unsigned int nr_fin = 0;
+
+/* Check the header received from the active side */
+static __always_inline int __check_active_hdr_in(struct bpf_sock_ops *skops,
+						 bool check_syn)
+{
+	union {
+		struct tcphdr th;
+		struct ipv6hdr ip6;
+		struct tcp_exprm_opt exprm_opt;
+		struct tcp_opt reg_opt;
+		__u8 data[100]; /* IPv6 (40) + Max TCP hdr (60) */
+	} hdr = {};
+	__u64 load_flags = check_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0;
+	struct tcphdr *pth;
+	int err, ret;
+
+	hdr.reg_opt.kind = 0xB9;
+
+	/* The option is 4 bytes long instead of 2 bytes */
+	ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, 2, load_flags);
+	if (ret != -ENOSPC)
+		RET_CG_ERR(ret);
+
+	/* Test searching magic with regular kind */
+	hdr.reg_opt.len = 4;
+	ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt),
+			       load_flags);
+	if (ret != -EINVAL)
+		RET_CG_ERR(ret);
+
+	hdr.reg_opt.len = 0;
+	ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt),
+			       load_flags);
+	if (ret != 4 || hdr.reg_opt.len != 4 || hdr.reg_opt.kind != 0xB9 ||
+	    hdr.reg_opt.data[0] != 0xfa || hdr.reg_opt.data[1] != 0xce)
+		RET_CG_ERR(ret);
+
+	/* Test searching experimental option with invalid kind length */
+	hdr.exprm_opt.kind = TCPOPT_EXP;
+	hdr.exprm_opt.len = 5;
+	hdr.exprm_opt.magic = 0;
+	ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
+			       load_flags);
+	if (ret != -EINVAL)
+		RET_CG_ERR(ret);
+
+	/* Test searching experimental option with 0 magic value */
+	hdr.exprm_opt.len = 4;
+	ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
+			       load_flags);
+	if (ret != -ENOMSG)
+		RET_CG_ERR(ret);
+
+	hdr.exprm_opt.magic = __bpf_htons(0xeB9F);
+	ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt),
+			       load_flags);
+	if (ret != 4 || hdr.exprm_opt.len != 4 ||
+	    hdr.exprm_opt.kind != TCPOPT_EXP ||
+	    hdr.exprm_opt.magic != __bpf_htons(0xeB9F))
+		RET_CG_ERR(ret);
+
+	if (!check_syn)
+		return CG_OK;
+
+	/* Test loading from skops->syn_skb if sk_state == TCP_NEW_SYN_RECV
+	 * (i.e. skops->sk is a req. and it means it received a syn
+	 *       and writing synack).
+	 *
+	 * Test loading from tp->saved_syn for other sk_state.
+	 */
+	ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr.ip6,
+			     sizeof(hdr.ip6));
+	if (ret != -ENOSPC)
+		RET_CG_ERR(ret);
+
+	if (hdr.ip6.saddr.s6_addr16[7] != last_addr16_n ||
+	    hdr.ip6.daddr.s6_addr16[7] != last_addr16_n)
+		RET_CG_ERR(0);
+
+	ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr, sizeof(hdr));
+	if (ret < 0)
+		RET_CG_ERR(ret);
+
+	pth = (struct tcphdr *)(&hdr.ip6 + 1);
+	if (pth->dest != passive_lport_n || pth->source != active_lport_n)
+		RET_CG_ERR(0);
+
+	ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN, &hdr, sizeof(hdr));
+	if (ret < 0)
+		RET_CG_ERR(ret);
+
+	if (hdr.th.dest != passive_lport_n || hdr.th.source != active_lport_n)
+		RET_CG_ERR(0);
+
+	return CG_OK;
+}
+
+static __always_inline int check_active_syn_in(struct bpf_sock_ops *skops)
+{
+	return __check_active_hdr_in(skops, true);
+}
+
+static __always_inline int check_active_hdr_in(struct bpf_sock_ops *skops)
+{
+	struct tcphdr *th;
+
+	if (__check_active_hdr_in(skops, false) == CG_ERR)
+		return CG_ERR;
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(0);
+
+	if (tcp_hdrlen(th) < skops->skb_len)
+		nr_data++;
+
+	if (th->fin)
+		nr_fin++;
+
+	if (th->ack && !th->fin && tcp_hdrlen(th) == skops->skb_len)
+		nr_pure_ack++;
+
+	return CG_OK;
+}
+
+static __always_inline int active_opt_len(struct bpf_sock_ops *skops)
+{
+	int err;
+
+	/* Reserve more than enough to allow the -EEXIST test in
+	 * the write_active_opt().
+	 */
+	err = bpf_reserve_hdr_opt(skops, 12, 0);
+	if (err)
+		RET_CG_ERR(err);
+
+	return CG_OK;
+}
+
+static __always_inline int write_active_opt(struct bpf_sock_ops *skops)
+{
+	struct tcp_exprm_opt exprm_opt = {};
+	struct tcp_opt win_scale_opt = {};
+	struct tcp_opt reg_opt = {};
+	struct tcphdr *th;
+	int err, ret;
+
+	exprm_opt.kind = TCPOPT_EXP;
+	exprm_opt.len = 4;
+	exprm_opt.magic = __bpf_htons(0xeB9F);
+
+	reg_opt.kind = 0xB9;
+	reg_opt.len = 4;
+	reg_opt.data[0] = 0xfa;
+	reg_opt.data[1] = 0xce;
+
+	win_scale_opt.kind = TCPOPT_WINDOW;
+
+	err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
+	if (err)
+		RET_CG_ERR(err);
+
+	/* Store the same exprm option */
+	err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
+	if (err != -EEXIST)
+		RET_CG_ERR(err);
+
+	err = bpf_store_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
+	if (err)
+		RET_CG_ERR(err);
+	err = bpf_store_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
+	if (err != -EEXIST)
+		RET_CG_ERR(err);
+
+	/* Check the option has been written and can be searched */
+	ret = bpf_load_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0);
+	if (ret != 4 || exprm_opt.len != 4 || exprm_opt.kind != TCPOPT_EXP ||
+	    exprm_opt.magic != __bpf_htons(0xeB9F))
+		RET_CG_ERR(ret);
+
+	reg_opt.len = 0;
+	ret = bpf_load_hdr_opt(skops, &reg_opt, sizeof(reg_opt), 0);
+	if (ret != 4 || reg_opt.len != 4 || reg_opt.kind != 0xB9 ||
+	    reg_opt.data[0] != 0xfa || reg_opt.data[1] != 0xce)
+		RET_CG_ERR(ret);
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(0);
+
+	if (th->syn) {
+		active_lport_h = skops->local_port;
+		active_lport_n = th->source;
+
+		/* Search the win scale option written by kernel
+		 * in the SYN packet.
+		 */
+		ret = bpf_load_hdr_opt(skops, &win_scale_opt,
+				       sizeof(win_scale_opt), 0);
+		if (ret != 3 || win_scale_opt.len != 3 ||
+		    win_scale_opt.kind != TCPOPT_WINDOW)
+			RET_CG_ERR(ret);
+
+		/* Write the win scale option that kernel
+		 * has already written.
+		 */
+		err = bpf_store_hdr_opt(skops, &win_scale_opt,
+					sizeof(win_scale_opt), 0);
+		if (err != -EEXIST)
+			RET_CG_ERR(err);
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int handle_hdr_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		/* Check the SYN from bpf_sock_ops_kern->syn_skb */
+		return check_active_syn_in(skops);
+
+	/* Passive side should have cleared the write hdr cb by now */
+	if (skops->local_port == passive_lport_h)
+		RET_CG_ERR(0);
+
+	return active_opt_len(skops);
+}
+
+static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+
+	/* Hdr space has not been reserved on passive-side in
+	 * handle_hdr_opt_len(), so we should not have been called.
+	 */
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		RET_CG_ERR(0);
+
+	if (skops->local_port == passive_lport_h)
+		RET_CG_ERR(0);
+
+	return write_active_opt(skops);
+}
+
+static __always_inline int handle_parse_hdr(struct bpf_sock_ops *skops)
+{
+	struct bpf_sock *sk = skops->sk;
+
+	/* Passive side is not writing any non-standard/unknown
+	 * option, so the active side should never be called.
+	 */
+	if (skops->local_port == active_lport_h)
+		RET_CG_ERR(0);
+
+	return check_active_hdr_in(skops);
+}
+
+static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops)
+{
+	int err;
+
+	/* No more write hdr cb */
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags &
+				  ~BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+
+	/* Recheck the SYN but check the tp->saved_syn this time */
+	err = check_active_syn_in(skops);
+	if (err == CG_ERR)
+		return err;
+
+	nr_syn++;
+
+	/* The ack has header option written by the active side also */
+	return check_active_hdr_in(skops);
+}
+
+SEC("sockops/misc_estab")
+int misc_estab(struct bpf_sock_ops *skops)
+{
+	int op, optlen, true_val = 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+		passive_lport_h = skops->local_port;
+		passive_lport_n = __bpf_htons(passive_lport_h);
+		bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
+			       &true_val, sizeof(true_val));
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
+		return handle_parse_hdr(skops);
+	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
+		return handle_hdr_opt_len(skops);
+	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
+		return handle_write_hdr_opt(skops);
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		return handle_passive_estab(skops);
+	}
+
+	return CG_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
new file mode 100644
index 000000000000..e745ce480106
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c
@@ -0,0 +1,657 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+
+#include <stddef.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
+#include <linux/bpf.h>
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#define BPF_PROG_TEST_TCP_HDR_OPTIONS
+#include "test_tcp_hdr_options.h"
+
+#ifndef sizeof_field
+#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+__u8 test_kind = TCPOPT_EXP;
+__u16 test_magic = 0xeB9F;
+
+struct bpf_test_option passive_synack_out = {};
+struct bpf_test_option passive_fin_out	= {};
+
+struct bpf_test_option passive_estab_in = {};
+struct bpf_test_option passive_fin_in	= {};
+
+struct bpf_test_option active_syn_out	= {};
+struct bpf_test_option active_fin_out	= {};
+
+struct bpf_test_option active_estab_in	= {};
+struct bpf_test_option active_fin_in	= {};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct hdr_stg);
+} hdr_stg_map SEC(".maps");
+
+static __always_inline bool skops_want_cookie(const struct bpf_sock_ops *skops)
+{
+	return skops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE;
+}
+
+static __always_inline bool skops_current_mss(const struct bpf_sock_ops *skops)
+{
+	return skops->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS;
+}
+
+static __always_inline __u8 option_total_len(__u8 flags)
+{
+	__u8 i, len = 1; /* +1 for flags */
+
+	if (!flags)
+		return 0;
+
+	/* RESEND bit does not use a byte */
+	for (i = OPTION_RESEND + 1; i < __NR_OPTION_FLAGS; i++)
+		len += !!TEST_OPTION_FLAGS(flags, i);
+
+	if (test_kind == TCPOPT_EXP)
+		return len + TCP_BPF_EXPOPT_BASE_LEN;
+	else
+		return len + 2;
+}
+
+static __always_inline
+void write_test_option(const struct bpf_test_option *test_opt,
+		       __u8 *data)
+{
+	__u8 offset = 0;
+
+	data[offset++] = test_opt->flags;
+	if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_MAX_DELACK_MS))
+		data[offset++] = test_opt->max_delack_ms;
+
+	if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_RAND))
+		data[offset++] = test_opt->rand;
+}
+
+static __always_inline int store_option(struct bpf_sock_ops *skops,
+					const struct bpf_test_option *test_opt)
+{
+	union {
+		struct tcp_exprm_opt exprm;
+		struct tcp_opt regular;
+	} write_opt;
+	__u8 flags = test_opt->flags, offset = 0;
+	__u8 *data;
+	int err;
+
+	if (test_kind == TCPOPT_EXP) {
+		write_opt.exprm.kind = TCPOPT_EXP;
+		write_opt.exprm.len = option_total_len(test_opt->flags);
+		write_opt.exprm.magic = __bpf_htons(test_magic);
+		write_opt.exprm.data32 = 0;
+		write_test_option(test_opt, write_opt.exprm.data);
+		err = bpf_store_hdr_opt(skops, &write_opt.exprm,
+					sizeof(write_opt.exprm), 0);
+	} else {
+		write_opt.regular.kind = test_kind;
+		write_opt.regular.len = option_total_len(test_opt->flags);
+		write_opt.regular.data32 = 0;
+		write_test_option(test_opt, write_opt.regular.data);
+		err = bpf_store_hdr_opt(skops, &write_opt.regular,
+					sizeof(write_opt.regular), 0);
+	}
+
+	if (err)
+		RET_CG_ERR(err);
+
+	return CG_OK;
+}
+
+static __always_inline int parse_test_option(struct bpf_test_option *opt,
+					     const __u8 *start, const __u8 *end)
+{
+	__u8 flags;
+
+	if (start + sizeof(opt->flags) > end)
+		return -EINVAL;
+
+	flags = *start;
+	opt->flags = flags;
+	start += sizeof(opt->flags);
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_MAX_DELACK_MS)) {
+		if (start + sizeof(opt->max_delack_ms) > end)
+			return -EINVAL;
+		opt->max_delack_ms = *(typeof(opt->max_delack_ms) *)start;
+		start += sizeof(opt->max_delack_ms);
+	}
+
+	if (TEST_OPTION_FLAGS(flags, OPTION_RAND)) {
+		if (start + sizeof(opt->rand) > end)
+			return -EINVAL;
+		opt->rand = *(typeof(opt->rand) *)start;
+		start += sizeof(opt->rand);
+	}
+
+	return 0;
+}
+
+static __always_inline int load_option(struct bpf_sock_ops *skops,
+				       struct bpf_test_option *test_opt,
+				       bool from_syn)
+{
+	union {
+		struct tcp_exprm_opt exprm;
+		struct tcp_opt regular;
+	} search_opt;
+	int ret, load_flags = from_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0;
+
+	if (test_kind == TCPOPT_EXP) {
+		search_opt.exprm.kind = TCPOPT_EXP;
+		search_opt.exprm.len = 4;
+		search_opt.exprm.magic = __bpf_htons(test_magic);
+		search_opt.exprm.data32 = 0;
+		ret = bpf_load_hdr_opt(skops, &search_opt.exprm,
+				       sizeof(search_opt.exprm), load_flags);
+		if (ret < 0)
+			return ret;
+		return parse_test_option(test_opt, search_opt.exprm.data,
+					 search_opt.exprm.data +
+					 sizeof(search_opt.exprm.data));
+	} else {
+		search_opt.regular.kind = test_kind;
+		search_opt.regular.len = 0;
+		search_opt.regular.data32 = 0;
+		ret = bpf_load_hdr_opt(skops, &search_opt.regular,
+				       sizeof(search_opt.regular), load_flags);
+		if (ret < 0)
+			return ret;
+		return parse_test_option(test_opt, search_opt.regular.data,
+					 search_opt.regular.data +
+					 sizeof(search_opt.regular.data));
+	}
+}
+
+static __always_inline int synack_opt_len(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option test_opt = {};
+	__u8 optlen, flags;
+	int err;
+
+	if (!passive_synack_out.flags)
+		return CG_OK;
+
+	err = load_option(skops, &test_opt, true);
+
+	/* bpf_test_option is not found */
+	if (err == -ENOMSG)
+		return CG_OK;
+
+	if (err)
+		RET_CG_ERR(err);
+
+	flags = passive_synack_out.flags;
+	if (skops_want_cookie(skops))
+		SET_OPTION_FLAGS(flags, OPTION_RESEND);
+
+	optlen = option_total_len(flags);
+	if (optlen) {
+		err = bpf_reserve_hdr_opt(skops, optlen, 0);
+		if (err)
+			RET_CG_ERR(err);
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int write_synack_opt(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option opt;
+	__u8 exp_kind;
+	int err;
+
+	if (!passive_synack_out.flags)
+		/* We should not even be called since no header
+		 * space has been reserved.
+		 */
+		RET_CG_ERR(0);
+
+	opt = passive_synack_out;
+	if (skops_want_cookie(skops))
+		SET_OPTION_FLAGS(opt.flags, OPTION_RESEND);
+
+	return store_option(skops, &opt);
+}
+
+static __always_inline int syn_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 optlen;
+	int err;
+
+	if (!active_syn_out.flags)
+		return CG_OK;
+
+	optlen = option_total_len(active_syn_out.flags);
+	if (optlen) {
+		err = bpf_reserve_hdr_opt(skops, optlen, 0);
+		if (err)
+			RET_CG_ERR(err);
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int write_syn_opt(struct bpf_sock_ops *skops)
+{
+	if (!active_syn_out.flags)
+		return CG_OK;
+
+	return store_option(skops, &active_syn_out);
+}
+
+static __always_inline int fin_opt_len(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option *opt;
+	struct hdr_stg *hdr_stg;
+	__u8 optlen;
+	int err;
+
+	if (!skops->sk)
+		RET_CG_ERR(0);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(0);
+
+	if (hdr_stg->active)
+		opt = &active_fin_out;
+	else
+		opt = &passive_fin_out;
+
+	optlen = option_total_len(opt->flags);
+	if (optlen) {
+		err = bpf_reserve_hdr_opt(skops, optlen, 0);
+		if (err)
+			RET_CG_ERR(err);
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int write_fin_opt(struct bpf_sock_ops *skops)
+{
+	struct bpf_test_option *opt;
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		RET_CG_ERR(0);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(0);
+
+	return store_option(skops, hdr_stg->active ?
+			    &active_fin_out : &passive_fin_out);
+}
+
+static __always_inline int resend_in_ack(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+
+	if (!skops->sk)
+		return -1;
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		return -1;
+
+	return !!hdr_stg->resend_syn;
+}
+
+static __always_inline int nodata_opt_len(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+	int resend;
+
+	resend = resend_in_ack(skops);
+	if (resend < 0)
+		RET_CG_ERR(0);
+
+	if (resend)
+		return syn_opt_len(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int write_nodata_opt(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+	int resend;
+
+	resend = resend_in_ack(skops);
+	if (resend < 0)
+		RET_CG_ERR(0);
+
+	if (resend)
+		return write_syn_opt(skops);
+
+	return CG_OK;
+}
+
+static __always_inline int data_opt_len(struct bpf_sock_ops *skops)
+{
+	return nodata_opt_len(skops);
+}
+
+static __always_inline int write_data_opt(struct bpf_sock_ops *skops)
+{
+	return write_nodata_opt(skops);
+}
+
+static __always_inline int current_mss_opt_len(struct bpf_sock_ops *skops)
+{
+	/* Reserve maximum that may be needed */
+	int err;
+
+	err = bpf_reserve_hdr_opt(skops, option_total_len(OPTION_MASK), 0);
+	if (err)
+		RET_CG_ERR(err);
+
+	return CG_OK;
+}
+
+static __always_inline int handle_hdr_opt_len(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		return synack_opt_len(skops);
+
+	if (tcp_flags & TCPHDR_SYN)
+		return syn_opt_len(skops);
+
+	if (tcp_flags & TCPHDR_FIN)
+		return fin_opt_len(skops);
+
+	if (skops_current_mss(skops))
+		/* The kernel is calculating the MSS */
+		return current_mss_opt_len(skops);
+
+	if (skops->skb_len)
+		return data_opt_len(skops);
+
+	return nodata_opt_len(skops);
+}
+
+static __always_inline int handle_write_hdr_opt(struct bpf_sock_ops *skops)
+{
+	__u8 tcp_flags = skops_tcp_flags(skops);
+	struct tcphdr *th;
+	int err;
+
+	if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK)
+		return write_synack_opt(skops);
+
+	if (tcp_flags & TCPHDR_SYN)
+		return write_syn_opt(skops);
+
+	if (tcp_flags & TCPHDR_FIN)
+		return write_fin_opt(skops);
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(0);
+
+	if (skops->skb_len > tcp_hdrlen(th))
+		return write_data_opt(skops);
+
+	return write_nodata_opt(skops);
+}
+
+static __always_inline int set_delack_max(struct bpf_sock_ops *skops,
+					  __u8 max_delack_ms)
+{
+	__u32 max_delack_us = max_delack_ms * 1000;
+
+	return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX,
+			      &max_delack_us, sizeof(max_delack_us));
+}
+
+static __always_inline int set_rto_min(struct bpf_sock_ops *skops,
+				       __u8 peer_max_delack_ms)
+{
+	__u32 min_rto_us = peer_max_delack_ms * 1000;
+
+	return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN, &min_rto_us,
+			      sizeof(min_rto_us));
+}
+
+static __always_inline int handle_active_estab(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *stg, init_stg = {
+		.active = true,
+	};
+	int err;
+
+	err = load_option(skops, &active_estab_in, false);
+	if (err && err != -ENOMSG)
+		RET_CG_ERR(err);
+
+update_sk_stg:
+	init_stg.resend_syn = TEST_OPTION_FLAGS(active_estab_in.flags,
+						OPTION_RESEND);
+	if (!skops->sk || !bpf_sk_storage_get(&hdr_stg_map, skops->sk,
+					      &init_stg,
+					      BPF_SK_STORAGE_GET_F_CREATE))
+		RET_CG_ERR(0);
+
+	if (init_stg.resend_syn)
+		/* Don't clear the write_hdr cb now because
+		 * the ACK may get lost and retransmit may
+		 * be needed.
+		 *
+		 * PARSE_ALL_HDR cb flag is set to learn if this
+		 * resend_syn option has received by the peer.
+		 *
+		 * The header option will be resent until a valid
+		 * packet is received at handle_parse_hdr()
+		 * and all hdr cb flags will be cleared in
+		 * handle_parse_hdr().
+		 */
+		set_parse_all_hdr_cb_flags(skops);
+	else if (!active_fin_out.flags)
+		/* No options will be written from now */
+		clear_hdr_cb_flags(skops);
+
+	if (active_syn_out.max_delack_ms) {
+		err = set_delack_max(skops, active_syn_out.max_delack_ms);
+		if (err)
+			RET_CG_ERR(err);
+	}
+
+	if (active_estab_in.max_delack_ms) {
+		err = set_rto_min(skops, active_estab_in.max_delack_ms);
+		if (err)
+			RET_CG_ERR(err);
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int handle_passive_estab(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+	bool syncookie = false;
+	__u8 *start, *end, len;
+	struct tcphdr *th;
+	int err;
+
+	err = load_option(skops, &passive_estab_in, true);
+	if (err == -ENOENT) {
+		/* saved_syn is not found. It was in syncookie mode.
+		 * We have asked the active side to resend the options
+		 * in ACK, so try to find the bpf_test_option from ACK now.
+		 */
+		err = load_option(skops, &passive_estab_in, false);
+		syncookie = true;
+	}
+
+	/* ENOMSG: The bpf_test_option is not found which is fine.
+	 * Bail out now for all other errors.
+	 */
+	if (err && err != -ENOMSG)
+		RET_CG_ERR(err);
+
+	if (!skops->sk)
+		RET_CG_ERR(0);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!hdr_stg)
+		RET_CG_ERR(0);
+	hdr_stg->syncookie = syncookie;
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(0);
+
+	if (th->syn) {
+		/* Fastopen */
+
+		/* Cannot clear cb_flags to stop write_hdr cb.
+		 * synack is not sent yet for fast open.
+		 * Even it was, the synack may need to be retransmitted.
+		 *
+		 * PARSE_ALL_HDR cb flag is set to learn
+		 * if synack has reached the peer.
+		 * All cb_flags will be cleared in handle_parse_hdr().
+		 */
+		set_parse_all_hdr_cb_flags(skops);
+		hdr_stg->fastopen = true;
+	} else if (!passive_fin_out.flags) {
+		/* No options will be written from now */
+		clear_hdr_cb_flags(skops);
+	}
+
+	if (passive_synack_out.max_delack_ms) {
+		err = set_delack_max(skops, passive_synack_out.max_delack_ms);
+		if (err)
+			RET_CG_ERR(err);
+	}
+
+	if (passive_estab_in.max_delack_ms) {
+		err = set_rto_min(skops, passive_estab_in.max_delack_ms);
+		if (err)
+			RET_CG_ERR(err);
+	}
+
+	return CG_OK;
+}
+
+static __always_inline int handle_parse_hdr(struct bpf_sock_ops *skops)
+{
+	struct hdr_stg *hdr_stg;
+	struct tcphdr *th;
+	__u32 cb_flags;
+
+	if (!skops->sk)
+		RET_CG_ERR(0);
+
+	th = skops->skb_data;
+	if (th + 1 > skops->skb_data_end)
+		RET_CG_ERR(0);
+
+	hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0);
+	if (!hdr_stg)
+		RET_CG_ERR(0);
+
+	if (hdr_stg->resend_syn || hdr_stg->fastopen)
+		/* The PARSE_ALL_HDR cb flag was turned on
+		 * to ensure that the previously written
+		 * options have reached the peer.
+		 * Those previously written option includes:
+		 *     - Active side: resend_syn in ACK during syncookie
+		 *      or
+		 *     - Passive side: SYNACK during fastopen
+		 *
+		 * A valid packet has been received here after
+		 * the 3WHS, so the PARSE_ALL_HDR cb flag
+		 * can be cleared now.
+		 */
+		clear_parse_all_hdr_cb_flags(skops);
+
+	if (hdr_stg->resend_syn && !active_fin_out.flags)
+		/* Active side resent the syn option in ACK
+		 * because the server was in syncookie mode.
+		 * A valid packet has been received, so
+		 * clear header cb flags if there is no
+		 * more option to send.
+		 */
+		clear_hdr_cb_flags(skops);
+
+	if (hdr_stg->fastopen && !passive_fin_out.flags)
+		/* Passive side was in fastopen.
+		 * A valid packet has been received, so
+		 * the SYNACK has reached the peer.
+		 * Clear header cb flags if there is no more
+		 * option to send.
+		 */
+		clear_hdr_cb_flags(skops);
+
+	if (th->fin) {
+		struct bpf_test_option *fin_opt;
+		__u8 *kind_start;
+		int err;
+
+		if (hdr_stg->active)
+			fin_opt = &active_fin_in;
+		else
+			fin_opt = &passive_fin_in;
+
+		err = load_option(skops, fin_opt, false);
+		if (err && err != -ENOMSG)
+			RET_CG_ERR(err);
+	}
+
+	return CG_OK;
+}
+
+SEC("sockops/estab")
+int estab(struct bpf_sock_ops *skops)
+{
+	int op, optlen, true_val = 1;
+
+	switch (skops->op) {
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+		bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
+			       &true_val, sizeof(true_val));
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		set_hdr_cb_flags(skops);
+		break;
+	case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
+		return handle_parse_hdr(skops);
+	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
+		return handle_hdr_opt_len(skops);
+	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
+		return handle_write_hdr_opt(skops);
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		return handle_passive_estab(skops);
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		return handle_active_estab(skops);
+	}
+
+	return CG_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
new file mode 100644
index 000000000000..9aa19cc87146
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
@@ -0,0 +1,150 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _TEST_TCP_HDR_OPTIONS_H
+#define _TEST_TCP_HDR_OPTIONS_H
+
+struct bpf_test_option {
+	__u8 flags;
+	__u8 max_delack_ms;
+	__u8 rand;
+} __attribute__((packed));
+
+enum {
+	OPTION_RESEND,
+	OPTION_MAX_DELACK_MS,
+	OPTION_RAND,
+	__NR_OPTION_FLAGS,
+};
+
+#define OPTION_F_RESEND		(1 << OPTION_RESEND)
+#define OPTION_F_MAX_DELACK_MS	(1 << OPTION_MAX_DELACK_MS)
+#define OPTION_F_RAND		(1 << OPTION_RAND)
+#define OPTION_MASK		((1 << __NR_OPTION_FLAGS) - 1)
+
+#define TEST_OPTION_FLAGS(flags, option) (1 & ((flags) >> (option)))
+#define SET_OPTION_FLAGS(flags, option)	((flags) |= (1 << (option)))
+
+/* Store in bpf_sk_storage */
+struct hdr_stg {
+	bool active;
+	bool resend_syn; /* active side only */
+	bool syncookie;  /* passive side only */
+	bool fastopen;	/* passive side only */
+};
+
+struct linum_err {
+	unsigned int linum;
+	int err;
+};
+
+#define TCPHDR_FIN 0x01
+#define TCPHDR_SYN 0x02
+#define TCPHDR_RST 0x04
+#define TCPHDR_PSH 0x08
+#define TCPHDR_ACK 0x10
+#define TCPHDR_URG 0x20
+#define TCPHDR_ECE 0x40
+#define TCPHDR_CWR 0x80
+#define TCPHDR_SYNACK (TCPHDR_SYN | TCPHDR_ACK)
+
+#define TCPOPT_EOL		0
+#define TCPOPT_NOP		1
+#define TCPOPT_WINDOW		3
+#define TCPOPT_EXP		254
+
+#define TCP_BPF_EXPOPT_BASE_LEN 4
+#define MAX_TCP_HDR_LEN		60
+#define MAX_TCP_OPTION_SPACE	40
+
+#ifdef BPF_PROG_TEST_TCP_HDR_OPTIONS
+
+#define CG_OK	1
+#define CG_ERR	0
+
+#ifndef SOL_TCP
+#define SOL_TCP 6
+#endif
+
+struct tcp_exprm_opt {
+	__u8 kind;
+	__u8 len;
+	__u16 magic;
+	union {
+		__u8 data[4];
+		__u32 data32;
+	};
+} __attribute__((packed));
+
+struct tcp_opt {
+	__u8 kind;
+	__u8 len;
+	union {
+		__u8 data[4];
+		__u32 data32;
+	};
+} __attribute__((packed));
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, struct linum_err);
+} lport_linum_map SEC(".maps");
+
+static inline unsigned int tcp_hdrlen(const struct tcphdr *th)
+{
+	return th->doff << 2;
+}
+
+static inline __u8 skops_tcp_flags(const struct bpf_sock_ops *skops)
+{
+	return skops->skb_tcp_flags;
+}
+
+static inline void clear_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags &
+				  ~(BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
+				    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG));
+}
+
+static inline void set_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags |
+				  BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG |
+				  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
+}
+static inline void
+clear_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags &
+				  ~BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
+}
+
+static inline void
+set_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops)
+{
+	bpf_sock_ops_cb_flags_set(skops,
+				  skops->bpf_sock_ops_cb_flags |
+				  BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
+}
+
+#define RET_CG_ERR(__err) ({				\
+	struct linum_err __linum_err;		\
+	int __lport;				\
+						\
+	__linum_err.linum = __LINE__;		\
+	__linum_err.err = __err;		\
+	__lport = skops->local_port;		\
+	bpf_map_update_elem(&lport_linum_map, &__lport, &__linum_err, BPF_NOEXIST); \
+	clear_hdr_cb_flags(skops);					\
+	clear_parse_all_hdr_cb_flags(skops);				\
+	return CG_ERR;							\
+})
+
+#endif /* BPF_PROG_TEST_TCP_HDR_OPTIONS */
+
+#endif /* _TEST_TCP_HDR_OPTIONS_H */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 bpf-next 9/9] tcp: bpf: Optionally store mac header in TCP_SAVE_SYN
  2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2020-07-30 20:57 ` [PATCH v3 bpf-next 8/9] bpf: selftests: tcp header options Martin KaFai Lau
@ 2020-07-30 20:57 ` Martin KaFai Lau
  2020-07-31 15:51   ` Eric Dumazet
  8 siblings, 1 reply; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-30 20:57 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

This patch is adapted from Eric's patch in an earlier discussion [1].

The TCP_SAVE_SYN currently only stores the network header and
tcp header.  This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.

It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option.  Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".

The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.

[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/tcp.h            | 13 ++++++++-----
 include/net/request_sock.h     |  1 +
 include/uapi/linux/bpf.h       |  1 +
 net/core/filter.c              | 27 ++++++++++++++++++++++-----
 net/ipv4/tcp.c                 |  3 ++-
 net/ipv4/tcp_input.c           | 14 +++++++++++++-
 tools/include/uapi/linux/bpf.h |  1 +
 7 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5528912dc468..7d448800c9f4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -237,14 +237,13 @@ struct tcp_sock {
 		repair      : 1,
 		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
-	u8	syn_data:1,	/* SYN includes data */
+	u8	save_syn:2,	/* Save headers of SYN packet */
+		syn_data:1,	/* SYN includes data */
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
 		syn_fastopen_ch:1, /* Active TFO re-enabling probe */
 		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
-		save_syn:1,	/* Save headers of SYN packet */
-		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
-		syn_smc:1;	/* SYN includes SMC */
+		is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP */
 
 	u32	tcp_tx_delay;	/* delay (in usec) added to TX packets */
@@ -391,6 +390,9 @@ struct tcp_sock {
 #if IS_ENABLED(CONFIG_MPTCP)
 	bool	is_mptcp;
 #endif
+#if IS_ENABLED(CONFIG_SMC)
+	bool	syn_smc;	/* SYN includes SMC */
+#endif
 
 #ifdef CONFIG_TCP_MD5SIG
 /* TCP AF-Specific parts; only used by MD5 Signature support so far */
@@ -486,7 +488,8 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp)
 
 static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn)
 {
-	return saved_syn->network_hdrlen + saved_syn->tcp_hdrlen;
+	return saved_syn->mac_hdrlen + saved_syn->network_hdrlen +
+		saved_syn->tcp_hdrlen;
 }
 
 struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index b1b101814ecb..8c99af5ea891 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -42,6 +42,7 @@ struct request_sock_ops {
 int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req);
 
 struct saved_syn {
+	u32 mac_hdrlen;
 	u32 network_hdrlen;
 	u32 tcp_hdrlen;
 	u8 data[];
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c6162ee573e9..68e71c948dc3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4455,6 +4455,7 @@ enum {
 	 */
 	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
+	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 };
 
 enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index 607f87901122..477192550838 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4682,11 +4682,16 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
 		if (optname == TCP_BPF_SYN) {
 			hdr_start = syn_skb->data;
 			ret = tcp_hdrlen(syn_skb);
-		} else {
-			/* optname == TCP_BPF_SYN_IP */
+		} else if (optname == TCP_BPF_SYN_IP) {
 			hdr_start = skb_network_header(syn_skb);
 			ret = skb_network_header_len(syn_skb) +
 				tcp_hdrlen(syn_skb);
+		} else {
+			/* optname == TCP_BPF_SYN_MAC */
+			hdr_start = skb_mac_header(syn_skb);
+			ret = skb_mac_header_len(syn_skb) +
+				skb_network_header_len(syn_skb) +
+				tcp_hdrlen(syn_skb);
 		}
 	} else {
 		struct sock *sk = bpf_sock->sk;
@@ -4706,12 +4711,24 @@ static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
 
 		if (optname == TCP_BPF_SYN) {
 			hdr_start = saved_syn->data +
+				saved_syn->mac_hdrlen +
 				saved_syn->network_hdrlen;
 			ret = saved_syn->tcp_hdrlen;
+		} else if (optname == TCP_BPF_SYN_IP) {
+			hdr_start = saved_syn->data +
+				saved_syn->mac_hdrlen;
+			ret = saved_syn->network_hdrlen +
+				saved_syn->tcp_hdrlen;
 		} else {
-			/* optname == TCP_BPF_SYN_IP */
+			/* optname == TCP_BPF_SYN_MAC */
+
+			/* TCP_SAVE_SYN may not have saved the mac hdr */
+			if (!saved_syn->mac_hdrlen)
+				return -ENOENT;
+
 			hdr_start = saved_syn->data;
-			ret = saved_syn->network_hdrlen +
+			ret = saved_syn->mac_hdrlen +
+				saved_syn->network_hdrlen +
 				saved_syn->tcp_hdrlen;
 		}
 	}
@@ -4724,7 +4741,7 @@ BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
 	if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP &&
-	    optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_IP) {
+	    optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) {
 		int ret, copy_len = 0;
 		const u8 *start;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 98606f94b01b..b3d275624658 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3210,7 +3210,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 		break;
 
 	case TCP_SAVE_SYN:
-		if (val < 0 || val > 1)
+		/* 0: disable, 1: enable, 2: start from ether_header */
+		if (val < 0 || val > 2)
 			err = -EINVAL;
 		else
 			tp->save_syn = val;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 43520dcaafe3..79ecd61ff613 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6686,12 +6686,24 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 	if (tcp_sk(sk)->save_syn) {
 		u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
 		struct saved_syn *saved_syn;
+		u32 mac_hdrlen;
+		void *base;
+
+		if (tcp_sk(sk)->save_syn == 2) {  /* Save full header. */
+			base = skb_mac_header(skb);
+			mac_hdrlen = skb_mac_header_len(skb);
+			len += mac_hdrlen;
+		} else {
+			base = skb_network_header(skb);
+			mac_hdrlen = 0;
+		}
 
 		saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);
 		if (saved_syn) {
+			saved_syn->mac_hdrlen = mac_hdrlen;
 			saved_syn->network_hdrlen = skb_network_header_len(skb);
 			saved_syn->tcp_hdrlen = tcp_hdrlen(skb);
-			memcpy(saved_syn->data, skb_network_header(skb), len);
+			memcpy(saved_syn->data, base, len);
 			req->saved_syn = saved_syn;
 		}
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c6162ee573e9..68e71c948dc3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4455,6 +4455,7 @@ enum {
 	 */
 	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
+	TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
 };
 
 enum {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 bpf-next 9/9] tcp: bpf: Optionally store mac header in TCP_SAVE_SYN
  2020-07-30 20:57 ` [PATCH v3 bpf-next 9/9] tcp: bpf: Optionally store mac header in TCP_SAVE_SYN Martin KaFai Lau
@ 2020-07-31 15:51   ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2020-07-31 15:51 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Thu, Jul 30, 2020 at 1:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> This patch is adapted from Eric's patch in an earlier discussion [1].
>
> The TCP_SAVE_SYN currently only stores the network header and
> tcp header.  This patch allows it to optionally store
> the mac header also if the setsockopt's optval is 2.
>
> It requires one more bit for the "save_syn" bit field in tcp_sock.
> This patch achieves this by moving the syn_smc bit next to the is_mptcp.
> The syn_smc is currently used with the TCP experimental option.  Since
> syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
> the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
> with "IS_ENABLED(CONFIG_MPTCP)".
>
> The mac_hdrlen is also stored in the "struct saved_syn"
> to allow a quick offset from the bpf prog if it chooses to start
> getting from the network header or the tcp header.
>
> [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn
  2020-07-30 20:57 ` [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
@ 2020-07-31 15:57   ` Eric Dumazet
  2020-07-31 17:31     ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2020-07-31 15:57 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Thu, Jul 30, 2020 at 1:57 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The TCP_SAVE_SYN has both the network header and tcp header.
> The total length of the saved syn packet is currently stored in
> the first 4 bytes (u32) of an array and the actual packet data is
> stored after that.
>
> A latter patch will add a bpf helper that allows to get the tcp header

s/latter/later/

> alone from the saved syn without the network header.  It will be more
> convenient to have a direct offset to a specific header instead of
> re-parsing it.  This requires to separately store the network hdrlen.
> The total header length (i.e. network + tcp) is still needed for the
> current usage in getsockopt.  Although this total length can be obtained
> by looking into the tcphdr and then get the (th->doff << 2), this patch
> chooses to directly store the tcp hdrlen in the second four bytes of
> this newly created "struct saved_syn".  By using a new struct, it can
> give a readable name to each individual header length.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>


> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a018bafd7bdf..6c38ca9de17e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6598,13 +6598,14 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
>  {
>         if (tcp_sk(sk)->save_syn) {
>                 u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
> -               u32 *copy;
> -
> -               copy = kmalloc(len + sizeof(u32), GFP_ATOMIC);
> -               if (copy) {
> -                       copy[0] = len;
> -                       memcpy(&copy[1], skb_network_header(skb), len);
> -                       req->saved_syn = copy;
> +               struct saved_syn *saved_syn;
> +
> +               saved_syn = kmalloc(len + sizeof(*saved_syn), GFP_ATOMIC);

Please use
                  saved_syn = kmalloc(struct_size(saved_syn, data,
len), GFP_ATOMIC)

This will avoid yet another trivial patch in the future.

> +               if (saved_syn) {
> +                       saved_syn->network_hdrlen = skb_network_header_len(skb);
> +                       saved_syn->tcp_hdrlen = tcp_hdrlen(skb);
> +                       memcpy(saved_syn->data, skb_network_header(skb), len);
> +                       req->saved_syn = saved_syn;
>                 }
>         }
>  }
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 bpf-next 6/9] bpf: tcp: Allow bpf prog to write and parse TCP header option
  2020-07-30 20:57 ` [PATCH v3 bpf-next 6/9] bpf: tcp: Allow bpf prog to write and parse TCP header option Martin KaFai Lau
@ 2020-07-31 16:06   ` Eric Dumazet
  2020-07-31 17:59     ` Martin KaFai Lau
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2020-07-31 16:06 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Thu, Jul 30, 2020 at 1:57 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
> algorithm to be written in BPF.  It opens up opportunities to allow
> a faster turnaround time in testing/releasing new congestion control
> ideas to production environment.
>
> The same flexibility can be extended to writing TCP header option.
> It is not uncommon that people want to test new TCP header option
> to improve the TCP performance.  Another use case is for data-center
> that has a more controlled environment and has more flexibility in
> putting header options for internal only use.
>
> For example, we want to test the idea in putting maximum delay
> ACK in TCP header option which is similar to a draft RFC proposal [1].
>
> This patch introduces the necessary BPF API and use them in the
> TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
> and write TCP header options.  It currently supports most of
> the TCP packet except RST.
>
> Supported TCP header option:
> ───────────────────────────
> This patch allows the bpf-prog to write any option kind.
> Different bpf-progs can write its own option by calling the new helper
> bpf_store_hdr_opt().  The helper will ensure there is no duplicated
> option in the header.
>
> By allowing bpf-prog to write any option kind, this gives a lot of
> flexibility to the bpf-prog.  Different bpf-prog can write its
> own option kind.  It could also allow the bpf-prog to support a
> recently standardized option on an older kernel.
>
> Sockops Callback Flags:
> ──────────────────────
> The header parsing and writing callback can be turned on
> by enabling a few newly added callback flags:
>
> BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG:
>         Call bpf when kernel has received a header option that
>         the kernel cannot handle.  It is useful when the peer doesn't
>         send bpf-options very often.
>
>         The bpf-prog can inspect the received header by sock_ops->skb_data
>         which covers the whole header (including the fixed fields like
>         ports, flags...etc) or
>         use the new bpf_load_hdr_opt() to search for a particular TCP
>         header option.
>
>
>
>

> [1]: draft-wang-tcpm-low-latency-opt-00
>      https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/bpf-cgroup.h     |  25 +++
>  include/linux/filter.h         |   4 +
>  include/net/tcp.h              |  53 ++++-
>  include/uapi/linux/bpf.h       | 231 ++++++++++++++++++++-
>  net/core/filter.c              | 365 +++++++++++++++++++++++++++++++++
>  net/ipv4/tcp_fastopen.c        |   2 +-
>  net/ipv4/tcp_input.c           |  86 +++++++-
>  net/ipv4/tcp_ipv4.c            |   3 +-
>  net/ipv4/tcp_minisocks.c       |   1 +
>  net/ipv4/tcp_output.c          | 194 ++++++++++++++++--
>  net/ipv6/tcp_ipv6.c            |   3 +-
>  tools/include/uapi/linux/bpf.h | 231 ++++++++++++++++++++-
>  12 files changed, 1171 insertions(+), 27 deletions(-)

This is a truly gigantic patch.

Could you split it in maybe two parts ?

This way I could focus on the TCP changes, and let eBPF experts focus
on BPF changes.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 bpf-next 4/9] tcp: Add unknown_opt arg to tcp_parse_options
  2020-07-30 20:57 ` [PATCH v3 bpf-next 4/9] tcp: Add unknown_opt arg to tcp_parse_options Martin KaFai Lau
@ 2020-07-31 16:12   ` Eric Dumazet
  2020-07-31 17:37     ` Martin KaFai Lau
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2020-07-31 16:12 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Thu, Jul 30, 2020 at 1:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> In the latter patch, the bpf prog only wants to be called to handle
> a header option if that particular header option cannot be handled by
> the kernel.  This unknown option could be written by the peer's bpf-prog.
> It could also be a new standard option that the running kernel does not
> support it while a bpf-prog can handle it.
>
> In a latter patch, the bpf prog will be called from tcp_validate_incoming()
> if there is unknown option and a flag is set in tp->bpf_sock_ops_cb_flags.
>
> Instead of using skb->cb[] in an earlier attempt, this patch
> adds an optional arg "bool *unknown_opt" to tcp_parse_options().
> The bool will be set to true if it has encountered an option
> that the kernel does not recognize.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  drivers/infiniband/hw/cxgb4/cm.c |  2 +-
>  include/net/tcp.h                |  3 ++-
>  net/ipv4/syncookies.c            |  2 +-
>  net/ipv4/tcp_input.c             | 40 +++++++++++++++++++++-----------
>  net/ipv4/tcp_minisocks.c         |  4 ++--
>  net/ipv6/syncookies.c            |  2 +-
>  6 files changed, 34 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
> index 30e08bcc9afb..dedca6576bb9 100644
> --- a/drivers/infiniband/hw/cxgb4/cm.c
> +++ b/drivers/infiniband/hw/cxgb4/cm.c
> @@ -3949,7 +3949,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
>          */
>         memset(&tmp_opt, 0, sizeof(tmp_opt));
>         tcp_clear_options(&tmp_opt);
> -       tcp_parse_options(&init_net, skb, &tmp_opt, 0, NULL);
> +       tcp_parse_options(&init_net, skb, &tmp_opt, 0, NULL, NULL);
>
>         req = __skb_push(skb, sizeof(*req));
>         memset(req, 0, sizeof(*req));
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 895e7aabf136..d49d8f1c961a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -413,7 +413,8 @@ int tcp_mmap(struct file *file, struct socket *sock,
>  #endif
>  void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
>                        struct tcp_options_received *opt_rx,
> -                      int estab, struct tcp_fastopen_cookie *foc);
> +                      int estab, struct tcp_fastopen_cookie *foc,
> +                      bool *unknown_opt);
>  const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
>

Instead of changing signatures of many functions (and make future
stable backports challenging)
how about adding a field into 'struct tcp_options_received' ?

Sorry for not suggesting this earlier :/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn
  2020-07-31 15:57   ` Eric Dumazet
@ 2020-07-31 17:31     ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2020-07-31 17:31 UTC (permalink / raw)
  To: Eric Dumazet, Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng



On 7/31/20 8:57 AM, Eric Dumazet wrote:
> On Thu, Jul 30, 2020 at 1:57 PM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> The TCP_SAVE_SYN has both the network header and tcp header.
>> The total length of the saved syn packet is currently stored in
>> the first 4 bytes (u32) of an array and the actual packet data is
>> stored after that.
>>
>> A latter patch will add a bpf helper that allows to get the tcp header
> 
> s/latter/later/

Sorry, brain fart. I am using a different wording, that is all ;)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 bpf-next 4/9] tcp: Add unknown_opt arg to tcp_parse_options
  2020-07-31 16:12   ` Eric Dumazet
@ 2020-07-31 17:37     ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-31 17:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Fri, Jul 31, 2020 at 09:12:10AM -0700, Eric Dumazet wrote:
> On Thu, Jul 30, 2020 at 1:58 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > In the latter patch, the bpf prog only wants to be called to handle
> > a header option if that particular header option cannot be handled by
> > the kernel.  This unknown option could be written by the peer's bpf-prog.
> > It could also be a new standard option that the running kernel does not
> > support it while a bpf-prog can handle it.
> >
> > In a latter patch, the bpf prog will be called from tcp_validate_incoming()
> > if there is unknown option and a flag is set in tp->bpf_sock_ops_cb_flags.
> >
> > Instead of using skb->cb[] in an earlier attempt, this patch
> > adds an optional arg "bool *unknown_opt" to tcp_parse_options().
> > The bool will be set to true if it has encountered an option
> > that the kernel does not recognize.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  drivers/infiniband/hw/cxgb4/cm.c |  2 +-
> >  include/net/tcp.h                |  3 ++-
> >  net/ipv4/syncookies.c            |  2 +-
> >  net/ipv4/tcp_input.c             | 40 +++++++++++++++++++++-----------
> >  net/ipv4/tcp_minisocks.c         |  4 ++--
> >  net/ipv6/syncookies.c            |  2 +-
> >  6 files changed, 34 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
> > index 30e08bcc9afb..dedca6576bb9 100644
> > --- a/drivers/infiniband/hw/cxgb4/cm.c
> > +++ b/drivers/infiniband/hw/cxgb4/cm.c
> > @@ -3949,7 +3949,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
> >          */
> >         memset(&tmp_opt, 0, sizeof(tmp_opt));
> >         tcp_clear_options(&tmp_opt);
> > -       tcp_parse_options(&init_net, skb, &tmp_opt, 0, NULL);
> > +       tcp_parse_options(&init_net, skb, &tmp_opt, 0, NULL, NULL);
> >
> >         req = __skb_push(skb, sizeof(*req));
> >         memset(req, 0, sizeof(*req));
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index 895e7aabf136..d49d8f1c961a 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -413,7 +413,8 @@ int tcp_mmap(struct file *file, struct socket *sock,
> >  #endif
> >  void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
> >                        struct tcp_options_received *opt_rx,
> > -                      int estab, struct tcp_fastopen_cookie *foc);
> > +                      int estab, struct tcp_fastopen_cookie *foc,
> > +                      bool *unknown_opt);
> >  const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
> >
> 
> Instead of changing signatures of many functions (and make future
> stable backports challenging)
> how about adding a field into 'struct tcp_options_received' ?
Sounds good.  There is a one byte hole in 'struct tcp_options_received',
so it won't matter much even there is "rx_opt" in "struct tcp_sock".

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 bpf-next 6/9] bpf: tcp: Allow bpf prog to write and parse TCP header option
  2020-07-31 16:06   ` Eric Dumazet
@ 2020-07-31 17:59     ` Martin KaFai Lau
  0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2020-07-31 17:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, Neal Cardwell, netdev, Yuchung Cheng

On Fri, Jul 31, 2020 at 09:06:57AM -0700, Eric Dumazet wrote:
> On Thu, Jul 30, 2020 at 1:57 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
> > algorithm to be written in BPF.  It opens up opportunities to allow
> > a faster turnaround time in testing/releasing new congestion control
> > ideas to production environment.
> >
> > The same flexibility can be extended to writing TCP header option.
> > It is not uncommon that people want to test new TCP header option
> > to improve the TCP performance.  Another use case is for data-center
> > that has a more controlled environment and has more flexibility in
> > putting header options for internal only use.
> >
> > For example, we want to test the idea in putting maximum delay
> > ACK in TCP header option which is similar to a draft RFC proposal [1].
> >
> > This patch introduces the necessary BPF API and use them in the
> > TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
> > and write TCP header options.  It currently supports most of
> > the TCP packet except RST.
> >
> > Supported TCP header option:
> > ───────────────────────────
> > This patch allows the bpf-prog to write any option kind.
> > Different bpf-progs can write its own option by calling the new helper
> > bpf_store_hdr_opt().  The helper will ensure there is no duplicated
> > option in the header.
> >
> > By allowing bpf-prog to write any option kind, this gives a lot of
> > flexibility to the bpf-prog.  Different bpf-prog can write its
> > own option kind.  It could also allow the bpf-prog to support a
> > recently standardized option on an older kernel.
> >
> > Sockops Callback Flags:
> > ──────────────────────
> > The header parsing and writing callback can be turned on
> > by enabling a few newly added callback flags:
> >
> > BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG:
> >         Call bpf when kernel has received a header option that
> >         the kernel cannot handle.  It is useful when the peer doesn't
> >         send bpf-options very often.
> >
> >         The bpf-prog can inspect the received header by sock_ops->skb_data
> >         which covers the whole header (including the fixed fields like
> >         ports, flags...etc) or
> >         use the new bpf_load_hdr_opt() to search for a particular TCP
> >         header option.
> >
> >
> >
> >
> 
> > [1]: draft-wang-tcpm-low-latency-opt-00
> >      https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_draft-2Dwang-2Dtcpm-2Dlow-2Dlatency-2Dopt-2D00&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=VQnoQ7LvghIj0gVEaiQSUw&m=Z-syoz304fodO8xPKCcJh0QYhXbb7_XVuRgTINFba2U&s=Ad66Zb5r0utWgnrB-QuDXBft6G1HXW2C_aBV9fTMxoo&e= 
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  include/linux/bpf-cgroup.h     |  25 +++
> >  include/linux/filter.h         |   4 +
> >  include/net/tcp.h              |  53 ++++-
> >  include/uapi/linux/bpf.h       | 231 ++++++++++++++++++++-
> >  net/core/filter.c              | 365 +++++++++++++++++++++++++++++++++
> >  net/ipv4/tcp_fastopen.c        |   2 +-
> >  net/ipv4/tcp_input.c           |  86 +++++++-
> >  net/ipv4/tcp_ipv4.c            |   3 +-
> >  net/ipv4/tcp_minisocks.c       |   1 +
> >  net/ipv4/tcp_output.c          | 194 ++++++++++++++++--
> >  net/ipv6/tcp_ipv6.c            |   3 +-
> >  tools/include/uapi/linux/bpf.h | 231 ++++++++++++++++++++-
> >  12 files changed, 1171 insertions(+), 27 deletions(-)
> 
> This is a truly gigantic patch.
> 
> Could you split it in maybe two parts ?
Yes.

Most of the code changes in TCP are calling out the bpf prog to parse and
write header.  Thus, they are all in this one patch.

I will put those callout changes (and a few func arg changes) in TCP
to a separate patch but leave the bpf callout function empty.

Then the next bpf specific patch will fill out those empty bpf
callout functions.

> 
> This way I could focus on the TCP changes, and let eBPF experts focus
> on BPF changes.
Thanks for the review!

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, back to index

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-30 20:56 [PATCH v3 bpf-next 0/9] BPF TCP header options Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 1/9] tcp: Use a struct to represent a saved_syn Martin KaFai Lau
2020-07-31 15:57   ` Eric Dumazet
2020-07-31 17:31     ` Eric Dumazet
2020-07-30 20:57 ` [PATCH v3 bpf-next 2/9] tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 3/9] tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 4/9] tcp: Add unknown_opt arg to tcp_parse_options Martin KaFai Lau
2020-07-31 16:12   ` Eric Dumazet
2020-07-31 17:37     ` Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 5/9] bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 6/9] bpf: tcp: Allow bpf prog to write and parse TCP header option Martin KaFai Lau
2020-07-31 16:06   ` Eric Dumazet
2020-07-31 17:59     ` Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 7/9] bpf: selftests: Add fastopen_connect to network_helpers Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 8/9] bpf: selftests: tcp header options Martin KaFai Lau
2020-07-30 20:57 ` [PATCH v3 bpf-next 9/9] tcp: bpf: Optionally store mac header in TCP_SAVE_SYN Martin KaFai Lau
2020-07-31 15:51   ` Eric Dumazet

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git