* [PATCH 0/3] ipsec: Add ESP over TCP encapsulation
@ 2018-01-11 13:21 Herbert Xu
  2018-01-11 13:21 ` [PATCH 1/3] skbuff: Avoid sleeping in skb_send_sock_locked Herbert Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Herbert Xu @ 2018-01-11 13:21 UTC (permalink / raw)
  To: Steffen Klassert, netdev

Hi:

This series of patches adds basic support for ESP over TCP (RFC 8229).
Note that this does not include TLS support, but it could be added
in the future.

Here is an iproute patch to set up xfrm states with this:

diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index 12c2f72..f3fb1e2 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -738,6 +738,9 @@ void xfrm_xfrma_print(struct rtattr *tb[], __u16 family,
 		case 2:
 			fprintf(fp, "espinudp ");
 			break;
+		case 6:
+			fprintf(fp, "espintcp ");
+			break;
 		default:
 			fprintf(fp, "%u ", e->encap_type);
 			break;
@@ -1182,6 +1185,8 @@ int xfrm_encap_type_parse(__u16 *type, int *argcp, char ***argvp)
 		*type = 1;
 	else if (strcmp(*argv, "espinudp") == 0)
 		*type = 2;
+	else if (strcmp(*argv, "espintcp") == 0)
+		*type = 6;
 	else
 		invarg("ENCAP-TYPE value is invalid", *argv);
 

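For reference, with the above iproute patch applied an xfrm state using
the new encap type could be set up along these lines (untested sketch;
addresses, SPI and keys are placeholders):

ip xfrm state add src 192.0.2.1 dst 192.0.2.2 proto esp spi 0x1000 \
	mode tunnel enc 'cbc(aes)' 0x00112233445566778899aabbccddeeff \
	auth 'hmac(sha1)' 0x0011223344556677889900112233445566778899 \
	encap espintcp 4500 4500 0.0.0.0
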
Here is a sample program for setting up the TCP socket to use this.
Note that it doesn't send the magic word (the "IKETCP" stream prefix)
required by RFC 8229, so you'll need to add that for a real key
manager.

#include <arpa/inet.h>
#include <errno.h>
#include <error.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define TCP_ENCAP 35

int main(int argc, char **argv)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(4500),
	};
	char buf[4096];
	int one = 1;
	int err;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		error(-1, errno, "socket");

	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		error(-1, errno, "bind");

	if (argc > 1) {
		addr.sin_addr.s_addr = inet_addr(argv[1]);
		if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
			error(-1, errno, "connect");
	} else {
		if (listen(s, 0) < 0)
			error(-1, errno, "listen");

		s = accept(s, NULL, 0);
		if (s < 0)
			error(-1, errno, "accept");
	}

	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
		error(-1, errno, "TCP_NODELAY");

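	/* Mark the established TCP socket as an ESP-in-TCP encapsulation
	 * socket; TCP_ENCAP is the socket option added by this series.
	 */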
	if (setsockopt(s, SOL_TCP, TCP_ENCAP, NULL, 0) < 0)
		error(-1, errno, "TCP_ENCAP");

	while ((err = read(s, buf, sizeof(buf))) > 0)
		;

	if (err < 0)
		error(-1, errno, "read");

	return 0;
}
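
For completeness, here is a rough sketch of what sending the RFC 8229
stream prefix ("IKETCP") could look like on the client side, right
after connect() succeeds in the sample above.  This is only an
illustration and not part of the patches:

	static const char prefix[] = "IKETCP";	/* RFC 8229 stream prefix */

	if (write(s, prefix, sizeof(prefix) - 1) != sizeof(prefix) - 1)
		error(-1, errno, "stream prefix");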


Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* [PATCH 1/3] skbuff: Avoid sleeping in skb_send_sock_locked
  2018-01-11 13:21 [PATCH 0/3] ipsec: Add ESP over TCP encapsulation Herbert Xu
@ 2018-01-11 13:21 ` Herbert Xu
  2018-01-11 13:21 ` [PATCH 2/3] tcp: Add ESP encapsulation support Herbert Xu
  2018-01-11 13:21 ` [PATCH 3/3] ipsec: Add ESP over TCP " Herbert Xu
  2 siblings, 0 replies; 7+ messages in thread
From: Herbert Xu @ 2018-01-11 13:21 UTC (permalink / raw)
  To: Steffen Klassert, netdev

For a function that needs to be called with the socket spinlock
held, sleeping would seem to be a bad idea.  This function does
in fact avoid sleeping when calling kernel_sendpage_locked on the
page part of the skb.  However, it doesn't do that when sending
the linear part, which results in sleeping when the socket send
buffer is full.

This patch fixes it by setting the MSG_DONTWAIT flag when calling
kernel_sendmsg_locked.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/core/skbuff.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6b0ff39..8197b7a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2279,6 +2279,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 		kv.iov_base = skb->data + offset;
 		kv.iov_len = slen;
 		memset(&msg, 0, sizeof(msg));
+		msg.msg_flags = MSG_DONTWAIT;
 
 		ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
 		if (ret <= 0)


* [PATCH 2/3] tcp: Add ESP encapsulation support
  2018-01-11 13:21 [PATCH 0/3] ipsec: Add ESP over TCP encapsulation Herbert Xu
  2018-01-11 13:21 ` [PATCH 1/3] skbuff: Avoid sleeping in skb_send_sock_locked Herbert Xu
@ 2018-01-11 13:21 ` Herbert Xu
  2018-01-12 16:38   ` Eric Dumazet
  2018-01-11 13:21 ` [PATCH 3/3] ipsec: Add ESP over TCP " Herbert Xu
  2 siblings, 1 reply; 7+ messages in thread
From: Herbert Xu @ 2018-01-11 13:21 UTC (permalink / raw)
  To: Steffen Klassert, netdev

This patch adds the plumbing in TCP for ESP encapsulation support
per RFC8229.

The patch mostly deals with inbound processing, as well as enabling
TCP encapsulation on a socket through setsockopt.  The outbound
processing is dealt with in the ESP code as is done for UDP.

The inbound processing is split into two halves.  First of all,
the softirq path directly intercepts ESP packets and feeds them
into the IPsec stack.  Most of the time the packet will be freed
right away if it contains complete ESP packets.  However, if
the message is incomplete or it contains non-ESP data, then the
skb will be added to the receive queue.  We also add packets to
the receive queue if it is currently non-empty, in order to
preserve sequence number continuity and minimise the changes
to the TCP code.

On the user-space facing side, packets marked as ESP-only are
skipped and not visible to user-space.  However, some ESP data
may seep through.  For example, if we receive a partial message
then we will always give it to user-space regardless of whether
it turns out to be ESP or not.  So user-space should be prepared
to skip ESP messages (SPI != 0).
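
As a rough illustration (not part of this patch), a key manager
reading from such a socket could skip the leaked ESP records like
this, where each record carries a 16-bit big-endian length that
includes the length field itself, handle_ike() is a hypothetical
callback, and short reads are ignored for brevity:

static void drain_espintcp(int fd)
{
	unsigned char buf[65536];
	uint16_t len;
	uint32_t spi;

	while (read(fd, &len, sizeof(len)) == sizeof(len)) {
		len = ntohs(len) - 2;		/* length covers the 2-byte header */
		if (len < 4 || read(fd, buf, len) != len)
			break;
		memcpy(&spi, buf, sizeof(spi));
		if (spi != 0)			/* ESP record (SPI != 0): skip */
			continue;
		handle_ike(buf + 4, len - 4);	/* IKE follows the zero marker */
	}
}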

There is a little bit of code dealing with the encapsulation side.
In particular, if encapsulation data comes in while the socket
is owned by user-space, the packets will be stored in tp->encap_out
and processed during release_sock.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/tcp.h      |   15 ++
 include/net/tcp.h        |   27 +++
 include/uapi/linux/tcp.h |    1 
 include/uapi/linux/udp.h |    1 
 net/ipv4/tcp.c           |   68 +++++++++
 net/ipv4/tcp_input.c     |  326 +++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv4/tcp_ipv4.c      |    1 
 net/ipv4/tcp_output.c    |   48 ++++++
 8 files changed, 473 insertions(+), 14 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index ca4a636..1360a0e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
 		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-		unused:2;
+		encap:1,	/* TCP IKE/ESP encapsulation */
+		encap_lenhi_valid:1;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		unused1	    : 1,
@@ -373,6 +374,16 @@ struct tcp_sock {
 	 */
 	struct request_sock *fastopen_rsk;
 	u32	*saved_syn;
+
+#ifdef CONFIG_XFRM
+/* TCP ESP encapsulation */
+	struct sk_buff *encap_in;
+	struct sk_buff_head encap_out;
+	u32	encap_seq;
+	u32	encap_last;
+	u16	encap_backlog;
+	u8	encap_lenhi;
+#endif
 };
 
 enum tsq_enum {
@@ -384,6 +395,7 @@ enum tsq_enum {
 	TCP_MTU_REDUCED_DEFERRED,  /* tcp_v{4|6}_err() could not call
 				    * tcp_v{4|6}_mtu_reduced()
 				    */
+	TCP_ESP_DEFERRED,	   /* esp_output_tcp_encap2 queued packets */
 };
 
 enum tsq_flags {
@@ -393,6 +405,7 @@ enum tsq_flags {
 	TCPF_WRITE_TIMER_DEFERRED	= (1UL << TCP_WRITE_TIMER_DEFERRED),
 	TCPF_DELACK_TIMER_DEFERRED	= (1UL << TCP_DELACK_TIMER_DEFERRED),
 	TCPF_MTU_REDUCED_DEFERRED	= (1UL << TCP_MTU_REDUCED_DEFERRED),
+	TCPF_ESP_DEFERRED		= (1UL << TCP_ESP_DEFERRED),
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6da880d..6513ae2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,7 @@ int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
 			size_t size, int flags);
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags);
+int tcp_encap_output(struct sock *sk, struct sk_buff *skb);
 void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
@@ -399,6 +400,7 @@ int compat_tcp_setsockopt(struct sock *sk, int level, int optname,
 			  char __user *optval, unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
+void tcp_cleanup_rbuf(struct sock *sk, int copied);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
@@ -789,7 +791,8 @@ struct tcp_skb_cb {
 	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
 			eor:1,		/* Is skb MSG_EOR marked? */
 			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
-			unused:5;
+			esp_skip:1,	/* SKB is pure ESP */
+			unused:4;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct {
@@ -2062,4 +2065,26 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
+
+#ifdef CONFIG_XFRM
+DECLARE_STATIC_KEY_FALSE(tcp_encap_needed);
+
+int tcp_encap_enable(struct sock *sk);
+
+static inline bool tcp_esp_skipped(struct sk_buff *skb)
+{
+	return TCP_SKB_CB(skb)->esp_skip;
+}
+#else
+static inline int tcp_encap_enable(struct sock *sk)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline bool tcp_esp_skipped(struct sk_buff *skb)
+{
+	return false;
+}
+#endif
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..769cab0 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,7 @@ enum {
 #define TCP_MD5SIG_EXT		32	/* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY	33	/* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE	34	/* Enable TFO without a TFO cookie */
+#define TCP_ENCAP		35	/* Set the socket to accept encapsulated packets */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index efb7b59..1102846 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -39,5 +39,6 @@ struct udphdr {
 #define UDP_ENCAP_L2TPINUDP	3 /* rfc2661 */
 #define UDP_ENCAP_GTP0		4 /* GSM TS 09.60 */
 #define UDP_ENCAP_GTP1U		5 /* 3GPP TS 29.060 */
+#define TCP_ENCAP_ESPINTCP	6 /* Yikes, this is really xfrm encap types. */
 
 #endif /* _UAPI_LINUX_UDP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f08eebe..032b46c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1545,7 +1545,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
  * calculation of whether or not we must ACK for the sake of
  * a window update.
  */
-static void tcp_cleanup_rbuf(struct sock *sk, int copied)
+void tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
@@ -1627,6 +1627,35 @@ static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 	return NULL;
 }
 
+#ifdef CONFIG_XFRM
+static void __tcp_esp_skip(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	unsigned offset;
+	unsigned used;
+
+	while ((skb = tcp_recv_skb(sk, tp->copied_seq, &offset)) &&
+	       tcp_esp_skipped(skb)) {
+		used = skb->len - offset;
+		tp->copied_seq += used;
+		tcp_rcv_space_adjust(sk);
+		sk_eat_skb(sk, skb);
+	}
+}
+
+static inline void tcp_esp_skip(struct sock *sk, int flags)
+{
+	if (static_branch_unlikely(&tcp_encap_needed) &&
+	    tcp_sk(sk)->encap && !(flags & MSG_PEEK))
+		__tcp_esp_skip(sk);
+}
+#else
+static inline void tcp_esp_skip(struct sock *sk, int flags)
+{
+}
+#endif
+
 /*
  * This routine provides an alternative to tcp_recvmsg() for routines
  * that would like to handle copying from skbuffs directly in 'sendfile'
@@ -1650,7 +1679,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
 	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
-		if (offset < skb->len) {
+		if (tcp_esp_skipped(skb))
+			seq += skb->len - offset;
+		else if (offset < skb->len) {
 			int used;
 			size_t len;
 
@@ -1704,6 +1735,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	/* Clean up data we have read: This will do ACK frames. */
 	if (copied > 0) {
 		tcp_recv_skb(sk, seq, &offset);
+		tcp_esp_skip(sk, 0);
 		tcp_cleanup_rbuf(sk, copied);
 	}
 	return copied;
@@ -1946,6 +1978,13 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	found_ok_skb:
 		/* Ok so how much can we use? */
 		used = skb->len - offset;
+
+		if (tcp_esp_skipped(skb)) {
+			*seq += used;
+			urg_hole += used;
+			goto skip_copy;
+		}
+
 		if (len < used)
 			used = len;
 
@@ -2009,6 +2048,8 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		break;
 	} while (len > 0);
 
+	tcp_esp_skip(sk, flags);
+
 	/* According to UNIX98, msg_name/msg_namelen are ignored
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
@@ -2146,6 +2187,21 @@ bool tcp_check_oom(struct sock *sk, int shift)
 	return too_many_orphans || out_of_socket_memory;
 }
 
+#ifdef CONFIG_XFRM
+static inline void tcp_encap_free(struct tcp_sock *tp)
+{
+	struct sk_buff *skb;
+
+	kfree_skb(tp->encap_in);
+	while ((skb = __skb_dequeue(&tp->encap_out)) != NULL)
+		__kfree_skb(skb);
+}
+#else
+static inline void tcp_encap_free(struct tcp_sock *tp)
+{
+}
+#endif
+
 void tcp_close(struct sock *sk, long timeout)
 {
 	struct sk_buff *skb;
@@ -2177,6 +2233,8 @@ void tcp_close(struct sock *sk, long timeout)
 		__kfree_skb(skb);
 	}
 
+	tcp_encap_free(tcp_sk(sk));
+
 	sk_mem_reclaim(sk);
 
 	/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
@@ -2583,6 +2641,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 
 		return tcp_fastopen_reset_cipher(net, sk, key, sizeof(key));
 	}
+	case TCP_ENCAP:
+		if (sk->sk_state == TCP_ESTABLISHED)
+			return tcp_encap_enable(sk);
+		else
+			return -ENOTCONN;
+		break;
 	default:
 		/* fallthru */
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9550cc4..22c9f70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,12 +72,14 @@
 #include <linux/prefetch.h>
 #include <net/dst.h>
 #include <net/tcp.h>
+#include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <uapi/linux/udp.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -110,6 +112,10 @@
 #define REXMIT_LOST	1 /* retransmit packets marked lost */
 #define REXMIT_NEW	2 /* FRTO-style transmit of unsent/new packets */
 
+#ifdef CONFIG_XFRM
+DEFINE_STATIC_KEY_FALSE(tcp_encap_needed);
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -4294,6 +4300,314 @@ static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 	__kfree_skb(skb);
 }
 
+#ifdef CONFIG_XFRM
+static void tcp_set_encap_seq(struct tcp_sock *tp, struct sk_buff *skb,
+			      unsigned offset, __be16 len)
+{
+	while ((offset += min(be16_to_cpu(len), 2)) + 1 < skb->len)
+		skb_copy_bits(skb, offset, &len, 2);
+
+	if (skb->len <= offset) {
+		tp->encap_seq = TCP_SKB_CB(skb)->seq + offset;
+		return;
+	}
+
+	skb_copy_bits(skb, offset, &tp->encap_lenhi, 1);
+	tp->encap_lenhi_valid = true;
+}
+
+static void tcp_encap_error(struct tcp_sock *tp, struct sk_buff *skb,
+			    unsigned offset)
+{
+	struct sk_buff *prev = tp->encap_in;
+	union {
+		u8 bytes[2];
+		__be16 len;
+	} hdr;
+
+	if (!prev) {
+		tcp_set_encap_seq(tp, skb, offset - 2, 0);
+		return;
+	}
+
+	if (prev->len == 1) {
+		skb_copy_bits(prev, 0, &hdr.bytes[0], 1);
+		skb_copy_bits(skb, offset, &hdr.bytes[1], 1);
+		tcp_set_encap_seq(tp, skb, offset - 1, hdr.len);
+	}
+
+	__kfree_skb(prev);
+	tp->encap_in = NULL;
+}
+
+static void tcp_encap_error_free(struct tcp_sock *tp, struct sk_buff *skb,
+			       unsigned offset)
+{
+	tcp_encap_error(tp, skb, offset);
+	__kfree_skb(skb);
+}
+
+static int tcp_decap_skb(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	/* Get rid of length field to get pure ESP. */
+	if (!__pskb_pull(skb, 2))
+		return -ENOMEM;
+	skb_reset_transport_header(skb);
+
+	rcu_read_lock();
+	skb->dev = dev_get_by_index_rcu(sock_net((struct sock *)tp),
+					skb->skb_iif);
+	if (skb->dev)
+		xfrm4_rcv_encap(skb, IPPROTO_ESP, 0, TCP_ENCAP_ESPINTCP);
+	rcu_read_unlock();
+	return 0;
+}
+
+static bool __tcp_encap_process(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	struct sock *sk = (void *)tp;
+	struct {
+		union {
+			u8 bytes[2];
+			__be16 len;
+		};
+		__be32 spi;
+	} hdr;
+	struct sk_buff *prev;
+	bool eaten = false;
+	unsigned headlen;
+	unsigned offset2;
+	unsigned offset;
+	bool fragstolen;
+	int delta;
+
+	offset = tp->encap_last - TCP_SKB_CB(skb)->seq;
+	if (unlikely(skb->len <= offset))
+		return false;
+
+	tp->encap_last = TCP_SKB_CB(skb)->seq + skb->len;
+
+	if (unlikely(tp->encap_lenhi_valid)) {
+		tp->encap_lenhi_valid = false;
+		hdr.bytes[0] = tp->encap_lenhi;
+		skb_copy_bits(skb, offset, &hdr.bytes[1], 1);
+		tcp_set_encap_seq(tp, skb, offset - 1, hdr.len);
+		return false;
+	}
+
+	if (unlikely(tp->urg_data))
+		goto slow_path;
+
+	if (unlikely(tp->encap_in))
+		goto slow_path;
+
+	offset = tp->encap_seq - TCP_SKB_CB(skb)->seq;
+	if (unlikely(skb->len <= offset))
+		return false;
+
+	if (unlikely(offset))
+		goto slow_path;
+
+	if (unlikely(skb_has_frag_list(skb)))
+		goto slow_path;
+
+	offset2 = 0;
+
+	do {
+		if (unlikely(skb->len < sizeof(hdr)))
+			goto slow_path;
+
+		skb_copy_bits(skb, offset2, &hdr, sizeof(hdr));
+		offset2 += be16_to_cpu(hdr.len);
+		if (skb->len < offset2)
+			goto slow_path;
+
+		if (!hdr.spi)
+			goto slow_path;
+	} while (skb->len > offset2);
+
+	if (offset2 != be16_to_cpu(hdr.len))
+		goto slow_path;
+
+	tp->encap_seq = TCP_SKB_CB(skb)->seq + skb->len;
+
+	if (!skb_peek_tail(&sk->sk_receive_queue) &&
+	    !(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) {
+		tp->copied_seq = tp->encap_seq;
+		tcp_rcv_space_adjust(sk);
+		tcp_cleanup_rbuf(sk, skb->len);
+		eaten = true;
+	}
+
+	TCP_SKB_CB(skb)->esp_skip = 1;
+
+	skb = skb_clone(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return eaten;
+
+	if (unlikely(tcp_decap_skb(tp, skb)))
+		__kfree_skb(skb);
+
+	return eaten;
+
+slow_path:
+	headlen = -(skb_mac_header_was_set(skb) ? skb_mac_offset(skb) :
+						  skb_network_offset(skb));
+	__skb_push(skb, headlen);
+	prev = skb;
+
+	skb = pskb_copy(prev, GFP_ATOMIC);
+	__skb_pull(prev, headlen);
+
+	if (!skb) {
+		tcp_encap_error(tp, prev, offset);
+		return false;
+	}
+
+	__skb_pull(skb, headlen);
+	skb->mac_len = prev->mac_len;
+
+	if (!__pskb_pull(skb, offset)) {
+		tcp_encap_error_free(tp, skb, offset);
+		return false;
+	}
+
+	TCP_SKB_CB(skb)->seq += offset;
+	prev = tp->encap_in;
+	tp->encap_in = NULL;
+
+	if (!prev)
+		prev = skb;
+	else if (skb_try_coalesce(prev, skb, &fragstolen, &delta))
+		kfree_skb_partial(skb, fragstolen);
+	else {
+		skb_shinfo(prev)->frag_list = skb;
+		prev->data_len += skb->len;
+		prev->len += skb->len;
+		prev->truesize += skb->truesize;
+	}
+
+	/* We could do a list instead of linearising, but that would
+	 * open the door to abuses such as a stream of single-byte
+	 * datagrams up to 64K.
+	 */
+	if (skb_has_frag_list(prev) && __skb_linearize(prev)) {
+		tcp_encap_error_free(tp, prev, 0);
+		return false;
+	}
+
+	headlen = -(skb_mac_header_was_set(prev) ? skb_mac_offset(prev) :
+						   skb_network_offset(prev));
+
+	while (prev->len >= sizeof(hdr.len)) {
+		skb_copy_bits(prev, 0, &hdr,
+			      min((unsigned)sizeof(hdr), prev->len));
+
+		offset = be16_to_cpu(hdr.len);
+		tp->encap_seq = TCP_SKB_CB(prev)->seq + offset;
+
+		if (prev->len < offset)
+			break;
+
+		skb = prev;
+		if (prev->len > offset) {
+			int nsize = skb_headlen(skb) - offset;
+
+			if (nsize < 0)
+				nsize = 0;
+
+			prev = alloc_skb(nsize + headlen, GFP_ATOMIC);
+			if (!prev) {
+				tcp_encap_error_free(tp, skb, offset);
+				return false;
+			}
+
+			/* Slap on a header on each message. */
+			if (skb_mac_header_was_set(skb)) {
+				skb_reset_mac_header(prev);
+				skb_set_network_header(
+					prev, skb_mac_header_len(skb));
+				prev->mac_len = skb->mac_len;
+			} else
+				skb_reset_network_header(prev);
+			memcpy(__skb_put(prev, headlen),
+			       skb->data - headlen, headlen);
+			__skb_pull(prev, headlen);
+
+			nsize = skb->len - offset - nsize;
+
+			skb_split(skb, prev, offset);
+			skb->truesize -= nsize;
+			prev->truesize += nsize;
+			prev->skb_iif = skb->skb_iif;
+			TCP_SKB_CB(prev)->seq = TCP_SKB_CB(skb)->seq + offset;
+		}
+
+		if (!hdr.spi || tcp_decap_skb(tp, skb))
+			__kfree_skb(skb);
+
+		if (prev == skb)
+			return eaten;
+	}
+
+	tp->encap_in = prev;
+
+	return false;
+}
+
+int tcp_encap_enable(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+
+	lock_sock(sk);
+
+	if (tp->encap)
+		goto out;
+
+	__skb_queue_head_init(&tp->encap_out);
+
+	tp->encap_last = tp->encap_seq = tp->copied_seq;
+
+	skb_queue_walk(&sk->sk_receive_queue, skb)
+		__tcp_encap_process(tp, skb);
+
+	tp->encap = 1;
+	static_branch_enable(&tcp_encap_needed);
+
+out:
+	release_sock(sk);
+
+	return 0;
+}
+
+static inline bool tcp_encap_process(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	if (static_branch_unlikely(&tcp_encap_needed) && tp->encap)
+		return __tcp_encap_process(tp, skb);
+
+	return false;
+}
+#else
+static inline bool tcp_encap_process(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	return false;
+}
+#endif
+
+static bool tcp_eat_skb(struct sock *sk, struct sk_buff *skb, bool *fragstolen)
+{
+	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+	struct tcp_sock *tp = tcp_sk(sk);
+	bool eaten;
+
+	eaten = tcp_encap_process(tp, skb) ||
+		(tail && tcp_try_coalesce(sk, tail, skb, fragstolen));
+	tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
+
+	return eaten;
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4302,7 +4616,7 @@ static void tcp_ofo_queue(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 dsack_high = tp->rcv_nxt;
 	bool fin, fragstolen, eaten;
-	struct sk_buff *skb, *tail;
+	struct sk_buff *skb;
 	struct rb_node *p;
 
 	p = rb_first(&tp->out_of_order_queue);
@@ -4329,9 +4643,7 @@ static void tcp_ofo_queue(struct sock *sk)
 			   tp->rcv_nxt, TCP_SKB_CB(skb)->seq,
 			   TCP_SKB_CB(skb)->end_seq);
 
-		tail = skb_peek_tail(&sk->sk_receive_queue);
-		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
-		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
+		eaten = tcp_eat_skb(sk, skb, &fragstolen);
 		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
 		if (!eaten)
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
@@ -4508,13 +4820,9 @@ static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int
 		  bool *fragstolen)
 {
 	int eaten;
-	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
 
 	__skb_pull(skb, hdrlen);
-	eaten = (tail &&
-		 tcp_try_coalesce(sk, tail,
-				  skb, fragstolen)) ? 1 : 0;
-	tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
+	eaten = tcp_eat_skb(sk, skb, fragstolen);
 	if (!eaten) {
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
 		skb_set_owner_r(skb, sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 77ea45d..a613ff4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1617,6 +1617,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 	TCP_SKB_CB(skb)->sacked	 = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
 			skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
+	TCP_SKB_CB(skb)->esp_skip = 0;
 }
 
 /*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a4d214c..66e1121 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -830,10 +830,53 @@ static void tcp_tasklet_func(unsigned long data)
 	}
 }
 
+#ifdef CONFIG_XFRM
+int tcp_encap_output(struct sock *sk, struct sk_buff *skb)
+{
+	int offset;
+	unsigned len;
+
+	if (sk->sk_state != TCP_ESTABLISHED)
+		return -ECONNRESET;
+
+	offset = skb_transport_offset(skb);
+	len = skb->len - offset;
+
+	*(__be16 *)skb_transport_header(skb) = cpu_to_be16(len);
+
+	offset = skb_send_sock_locked(sk, skb, offset, len);
+	if (offset >= 0) {
+		__kfree_skb(skb);
+		offset = 0;
+	}
+
+	return offset;
+}
+EXPORT_SYMBOL(tcp_encap_output);
+
+static void tcp_process_encap(struct sock *sk)
+{
+	struct sk_buff_head queue;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&queue);
+	skb_queue_splice_init(&tcp_sk(sk)->encap_out, &queue);
+
+	while ((skb = __skb_dequeue(&queue)))
+		if (tcp_encap_output(sk, skb))
+			__kfree_skb(skb);
+}
+#else
+static inline void tcp_process_encap(struct sock *sk)
+{
+}
+#endif
+
 #define TCP_DEFERRED_ALL (TCPF_TSQ_DEFERRED |		\
 			  TCPF_WRITE_TIMER_DEFERRED |	\
 			  TCPF_DELACK_TIMER_DEFERRED |	\
-			  TCPF_MTU_REDUCED_DEFERRED)
+			  TCPF_MTU_REDUCED_DEFERRED |	\
+			  TCPF_ESP_DEFERRED)
 /**
  * tcp_release_cb - tcp release_sock() callback
  * @sk: socket
@@ -879,6 +922,8 @@ void tcp_release_cb(struct sock *sk)
 		inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
 		__sock_put(sk);
 	}
+	if (flags & TCPF_ESP_DEFERRED)
+		tcp_process_encap(sk);
 }
 EXPORT_SYMBOL(tcp_release_cb);
 
@@ -1609,6 +1654,7 @@ unsigned int tcp_current_mss(struct sock *sk)
 
 	return mss_now;
 }
+EXPORT_SYMBOL(tcp_current_mss);
 
 /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
  * As additional protections, we do not touch cwnd in retransmission phases,


* [PATCH 3/3] ipsec: Add ESP over TCP encapsulation support
  2018-01-11 13:21 [PATCH 0/3] ipsec: Add ESP over TCP encapsulation Herbert Xu
  2018-01-11 13:21 ` [PATCH 1/3] skbuff: Avoid sleeping in skb_send_sock_locked Herbert Xu
  2018-01-11 13:21 ` [PATCH 2/3] tcp: Add ESP encapsulation support Herbert Xu
@ 2018-01-11 13:21 ` Herbert Xu
  2 siblings, 0 replies; 7+ messages in thread
From: Herbert Xu @ 2018-01-11 13:21 UTC (permalink / raw)
  To: Steffen Klassert, netdev

This patch adds support for ESP over TCP encapsulation per RFC8229.

Most of the input processing is done in the TCP stack rather than
in this patch, similar to how UDP encapsulation is handled.

On the output side, there are two potential levels of indirection.
Firstly all packets are fed through a tasklet in order to avoid
TCP socket lock recursion.  They're then processed directly if
the TCP socket is not owned by user-space.  If it is owned then
we'll place the packet in a queue (tp->encap_out) for processing
when the socket lock is released.

The first outbound packet will trigger a socket lookup for a
matching TCP socket.  If the TCP connection drops, we will repeat
the lookup as needed.  The TCP socket is cached in the xfrm state
and is read using RCU.

Note that unlike normal IPsec packets, once we hit a TCP xfrm
state, the xfrm stack is short-circuited and the packet's journey
continues through the TCP stack, after which a new IPsec lookup
will be done.  This is different from how UDP encapsulation is
done.  This means that if you're doing nested IPsec then you
will need to construct the policies with this in mind.  That is,
start with a new policy whenever TCP encapsulation is done.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/xfrm.h    |    7 +
 net/ipv4/esp4.c       |  208 ++++++++++++++++++++++++++++++++++++++++++++++++--
 net/xfrm/xfrm_input.c |   21 +++--
 net/xfrm/xfrm_state.c |    3 
 4 files changed, 228 insertions(+), 11 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index ae35991..3694536 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -180,6 +180,7 @@ struct xfrm_state {
 
 	/* Data for encapsulator */
 	struct xfrm_encap_tmpl	*encap;
+	struct sock __rcu	*encap_sk;
 
 	/* Data for care-of address */
 	xfrm_address_t	*coaddr;
@@ -210,6 +211,9 @@ struct xfrm_state {
 	u32			replay_maxage;
 	u32			replay_maxdiff;
 
+	/* Copy of encap_type from encap to avoid locking. */
+	u16			encap_type;
+
 	/* Replay detection notification timer */
 	struct timer_list	rtimer;
 
@@ -1570,6 +1574,9 @@ struct xfrmk_spdinfo {
 int xfrm_prepare_input(struct xfrm_state *x, struct sk_buff *skb);
 int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type);
 int xfrm_input_resume(struct sk_buff *skb, int nexthdr);
+int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb,
+			 int (*finish)(struct net *, struct sock *,
+				       struct sk_buff *));
 int xfrm_trans_queue(struct sk_buff *skb,
 		     int (*finish)(struct net *, struct sock *,
 				   struct sk_buff *));
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 61fe6e4..0544e4e 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -9,13 +9,16 @@
 #include <net/esp.h>
 #include <linux/scatterlist.h>
 #include <linux/kernel.h>
+#include <linux/netdevice.h>
 #include <linux/pfkeyv2.h>
+#include <linux/rcupdate.h>
 #include <linux/rtnetlink.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/in6.h>
 #include <net/icmp.h>
 #include <net/protocol.h>
+#include <net/tcp.h>
 #include <net/udp.h>
 
 #include <linux/highmem.h>
@@ -30,6 +33,11 @@ struct esp_output_extra {
 	u32 esphoff;
 };
 
+struct esp_tcp_sk {
+	struct sock *sk;
+	struct rcu_head rcu;
+};
+
 #define ESP_SKB_CB(__skb) ((struct esp_skb_cb *)&((__skb)->cb[0]))
 
 static u32 esp4_get_mtu(struct xfrm_state *x, int mtu);
@@ -118,6 +126,143 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
 			put_page(sg_page(sg));
 }
 
+static void esp_free_tcp_sk(struct rcu_head *head)
+{
+	struct esp_tcp_sk *esk = container_of(head, struct esp_tcp_sk, rcu);
+
+	sock_put(esk->sk);
+	kfree(esk);
+}
+
+static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
+{
+	struct xfrm_encap_tmpl *encap = x->encap;
+	struct esp_tcp_sk *esk;
+	__be16 sport, dport;
+	struct sock *nsk;
+	struct sock *sk;
+
+	sk = rcu_dereference(x->encap_sk);
+	if (sk && sk->sk_state == TCP_ESTABLISHED)
+		return sk;
+
+	spin_lock_bh(&x->lock);
+	sport = encap->encap_sport;
+	dport = encap->encap_dport;
+	nsk = rcu_dereference_protected(x->encap_sk,
+					lockdep_is_held(&x->lock));
+	if (sk && sk == nsk) {
+		esk = kmalloc(sizeof(*esk), GFP_ATOMIC);
+		if (!esk) {
+			spin_unlock_bh(&x->lock);
+			return ERR_PTR(-ENOMEM);
+		}
+		RCU_INIT_POINTER(x->encap_sk, NULL);
+		esk->sk = sk;
+		call_rcu(&esk->rcu, esp_free_tcp_sk);
+	}
+	spin_unlock_bh(&x->lock);
+
+	/* XXX We don't support bound_dev_if. */
+	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
+				     dport, x->props.saddr.a4, sport, 0);
+
+	if (!sk)
+		return ERR_PTR(-ENOENT);
+
+	if (!tcp_sk(sk)->encap) {
+		sock_put(sk);
+		return ERR_PTR(-EINVAL);
+	}
+
+	spin_lock_bh(&x->lock);
+	nsk = rcu_dereference_protected(x->encap_sk,
+					lockdep_is_held(&x->lock));
+	if (encap->encap_sport != sport ||
+	    encap->encap_dport != dport) {
+		sock_put(sk);
+		sk = nsk ?: ERR_PTR(-EREMCHG);
+	} else if (sk == nsk)
+		sock_put(sk);
+	else
+		rcu_assign_pointer(x->encap_sk, sk);
+	spin_unlock_bh(&x->lock);
+
+	return sk;
+}
+
+static int esp_output_tcp_encap2(struct xfrm_state *x, struct sk_buff *skb)
+{
+	struct tcp_sock *tp;
+	struct sock *sk;
+	int err;
+
+	rcu_read_lock();
+
+	sk = esp_find_tcp_sk(x);
+	err = PTR_ERR(sk);
+	if (IS_ERR(sk))
+		goto out;
+
+	err = -ENOBUFS;
+	bh_lock_sock(sk);
+	if (sock_owned_by_user(sk)) {
+		tp = tcp_sk(sk);
+		if (skb_queue_len(&tp->encap_out) >= netdev_max_backlog)
+			goto unlock_sock;
+
+		__skb_queue_tail(&tp->encap_out, skb);
+		set_bit(TCP_ESP_DEFERRED, &sk->sk_tsq_flags);
+
+		err = 0;
+		goto unlock_sock;
+	}
+
+	err = tcp_encap_output(sk, skb);
+
+unlock_sock:
+	bh_unlock_sock(sk);
+
+out:
+	rcu_read_unlock();
+
+	return err;
+}
+
+static int esp_output_tcp_encap_cb(struct net *net, struct sock *sk,
+				   struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct xfrm_state *x = dst->xfrm;
+	int err;
+
+	err = esp_output_tcp_encap2(x, skb);
+
+	if (err)
+		xfrm_output_resume(skb, err);
+
+	return 0;
+}
+
+static int esp_output_tcp_encap(struct xfrm_state *x, struct sk_buff *skb)
+{
+	int err;
+
+	if (x->encap_type != TCP_ENCAP_ESPINTCP)
+		return 0;
+
+	/* Batch packets in interrupt mode to prevent TCP encap nesting. */
+	preempt_disable();
+	err = xfrm_trans_queue_net(xs_net(x), skb, esp_output_tcp_encap_cb);
+	preempt_enable();
+
+	/* EINPROGRESS just happens to do the right thing.  It
+	 * actually means that the skb has been consumed and
+	 * isn't coming back.
+	 */
+	return err ?: -EINPROGRESS;
+}
+
 static void esp_output_done(struct crypto_async_request *base, int err)
 {
 	struct sk_buff *skb = base->data;
@@ -128,6 +273,13 @@ static void esp_output_done(struct crypto_async_request *base, int err)
 	tmp = ESP_SKB_CB(skb)->tmp;
 	esp_ssg_unref(x, tmp);
 	kfree(tmp);
+
+	if (!err) {
+		err = esp_output_tcp_encap(x, skb);
+		if (err == -EINPROGRESS)
+			return;
+	}
+
 	xfrm_output_resume(skb, err);
 }
 
@@ -205,7 +357,8 @@ static void esp_output_fill_trailer(u8 *tail, int tfclen, int plen, __u8 proto)
 	tail[plen - 1] = proto;
 }
 
-static void esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *esp)
+static void esp_output_encap(struct xfrm_state *x, struct sk_buff *skb,
+			     struct esp_info *esp)
 {
 	int encap_type;
 	struct udphdr *uh;
@@ -213,6 +366,9 @@ static void esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, stru
 	__be16 sport, dport;
 	struct xfrm_encap_tmpl *encap = x->encap;
 	struct ip_esp_hdr *esph = esp->esph;
+	unsigned len;
+
+	len = skb->len + esp->tailen - skb_transport_offset(skb);
 
 	spin_lock_bh(&x->lock);
 	sport = encap->encap_sport;
@@ -220,6 +376,14 @@ static void esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, stru
 	encap_type = encap->encap_type;
 	spin_unlock_bh(&x->lock);
 
+	if (encap_type == TCP_ENCAP_ESPINTCP) {
+		__be16 *lenp = (void *)esph;
+
+		*lenp = htons(len);
+		esph = (struct ip_esp_hdr *)(lenp + 1);
+		goto out;
+	}
+
 	uh = (struct udphdr *)esph;
 	uh->source = sport;
 	uh->dest = dport;
@@ -240,6 +404,8 @@ static void esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, stru
 	}
 
 	*skb_mac_header(skb) = IPPROTO_UDP;
+
+out:
 	esp->esph = esph;
 }
 
@@ -253,9 +419,8 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 	struct sk_buff *trailer;
 	int tailen = esp->tailen;
 
-	/* this is non-NULL only with UDP Encapsulation */
 	if (x->encap)
-		esp_output_udp_encap(x, skb, esp);
+		esp_output_encap(x, skb, esp);
 
 	if (!skb_cloned(skb)) {
 		if (tailen <= skb_tailroom(skb)) {
@@ -447,7 +612,7 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 error_free:
 	kfree(tmp);
 error:
-	return err;
+	return err ?: esp_output_tcp_encap(x, skb);
 }
 EXPORT_SYMBOL_GPL(esp_output_tail);
 
@@ -570,7 +735,19 @@ int esp_input_done2(struct sk_buff *skb, int err)
 
 	if (x->encap) {
 		struct xfrm_encap_tmpl *encap = x->encap;
+		struct tcphdr *th = (void *)(skb_network_header(skb) + ihl);
 		struct udphdr *uh = (void *)(skb_network_header(skb) + ihl);
+		__be16 source;
+
+		switch (x->encap_type) {
+		case TCP_ENCAP_ESPINTCP:
+			source = th->source;
+			break;
+
+		default:
+			source = uh->source;
+			break;
+		}
 
 		/*
 		 * 1) if the NAT-T peer's IP or port changed then
@@ -579,11 +756,11 @@ int esp_input_done2(struct sk_buff *skb, int err)
 		 *    SRC ports.
 		 */
 		if (iph->saddr != x->props.saddr.a4 ||
-		    uh->source != encap->encap_sport) {
+		    source != encap->encap_sport) {
 			xfrm_address_t ipaddr;
 
 			ipaddr.a4 = iph->saddr;
-			km_new_mapping(x, &ipaddr, uh->source);
+			km_new_mapping(x, &ipaddr, source);
 
 			/* XXX: perhaps add an extra
 			 * policy check here, to see
@@ -762,6 +939,7 @@ static u32 esp4_get_mtu(struct xfrm_state *x, int mtu)
 	struct crypto_aead *aead = x->data;
 	u32 blksize = ALIGN(crypto_aead_blocksize(aead), 4);
 	unsigned int net_adj;
+	unsigned int props;
 
 	switch (x->props.mode) {
 	case XFRM_MODE_TRANSPORT:
@@ -775,6 +953,20 @@ static u32 esp4_get_mtu(struct xfrm_state *x, int mtu)
 		BUG();
 	}
 
+	props = x->props.header_len;
+
+	if (x->encap_type == TCP_ENCAP_ESPINTCP) {
+		struct sock *sk;
+
+		rcu_read_lock();
+
+		sk = esp_find_tcp_sk(x);
+		if (!IS_ERR(sk))
+			mtu = tcp_current_mss(sk) + sizeof(struct iphdr);
+
+		rcu_read_unlock();
+	}
+
 	return ((mtu - x->props.header_len - crypto_aead_authsize(aead) -
 		 net_adj) & ~(blksize - 1)) + net_adj - 2;
 }
@@ -979,6 +1171,8 @@ static int esp_init_state(struct xfrm_state *x)
 	if (x->encap) {
 		struct xfrm_encap_tmpl *encap = x->encap;
 
+		x->encap_type = encap->encap_type;
+
 		switch (encap->encap_type) {
 		default:
 			err = -EINVAL;
@@ -989,6 +1183,8 @@ static int esp_init_state(struct xfrm_state *x)
 		case UDP_ENCAP_ESPINUDP_NON_IKE:
 			x->props.header_len += sizeof(struct udphdr) + 2 * sizeof(u32);
 			break;
+		case TCP_ENCAP_ESPINTCP:
+			x->props.header_len += 2;
 		}
 	}
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 444fa37..1eb0bba 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -27,6 +27,7 @@ struct xfrm_trans_tasklet {
 
 struct xfrm_trans_cb {
 	int (*finish)(struct net *net, struct sock *sk, struct sk_buff *skb);
+	struct net *net;
 };
 
 #define XFRM_TRANS_SKB_CB(__skb) ((struct xfrm_trans_cb *)&((__skb)->cb[0]))
@@ -493,12 +494,13 @@ static void xfrm_trans_reinject(unsigned long data)
 	skb_queue_splice_init(&trans->queue, &queue);
 
 	while ((skb = __skb_dequeue(&queue)))
-		XFRM_TRANS_SKB_CB(skb)->finish(dev_net(skb->dev), NULL, skb);
+		XFRM_TRANS_SKB_CB(skb)->finish(XFRM_TRANS_SKB_CB(skb)->net,
+					       NULL, skb);
 }
 
-int xfrm_trans_queue(struct sk_buff *skb,
-		     int (*finish)(struct net *, struct sock *,
-				   struct sk_buff *))
+int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb,
+			 int (*finish)(struct net *, struct sock *,
+				       struct sk_buff *))
 {
 	struct xfrm_trans_tasklet *trans;
 
@@ -508,10 +510,19 @@ int xfrm_trans_queue(struct sk_buff *skb,
 		return -ENOBUFS;
 
 	XFRM_TRANS_SKB_CB(skb)->finish = finish;
-	skb_queue_tail(&trans->queue, skb);
+	XFRM_TRANS_SKB_CB(skb)->net = net;
+	__skb_queue_tail(&trans->queue, skb);
 	tasklet_schedule(&trans->tasklet);
 	return 0;
 }
+EXPORT_SYMBOL(xfrm_trans_queue_net);
+
+int xfrm_trans_queue(struct sk_buff *skb,
+		     int (*finish)(struct net *, struct sock *,
+				   struct sk_buff *))
+{
+	return xfrm_trans_queue_net(dev_net(skb->dev), skb, finish);
+}
 EXPORT_SYMBOL(xfrm_trans_queue);
 
 void __init xfrm_input_init(void)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 065d896..7b01d24 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -617,6 +617,9 @@ int __xfrm_state_delete(struct xfrm_state *x)
 		net->xfrm.state_num--;
 		spin_unlock(&net->xfrm.xfrm_state_lock);
 
+		if (x->encap_sk)
+			sock_put(rcu_dereference_raw(x->encap_sk));
+
 		xfrm_dev_state_delete(x);
 
 		/* All xfrm_state objects are created by xfrm_state_alloc.


* Re: [PATCH 2/3] tcp: Add ESP encapsulation support
  2018-01-11 13:21 ` [PATCH 2/3] tcp: Add ESP encapsulation support Herbert Xu
@ 2018-01-12 16:38   ` Eric Dumazet
  2018-01-16 10:28     ` Steffen Klassert
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2018-01-12 16:38 UTC (permalink / raw)
  To: Herbert Xu, Steffen Klassert, netdev

On Fri, 2018-01-12 at 00:21 +1100, Herbert Xu wrote:
> This patch adds the plumbing in TCP for ESP encapsulation support
> per RFC8229.
> 
> The patch mostly deals with inbound processing, as well as enabling
> TCP encapsulation on a socket through setsockopt.  The outbound
> processing is dealt with in the ESP code as is done for UDP.
> 
> The inbound processing is split into two halves.  First of all,
> the softirq path directly intercepts ESP packets and feeds them
> into the IPsec stack.  Most of the time the packet will be freed
> right away if it contains complete ESP packets.  However, if
> the message is incomplete or it contains non-ESP data, then the
> skb will be added to the receive queue.  We also add packets to
> the receive queue if it is currently non-empty, in order to
> preserve sequence number continuity and minimise the changes
> to the TCP code.
> 
> On the user-space facing side, packets marked as ESP-only are
> skipped and not visible to user-space.  However, some ESP data
> may seep through.  For example, if we receive a partial message
> then we will always give it to user-space regardless of whether
> it turns out to be ESP or not.  So user-space should be prepared
> to skip ESP messages (SPI != 0).
> 
> There is a little bit of code dealing with the encapsulation side.
> In particular, if encapsulation data comes in while the socket
> is owned by user-space, the packets will be stored in tp->encap_out
> and processed during release_sock.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---
> 
>  include/linux/tcp.h      |   15 ++
>  include/net/tcp.h        |   27 +++
>  include/uapi/linux/tcp.h |    1 
>  include/uapi/linux/udp.h |    1 
>  net/ipv4/tcp.c           |   68 +++++++++
>  net/ipv4/tcp_input.c     |  326 +++++++++++++++++++++++++++++++++++++++++++++--
>  net/ipv4/tcp_ipv4.c      |    1 
>  net/ipv4/tcp_output.c    |   48 ++++++
>  8 files changed, 473 insertions(+), 14 deletions(-)
> 

Ouch...

Is there any chance this can be done with almost no change in TCP
stack, using a layer model ? ( net/kcm comes to mind )

NFS uses TCP sockets, but does not invade TCP stack either.

I believe Christoph Paasch sent a patch series during holidays trying
to cleanup the MD5 mess (I had no time reviewing it, sorry)


* Re: [PATCH 2/3] tcp: Add ESP encapsulation support
  2018-01-12 16:38   ` Eric Dumazet
@ 2018-01-16 10:28     ` Steffen Klassert
  2018-01-18  3:49       ` Herbert Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Steffen Klassert @ 2018-01-16 10:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, netdev

On Fri, Jan 12, 2018 at 08:38:01AM -0800, Eric Dumazet wrote:
> On Fri, 2018-01-12 at 00:21 +1100, Herbert Xu wrote:
> > This patch adds the plumbing in TCP for ESP encapsulation support
> > per RFC8229.
> > 
> > The patch mostly deals with inbound processing, as well as enabling
> > TCP encapsulation on a socket through setsockopt.  The outbound
> > processing is dealt with in the ESP code as is done for UDP.
> > 
> > The inbound processing is split into two halves.  First of all,
> > the softirq path directly intercepts ESP packets and feeds them
> > into the IPsec stack.  Most of the time the packet will be freed
> > right away if it contains complete ESP packets.  However, if
> > the message is incomplete or it contains non-ESP data, then the
> > skb will be added to the receive queue.  We also add packets to
> > the receive queue if it is currently non-empty, in order to
> > preserve sequence number continuity and minimise the changes
> > to the TCP code.
> > 
> > On the user-space facing side, packets marked as ESP-only are
> > skipped and not visible to user-space.  However, some ESP data
> > may seep through.  For example, if we receive a partial message
> > then we will always give it to user-space regardless of whether
> > it turns out to be ESP or not.  So user-space should be prepared
> > to skip ESP messages (SPI != 0).
> > 
> > There is a little bit of code dealing with the encapsulation side.
> > In particular, if encapsulation data comes in while the socket
> > is owned by user-space, the packets will be stored in tp->encap_out
> > and processed during release_sock.
> > 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > ---
> > 
> >  include/linux/tcp.h      |   15 ++
> >  include/net/tcp.h        |   27 +++
> >  include/uapi/linux/tcp.h |    1 
> >  include/uapi/linux/udp.h |    1 
> >  net/ipv4/tcp.c           |   68 +++++++++
> >  net/ipv4/tcp_input.c     |  326 +++++++++++++++++++++++++++++++++++++++++++++--
> >  net/ipv4/tcp_ipv4.c      |    1 
> >  net/ipv4/tcp_output.c    |   48 ++++++
> >  8 files changed, 473 insertions(+), 14 deletions(-)
> > 
> 
> Ouch...
> 
> Is there any chance this can be done with almost no change in TCP
> stack, using a layer model ? ( net/kcm comes to mind )

Herbert, would this be an option or is this not possible?

Thanks!


* Re: [PATCH 2/3] tcp: Add ESP encapsulation support
  2018-01-16 10:28     ` Steffen Klassert
@ 2018-01-18  3:49       ` Herbert Xu
  0 siblings, 0 replies; 7+ messages in thread
From: Herbert Xu @ 2018-01-18  3:49 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Eric Dumazet, netdev

On Tue, Jan 16, 2018 at 11:28:23AM +0100, Steffen Klassert wrote:
>
> > Is there any chance this can be done with almost no change in TCP
> > stack, using a layer model ? ( net/kcm comes to mind )
> 
> Herbert, would this be an option or is this not possible?

Yes it can be done.  I'm working on it.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

