* [PATCH RFC 0/2] kproxy: Kernel Proxy
@ 2017-06-29 18:27 Tom Herbert
  2017-06-29 18:27 ` [PATCH RFC 1/2] skbuff: Function to send an skbuf on a socket Tom Herbert
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Tom Herbert @ 2017-06-29 18:27 UTC (permalink / raw)
  To: netdev, davem; +Cc: Tom Herbert

This is raw, minimally tested, and error handling needs work. Posting
as an RFC to get feedback on the design...

Sidecar proxies are becoming quite popular on servers as a means to
perform layer 7 processing on application data as it is sent. Such
sidecars are used for SSL proxies, application firewalls, and L7
load balancers. While these proxies provide nice functionality,
their performance is obviously terrible since all the data needs
to take an extra hop through userspace.

Consider transmitting data on a TCP socket that goes through a
sidecar proxy. The application does a sendmsg in userspace, data
goes into the kernel, back to userspace, and back to the kernel. That is two
trips through TCP TX, one through TCP RX, potentially three copies, three
sockets touched, and three context switches. Using a proxy in the
receive path would have a similarly long path.

	 +--------------+      +------------------+
	 |  Application |      | Proxy            |
	 |              |      |                  |
	 |  sendmsg     |      | recvmsg sendmsg  |
	 +--------------+      +------------------+
	       |                    |       |
               |                    ^       |
---------------V--------------------|-------|--------------
	       |                    |       |
	       +---->--------->-----+       V
            TCP TX              TCP RX    TCP TX
  
The "boomerang" model this employs is quite expensive. This is
even much worse in the case that the proxy is an SSL proxy (e.g.
performing SSL inspection to implement and application firewall).
In this case The application encrypts using TLS, the proxy
immediately decrypts (it knows the key by virtue of have pretended
to be a certificate authority). Subsequently, the proxy re-encrpyts
it again to send. So each byte undergoes three crypto operations in
this path!


This patch set creates an in-kernel proxy (kproxy). The concept is
fairly straightforward: two sockets are joined in the kernel as
a proxy. Proxy functionality will be done by BPF on the data stream;
kTLS is needed to make an SSL proxy. The most prominent ULP for a
proxy is HTTP, so we'll need an HTTP parser to make an in-kernel
HTTP proxy.

	 +--------------+
	 |  Application |
	 |              |
	 |  sendmsg     |
	 +--------------+
	       |
               |
---------------V-------------------------------------------
               |
	       |        +---------------+
               +------->|   Proxy       |---+
			| strparser+BPF |   |
			+---------------+   |
            TCP TX   TCP RX                 |
					    V
					 TCP TX

This patch set implements a very rudimentary kernel proxy; it just
provides an interface to create a proxy between two sockets. Once the
RX and TX paths are done for kTLS it should be straightforward to
enable an in-kernel SSL proxy. Proxy functionality (like application
level filtering) will be implemented by BPF programs set on the
kproxy. This will use strparser to provide message delineation (we'll
need a slight modification to strparser to allow a pass-through mode).
In-kernel layer 7 load balancing is also feasible; in that case we may
want to use a multiplexor structure like KCM (I had considered
overloading KCM for kproxy but decided they are too different).
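
Based on the UAPI in patch 2, expected usage from userspace would look
roughly like the sketch below (the helper name is hypothetical and the
two TCP sockets are assumed to be already connected):

    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <linux/sockios.h>	/* SIOCPROTOPRIVATE */
    #include <linux/kproxy.h>

    #ifndef AF_KPROXY
    #define AF_KPROXY 44	/* from patch 2; not yet in libc headers */
    #endif

    /* Join two connected TCP sockets into an in-kernel proxy. */
    int kproxy_join_fds(int client_fd, int server_fd)
    {
            struct kproxy_join join = {
                    .client_fd = client_fd,
                    .server_fd = server_fd,
            };
            int kfd;

            kfd = socket(AF_KPROXY, SOCK_DGRAM, 0);
            if (kfd < 0)
                    return -1;

            if (ioctl(kfd, SIOCKPROXYJOIN, &join) < 0) {
                    close(kfd);
                    return -1;
            }

            /* Data now flows between the two sockets entirely in the
             * kernel; releasing kfd tears the proxy down.
             */
            return kfd;
    }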

kproxy eliminates the userspace boomerang, but you may notice that even
with kproxy we still have the same number of sockets and still
potentially perform three crypto ops on every byte. I have some ideas
for how to create a "zero proxy" that eliminates these without loss of
the proxy functionality. That would be the subject of a future patch
set.

Tom Herbert (2):
  skbuff: Function to send an skbuf on a socket
  kproxy: Kernel proxy

 include/linux/skbuff.h              |   2 +
 include/linux/socket.h              |   4 +-
 include/net/kproxy.h                |  80 +++++
 include/uapi/linux/kproxy.h         |  30 ++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/skbuff.c                   |  66 ++++
 net/kproxy/Kconfig                  |  10 +
 net/kproxy/Makefile                 |   3 +
 net/kproxy/kproxyproc.c             | 246 +++++++++++++++
 net/kproxy/kproxysock.c             | 605 ++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 13 files changed, 1053 insertions(+), 3 deletions(-)
 create mode 100644 include/net/kproxy.h
 create mode 100644 include/uapi/linux/kproxy.h
 create mode 100644 net/kproxy/Kconfig
 create mode 100644 net/kproxy/Makefile
 create mode 100644 net/kproxy/kproxyproc.c
 create mode 100644 net/kproxy/kproxysock.c

-- 
2.7.4


* [PATCH RFC 1/2] skbuff: Function to send an skbuf on a socket
  2017-06-29 18:27 [PATCH RFC 0/2] kproxy: Kernel Proxy Tom Herbert
@ 2017-06-29 18:27 ` Tom Herbert
  2017-07-03 13:00   ` David Miller
  2017-06-29 18:27 ` [PATCH RFC 2/2] kproxy: Kernel proxy Tom Herbert
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Tom Herbert @ 2017-06-29 18:27 UTC (permalink / raw)
  To: netdev, davem; +Cc: Tom Herbert

Add skb_send_sock to send an skbuff on a socket from within the kernel.
Arguments include an offset so that an skbuff might be sent in multiple
calls (e.g. when the send buffer limit is hit).
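
For context, a caller is expected to drive this in a loop, resuming
from the returned byte count when a send is partial. A hedged sketch
(the helper name is hypothetical; it mirrors the pattern used by the
TX worker in patch 2):

    /* Send an entire skb, resuming at 'sent' bytes after any
     * partial send. Returns skb->len or a negative error.
     */
    static int send_whole_skb(struct sk_buff *skb, struct socket *sock)
    {
            unsigned int sent = 0;
            int n;

            while (sent < skb->len) {
                    n = skb_send_sock(skb, sock, sent);
                    if (n <= 0)
                            return n; /* -EAGAIN: retry on write space */
                    sent += n;
            }

            return sent;
    }
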
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a17e235..83849e3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3115,6 +3115,8 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
 int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int len,
 		    unsigned int flags);
+int skb_send_sock(struct sk_buff *skb, struct socket *sock,
+				  unsigned int offset);
 void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
 int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f75897a..ff9f88d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2005,6 +2005,72 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 }
 EXPORT_SYMBOL_GPL(skb_splice_bits);
 
+/* Send skb data on a socket. */
+int skb_send_sock(struct sk_buff *skb, struct socket *sock, unsigned int offset)
+{
+	unsigned int sent = 0;
+	unsigned int ret;
+	unsigned short fragidx;
+
+	/* Deal with head data */
+	while (offset < skb_headlen(skb)) {
+		size_t len = skb_headlen(skb) - offset;
+		struct kvec kv;
+		struct msghdr msg;
+
+		kv.iov_base = skb->data + offset;
+		kv.iov_len = len;
+		memset(&msg, 0, sizeof(msg));
+
+		ret = kernel_sendmsg(sock, &msg, &kv, 1, len);
+		if (ret < 0)
+			goto error;
+
+		offset += ret;
+		sent += ret;
+	}
+
+	offset -= skb_headlen(skb);
+
+	/* Are there frags? */
+	if (!skb_shinfo(skb)->nr_frags)
+		goto out;
+
+	/* Find where we are in frag list */
+	for (fragidx = 0; fragidx < skb_shinfo(skb)->nr_frags; fragidx++) {
+		skb_frag_t *frag  = &skb_shinfo(skb)->frags[fragidx];
+
+		if (offset < frag->size)
+			break;
+
+		offset -= frag->size;
+	}
+
+	for (; fragidx < skb_shinfo(skb)->nr_frags; fragidx++) {
+		skb_frag_t *frag  = &skb_shinfo(skb)->frags[fragidx];
+
+		ret = kernel_sendpage(sock, frag->page.p,
+				      frag->page_offset + offset,
+				      frag->size - offset,
+				      MSG_DONTWAIT);
+		if (ret < 0)
+			goto error;
+
+		sent += ret;
+		offset = 0;
+	}
+
+out:
+	return sent;
+
+error:
+	if (sent)
+		return sent;
+	else
+		return ret;
+}
+EXPORT_SYMBOL_GPL(skb_send_sock);
+
 /**
  *	skb_store_bits - store bits from kernel buffer to skb
  *	@skb: destination buffer
-- 
2.7.4


* [PATCH RFC 2/2] kproxy: Kernel proxy
  2017-06-29 18:27 [PATCH RFC 0/2] kproxy: Kernel Proxy Tom Herbert
  2017-06-29 18:27 ` [PATCH RFC 1/2] skbuff: Function to send an skbuf on a socket Tom Herbert
@ 2017-06-29 18:27 ` Tom Herbert
  2017-07-03 13:01   ` David Miller
  2017-06-29 19:54 ` [PATCH RFC 0/2] kproxy: Kernel Proxy Willy Tarreau
  2017-06-29 22:04 ` Thomas Graf
  3 siblings, 1 reply; 13+ messages in thread
From: Tom Herbert @ 2017-06-29 18:27 UTC (permalink / raw)
  To: netdev, davem; +Cc: Tom Herbert

Implement an in-kernel proxy.

This patch defines a new address family for creating a kproxy socket
(AF_KPROXY). Two IOCTL operations are defined: SIOCKPROXYJOIN and
SIOCKPROXYUNJOIN. The first takes two (TCP) sockets and creates
a proxy between them; the latter destroys the proxy. When the
proxy is established, packets received on one socket are sent on
the other. Proxy functionality (e.g. application layer filtering)
will be implemented via BPF programs attached to the proxy.

A proc file (/proc/net/kproxy) is created to list all the running
kernel proxies and relevant statistics for them.
---
 include/linux/socket.h              |   4 +-
 include/net/kproxy.h                |  80 +++++
 include/uapi/linux/kproxy.h         |  30 ++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/skbuff.c                   |   4 +-
 net/kproxy/Kconfig                  |  10 +
 net/kproxy/Makefile                 |   3 +
 net/kproxy/kproxyproc.c             | 246 +++++++++++++++
 net/kproxy/kproxysock.c             | 605 ++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 12 files changed, 987 insertions(+), 5 deletions(-)
 create mode 100644 include/net/kproxy.h
 create mode 100644 include/uapi/linux/kproxy.h
 create mode 100644 net/kproxy/Kconfig
 create mode 100644 net/kproxy/Makefile
 create mode 100644 net/kproxy/kproxyproc.c
 create mode 100644 net/kproxy/kproxysock.c

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8b13db5..50b8ccf 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -206,8 +206,9 @@ struct ucred {
 				 * PF_SMC protocol family that
 				 * reuses AF_INET address family
 				 */
+#define AF_KPROXY	44	/* Kernel Proxy */
 
-#define AF_MAX		44	/* For now.. */
+#define AF_MAX		45	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -257,6 +258,7 @@ struct ucred {
 #define PF_QIPCRTR	AF_QIPCRTR
 #define PF_SMC		AF_SMC
 #define PF_MAX		AF_MAX
+#define PF_KPROXY	AF_KPROXY
 
 /* Maximum queue length specifiable by listen.  */
 #define SOMAXCONN	128
diff --git a/include/net/kproxy.h b/include/net/kproxy.h
new file mode 100644
index 0000000..237f275
--- /dev/null
+++ b/include/net/kproxy.h
@@ -0,0 +1,80 @@
+/*
+ * Kernel Proxy
+ *
+ * Copyright (c) 2017 Tom Herbert <tom@quantonium.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#ifndef __NET_KPROXY_H_
+#define __NET_KPROXY_H_
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <uapi/linux/kproxy.h>
+
+extern unsigned int kproxy_net_id;
+
+struct kproxy_stats {
+	unsigned long long tx_bytes;
+	unsigned long long rx_bytes;
+};
+
+struct kproxy_psock {
+	struct sk_buff_head rxqueue;
+	unsigned int queue_hiwat;
+	unsigned int queue_lowat;
+	unsigned int produced;
+	unsigned int consumed;
+
+	struct kproxy_stats stats;
+
+	int save_sent;
+	struct sk_buff *save_skb;
+
+	u32 tx_stopped : 1;
+	int deferred_err;
+
+	struct socket *sock;
+	struct kproxy_psock *peer;
+	struct work_struct tx_work;
+	struct work_struct rx_work;
+
+	void (*save_data_ready)(struct sock *sk);
+	void (*save_write_space)(struct sock *sk);
+	void (*save_state_change)(struct sock *sk);
+};
+
+struct kproxy_sock {
+	struct sock sk;
+
+	u32 running : 1;
+
+	struct list_head kproxy_list;
+
+	struct kproxy_psock client_sock;
+	struct kproxy_psock server_sock;
+};
+
+struct kproxy_net {
+	struct mutex mutex;
+	struct list_head kproxy_list;
+	int count;
+};
+
+static inline unsigned int kproxy_enqueued(struct kproxy_psock *psock)
+{
+		return psock->produced - psock->peer->consumed;
+}
+
+#ifdef CONFIG_PROC_FS
+int kproxy_proc_init(void);
+void kproxy_proc_exit(void);
+#else
+static inline int kproxy_proc_init(void) { return 0; }
+static inline void kproxy_proc_exit(void) { }
+#endif
+
+#endif /* __NET_KPROXY_H_ */
diff --git a/include/uapi/linux/kproxy.h b/include/uapi/linux/kproxy.h
new file mode 100644
index 0000000..d9a3e9c
--- /dev/null
+++ b/include/uapi/linux/kproxy.h
@@ -0,0 +1,30 @@
+/*
+ * Kernel Proxy
+ *
+ * Copyright (c) 2017 Tom Herbert <tom@herbertland.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * User API to create in-kernel proxies.
+ */
+
+#ifndef KPROXYKERNEL_H
+#define KPROXYKERNEL_H
+
+struct kproxy_join {
+	int client_fd;
+	int server_fd;
+};
+
+struct kproxy_unjoin {
+	int client_fd;
+	int server_fd;
+};
+
+#define SIOCKPROXYJOIN		(SIOCPROTOPRIVATE + 0)
+#define SIOCKPROXYUNJOIN	(SIOCPROTOPRIVATE + 1)
+
+#endif
+
diff --git a/net/Kconfig b/net/Kconfig
index 7d57ef3..1f2263d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -368,6 +368,7 @@ source "net/irda/Kconfig"
 source "net/bluetooth/Kconfig"
 source "net/rxrpc/Kconfig"
 source "net/kcm/Kconfig"
+source "net/kproxy/Kconfig"
 source "net/strparser/Kconfig"
 
 config FIB_RULES
diff --git a/net/Makefile b/net/Makefile
index bed80fa..7f7026d 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_BT)		+= bluetooth/
 obj-$(CONFIG_SUNRPC)		+= sunrpc/
 obj-$(CONFIG_AF_RXRPC)		+= rxrpc/
 obj-$(CONFIG_AF_KCM)		+= kcm/
+obj-$(CONFIG_AF_KPROXY)		+= kproxy/
 obj-$(CONFIG_STREAM_PARSER)	+= strparser/
 obj-$(CONFIG_ATM)		+= atm/
 obj-$(CONFIG_L2TP)		+= l2tp/
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ff9f88d..a3b09e4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2009,7 +2009,7 @@ EXPORT_SYMBOL_GPL(skb_splice_bits);
 int skb_send_sock(struct sk_buff *skb, struct socket *sock, unsigned int offset)
 {
 	unsigned int sent = 0;
-	unsigned int ret;
+	int ret;
 	unsigned short fragidx;
 
 	/* Deal with head data */
@@ -2053,7 +2053,7 @@ int skb_send_sock(struct sk_buff *skb, struct socket *sock, unsigned int offset)
 				      frag->page_offset + offset,
 				      frag->size - offset,
 				      MSG_DONTWAIT);
-		if (ret < 0)
+		if (ret <= 0)
 			goto error;
 
 		sent += ret;
diff --git a/net/kproxy/Kconfig b/net/kproxy/Kconfig
new file mode 100644
index 0000000..4754406
--- /dev/null
+++ b/net/kproxy/Kconfig
@@ -0,0 +1,10 @@
+
+config AF_KPROXY
+	tristate "Kernel proxy"
+	depends on INET
+	select BPF_SYSCALL
+	select STREAM_PARSER
+	---help---
+	  Kernel proxy implements an in kernel TCP proxy for performance.
+	  Filtering is done by BPF on incoming sockets.
+
diff --git a/net/kproxy/Makefile b/net/kproxy/Makefile
new file mode 100644
index 0000000..fdcd3a7
--- /dev/null
+++ b/net/kproxy/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_AF_KPROXY) += kproxy.o
+
+kproxy-y := kproxysock.o kproxyproc.o
diff --git a/net/kproxy/kproxyproc.c b/net/kproxy/kproxyproc.c
new file mode 100644
index 0000000..66699d5
--- /dev/null
+++ b/net/kproxy/kproxyproc.c
@@ -0,0 +1,246 @@
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/proc_fs.h>
+#include <linux/rculist.h>
+#include <linux/seq_file.h>
+#include <linux/socket.h>
+#include <net/inet_sock.h>
+#include <net/kproxy.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#ifdef CONFIG_PROC_FS
+struct kproxy_seq_proxyinfo {
+	char				*name;
+	const struct file_operations	*seq_fops;
+	const struct seq_operations	seq_ops;
+};
+
+static struct kproxy_sock *kproxy_get_first(struct seq_file *seq)
+{
+	struct net *net = seq_file_net(seq);
+	struct kproxy_net *knet = net_generic(net, kproxy_net_id);
+
+	return list_first_or_null_rcu(&knet->kproxy_list,
+				      struct kproxy_sock, kproxy_list);
+}
+
+static struct kproxy_sock *kproxy_get_next(struct kproxy_sock *kproxy)
+{
+	struct kproxy_net *knet = net_generic(sock_net(&kproxy->sk),
+					      kproxy_net_id);
+
+	return list_next_or_null_rcu(&knet->kproxy_list,
+				     &kproxy->kproxy_list,
+				     struct kproxy_sock, kproxy_list);
+}
+
+static struct kproxy_sock *kproxy_get_idx(struct seq_file *seq, loff_t pos)
+{
+	struct net *net = seq_file_net(seq);
+	struct kproxy_net *knet = net_generic(net, kproxy_net_id);
+	struct kproxy_sock *ksock;
+
+	list_for_each_entry_rcu(ksock, &knet->kproxy_list, kproxy_list) {
+		if (!pos)
+			return ksock;
+		--pos;
+	}
+	return NULL;
+}
+
+static void *kproxy_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	void *p;
+
+	if (v == SEQ_START_TOKEN)
+		p = kproxy_get_first(seq);
+	else
+		p = kproxy_get_next(v);
+	++*pos;
+	return p;
+}
+
+static void *kproxy_seq_start(struct seq_file *seq, loff_t *pos)
+	__acquires(rcu)
+{
+	rcu_read_lock();
+
+	if (!*pos)
+		return SEQ_START_TOKEN;
+	else
+		return kproxy_get_idx(seq, *pos - 1);
+}
+
+static void kproxy_seq_stop(struct seq_file *seq, void *v)
+	__releases(rcu)
+{
+	rcu_read_unlock();
+}
+
+struct kproxy_proc_proxy_state {
+	struct seq_net_private p;
+	int idx;
+};
+
+static int kproxy_seq_open(struct inode *inode, struct file *file)
+{
+	struct kproxy_seq_proxyinfo *proxyinfo = PDE_DATA(inode);
+
+	return seq_open_net(inode, file, &proxyinfo->seq_ops,
+			   sizeof(struct kproxy_proc_proxy_state));
+}
+
+static void kproxy_format_proxy_header(struct seq_file *seq)
+{
+	struct net *net = seq_file_net(seq);
+	struct kproxy_net *knet = net_generic(net, kproxy_net_id);
+
+	seq_printf(seq, "*** kProxy **** (%d proxies)\n",
+		   knet->count);
+
+	seq_printf(seq,
+		   "%-16s %-16s %-10s %-16s %-16s %-10s %s\n",
+		   "Client-RX-bytes",
+		   "Server-TX-bytes",
+		   "ClientQ",
+		   "Server-RX-bytes",
+		   "Client-TX-bytes",
+		   "ServerQ",
+		   "Addresses"
+	);
+}
+
+static void kproxy_format_addresses(struct seq_file *seq,
+				    struct sock *sk)
+{
+	switch (sk->sk_family) {
+	case AF_INET: {
+		struct inet_sock *inet = inet_sk(sk);
+
+		seq_printf(seq, "%pI4:%u->%pI4:%u",
+			   &inet->inet_saddr, ntohs(inet->inet_sport),
+			   &inet->inet_daddr, ntohs(inet->inet_dport));
+		break;
+	}
+	case AF_INET6: {
+		struct inet_sock *inet = (struct inet_sock *)sk;
+
+		seq_printf(seq, "%pI6:%u->%pI6:%u",
+			   &sk->sk_v6_rcv_saddr, ntohs(inet->inet_sport),
+			   &sk->sk_v6_daddr, ntohs(inet->inet_dport));
+		break;
+	}
+	default:
+		seq_puts(seq, "Unknown-family");
+	}
+}
+
+static void kproxy_format_proxy(struct kproxy_sock *ksock,
+				struct seq_file *seq)
+{
+	seq_printf(seq, "%-16llu %-16llu %-10u %-16llu %-16llu %-10u",
+		   ksock->client_sock.stats.rx_bytes,
+		   ksock->server_sock.stats.tx_bytes,
+		   kproxy_enqueued(&ksock->client_sock),
+		   ksock->server_sock.stats.rx_bytes,
+		   ksock->client_sock.stats.tx_bytes,
+		   kproxy_enqueued(&ksock->server_sock));
+
+	kproxy_format_addresses(seq, ksock->client_sock.sock->sk);
+	seq_puts(seq, " ");
+	kproxy_format_addresses(seq, ksock->server_sock.sock->sk);
+
+	seq_puts(seq, "\n");
+}
+
+static int kproxy_seq_show(struct seq_file *seq, void *v)
+{
+	struct kproxy_proc_proxy_state *proxy_state;
+
+	proxy_state = seq->private;
+	if (v == SEQ_START_TOKEN) {
+		proxy_state->idx = 0;
+		kproxy_format_proxy_header(seq);
+	} else {
+		kproxy_format_proxy(v, seq);
+		proxy_state->idx++;
+	}
+	return 0;
+}
+
+static const struct file_operations kproxy_seq_fops = {
+	.owner		= THIS_MODULE,
+	.open		= kproxy_seq_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release_net,
+};
+
+static struct kproxy_seq_proxyinfo kproxy_seq_proxyinfo = {
+	.name		= "kproxy",
+	.seq_fops	= &kproxy_seq_fops,
+	.seq_ops	= {
+		.show	= kproxy_seq_show,
+		.start	= kproxy_seq_start,
+		.next	= kproxy_seq_next,
+		.stop	= kproxy_seq_stop,
+	}
+};
+
+static int kproxy_proc_register(struct net *net,
+				struct kproxy_seq_proxyinfo *proxyinfo)
+{
+	struct proc_dir_entry *p;
+	int rc = 0;
+
+	p = proc_create_data(proxyinfo->name, 0444, net->proc_net,
+			     proxyinfo->seq_fops, proxyinfo);
+	if (!p)
+		rc = -ENOMEM;
+	return rc;
+}
+EXPORT_SYMBOL(kproxy_proc_register);
+
+static void kproxy_proc_unregister(struct net *net,
+				   struct kproxy_seq_proxyinfo *proxyinfo)
+{
+	remove_proc_entry(proxyinfo->name, net->proc_net);
+}
+EXPORT_SYMBOL(kproxy_proc_unregister);
+
+static int kproxy_proc_init_net(struct net *net)
+{
+	int err;
+
+	err = kproxy_proc_register(net, &kproxy_seq_proxyinfo);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+static void kproxy_proc_exit_net(struct net *net)
+{
+	kproxy_proc_unregister(net, &kproxy_seq_proxyinfo);
+}
+
+static struct pernet_operations kproxy_net_ops = {
+	.init = kproxy_proc_init_net,
+	.exit = kproxy_proc_exit_net,
+};
+
+int __init kproxy_proc_init(void)
+{
+	return register_pernet_subsys(&kproxy_net_ops);
+}
+
+void __exit kproxy_proc_exit(void)
+{
+	unregister_pernet_subsys(&kproxy_net_ops);
+}
+
+#endif /* CONFIG_PROC_FS */
diff --git a/net/kproxy/kproxysock.c b/net/kproxy/kproxysock.c
new file mode 100644
index 0000000..1e71d51
--- /dev/null
+++ b/net/kproxy/kproxysock.c
@@ -0,0 +1,605 @@
+/*
+ * Kernel Proxy
+ *
+ * Copyright (c) 2017 Tom Herbert <tom@quantonium.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/in.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/rculist.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/workqueue.h>
+#include <net/kproxy.h>
+#include <net/netns/generic.h>
+#include <net/sock.h>
+#include <uapi/linux/kproxy.h>
+
+unsigned int kproxy_net_id;
+
+static inline struct kproxy_sock *kproxy_sk(const struct sock *sk)
+{
+	return (struct kproxy_sock *)sk;
+}
+
+static inline struct kproxy_psock *kproxy_psock_sk(const struct sock *sk)
+{
+	return (struct kproxy_psock *)sk->sk_user_data;
+}
+
+static void kproxy_report_sk_error(struct kproxy_psock *psock, int err,
+				   bool hard_report)
+{
+	struct sock *sk = psock->sock->sk;
+
+	/* Check if we still have stuff in receive queue that we might be
+	 * able to finish sending on the peer socket. Hold off on reporting
+	 * the error if there is data queued and not a hard error.
+	 */
+	if (hard_report || !kproxy_enqueued(psock)) {
+		sk->sk_err = err;
+		sk->sk_error_report(sk);
+	} else {
+		psock->deferred_err = err;
+	}
+}
+
+static void kproxy_report_deferred_error(struct kproxy_psock *psock)
+{
+	struct sock *sk = psock->sock->sk;
+
+	sk->sk_err = psock->deferred_err;
+	sk->sk_error_report(sk);
+}
+
+static void kproxy_state_change(struct sock *sk)
+{
+	kproxy_report_sk_error(kproxy_psock_sk(sk), EPIPE, false);
+}
+
+void schedule_writer(struct kproxy_psock *psock)
+{
+	schedule_work(&psock->tx_work);
+}
+
+static int kproxy_recv(read_descriptor_t *desc, struct sk_buff *skb,
+		       unsigned int offset, size_t len)
+{
+	struct kproxy_psock *psock = (struct kproxy_psock *)desc->arg.data;
+
+	WARN_ON(len != skb->len - offset);
+
+	/* Check limit of queued data */
+	if (kproxy_enqueued(psock) > psock->queue_hiwat)
+		return 0;
+
+	/* Dequeue from lower socket and put skbuffs on an internal
+	 * queue.
+	 */
+
+	/* Always clone since we're consuming the whole skbuff */
+	skb = skb_clone(skb, GFP_ATOMIC);
+	if (!skb) {
+		desc->error = -ENOMEM;
+		return 0;
+	}
+
+	if (unlikely(offset)) {
+		/* Don't expect offsets to be present */
+
+		if (!pskb_pull(skb, offset)) {
+			kfree_skb(skb);
+			desc->error = -ENOMEM;
+			return 0;
+		}
+	}
+
+	psock->produced += skb->len;
+	psock->stats.rx_bytes += skb->len;
+
+	skb_queue_tail(&psock->rxqueue, skb);
+
+	return skb->len;
+}
+
+/* Called with lock held on lower socket */
+static int kproxy_read_sock(struct kproxy_psock *psock)
+{
+	struct socket *sock = psock->sock;
+	read_descriptor_t desc;
+
+	/* Check limit of queued data. If we're over then just
+	 * return. We'll be called again when the write has
+	 * consumed data to below queue_lowat.
+	 */
+	if (kproxy_enqueued(psock) > psock->queue_hiwat)
+		return 0;
+
+	desc.arg.data = psock;
+	desc.error = 0;
+	desc.count = 1; /* give more than one skb per call */
+
+	/* sk should be locked here, so okay to do read_sock */
+	sock->ops->read_sock(sock->sk, &desc, kproxy_recv);
+
+	/* Probably got some data, kick writer side */
+	if (likely(!skb_queue_empty(&psock->rxqueue)))
+		schedule_writer(psock->peer);
+
+	return desc.error;
+}
+
+/* Called with lock held on socket */
+static void kproxy_data_ready(struct sock *sk)
+{
+	struct kproxy_psock *psock = kproxy_psock_sk(sk);
+
+	if (unlikely(!psock))
+		return;
+
+	read_lock_bh(&sk->sk_callback_lock);
+
+	if (kproxy_read_sock(psock) == -ENOMEM)
+		schedule_work(&psock->rx_work);
+
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void check_for_rx_wakeup(struct kproxy_psock *psock,
+				int orig_consumed)
+{
+	int started_with = psock->produced - orig_consumed;
+
+	/* Check if we fell below low watermark with new data that
+	 * was consumed and if so schedule receiver.
+	 */
+	if (started_with > psock->queue_lowat &&
+	    kproxy_enqueued(psock) <= psock->queue_lowat)
+		schedule_work(&psock->peer->rx_work);
+}
+
+static void kproxy_rx_work(struct work_struct *w)
+{
+	struct kproxy_psock *psock = container_of(w, struct kproxy_psock,
+						  rx_work);
+	struct sock *sk = psock->sock->sk;
+
+	lock_sock(sk);
+	if (kproxy_read_sock(psock) == -ENOMEM)
+		schedule_work(&psock->peer->rx_work);
+	release_sock(sk);
+}
+
+/* Perform TX side. This is only called from the workqueue so we
+ * assume mutual exclusion.
+ */
+static void kproxy_tx_work(struct work_struct *w)
+{
+	struct kproxy_psock *psock = container_of(w, struct kproxy_psock,
+						  tx_work);
+	int sent, n;
+	struct sk_buff *skb;
+	int orig_consumed = psock->consumed;
+
+	if (unlikely(psock->tx_stopped))
+		return;
+
+	if (psock->save_skb) {
+		skb = psock->save_skb;
+		sent = psock->save_sent;
+		psock->save_skb = NULL;
+		goto start;
+	}
+
+	while ((skb = skb_dequeue(&psock->peer->rxqueue))) {
+		sent = 0;
+start:
+		do {
+			n = skb_send_sock(skb, psock->sock, sent);
+			if (n <= 0) {
+				if (n == -EAGAIN) {
+					/* Save state to try again when
+					 * there's write space on the
+					 * socket.
+					 */
+					psock->save_skb = skb;
+					psock->save_sent = sent;
+					goto out;
+				}
+
+				/* Got a hard error or socket had
+				 * been closed somehow. Report this
+				 * on the transport socket.
+				 */
+//				kproxy_report_sk_error(psock,
+//						       n ? -n : EPIPE, true);
+				psock->tx_stopped = 1;
+				goto out;
+			}
+			sent += n;
+			psock->consumed += n;
+			psock->stats.tx_bytes += n;
+		} while (sent < skb->len);
+	}
+
+	if (unlikely(psock->peer->deferred_err)) {
+		/* An error had been reported on the peer and
+		 * now the queue has been drained, go ahead
+		 * and report the error.
+		 */
+		kproxy_report_deferred_error(psock->peer);
+	}
+out:
+	check_for_rx_wakeup(psock, orig_consumed);
+}
+
+static void kproxy_write_space(struct sock *sk)
+{
+	struct kproxy_psock *psock = kproxy_psock_sk(sk);
+
+	schedule_writer(psock);
+}
+
+static void kproxy_stop_sock(struct kproxy_psock *psock)
+{
+	struct sock *sk = psock->sock->sk;
+
+	/* Set up callbacks */
+	write_lock_bh(&sk->sk_callback_lock);
+	sk->sk_data_ready = psock->save_data_ready;
+	sk->sk_write_space = psock->save_write_space;
+	sk->sk_state_change = psock->save_state_change;
+	sk->sk_user_data = NULL;
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	/* Shut down the workers. Sequence is important because
+	 * RX and TX can schedule one another.
+	 */
+
+	psock->tx_stopped = 1;
+
+	/* Make sure tx_stopped is committed */
+	smp_mb();
+
+	cancel_work_sync(&psock->tx_work);
+	/* At this point tx_work will just return if scheduled, it will
+	 * not schedule rx_work.
+	 */
+
+	cancel_work_sync(&psock->peer->rx_work);
+	/* rx_work is done */
+
+	cancel_work_sync(&psock->tx_work);
+	/* Just in case rx_work managed to schedule a tx_work after we
+	 * set tx_stopped.
+	 */
+}
+
+static void kproxy_done_psock(struct kproxy_psock *psock)
+{
+	__skb_queue_purge(&psock->rxqueue);
+	sock_put(psock->sock->sk);
+	fput(psock->sock->file);
+	psock->sock = NULL;
+}
+
+static int kproxy_unjoin(struct socket *sock, struct kproxy_unjoin *info)
+{
+	struct kproxy_sock *ksock = kproxy_sk(sock->sk);
+	int err = 0;
+
+	lock_sock(sock->sk);
+
+	if (ksock->running) {
+		err = -EALREADY;
+		goto out;
+	}
+
+	/* Stop proxy activity */
+	kproxy_stop_sock(&ksock->client_sock);
+	kproxy_stop_sock(&ksock->server_sock);
+
+	/* Done with sockets */
+	kproxy_done_psock(&ksock->client_sock);
+	kproxy_done_psock(&ksock->server_sock);
+
+	ksock->running = false;
+
+out:
+	release_sock(sock->sk);
+
+	return err;
+}
+
+static int kproxy_release(struct socket *sock)
+{
+	struct kproxy_net *knet = net_generic(sock_net(sock->sk),
+					      kproxy_net_id);
+	struct sock *sk = sock->sk;
+	struct kproxy_sock *ksock = kproxy_sk(sock->sk);
+
+	if (!sk)
+		goto out;
+
+	sock_orphan(sk);
+
+	if (ksock->running) {
+		struct kproxy_unjoin info;
+
+		memset(&info, 0, sizeof(info));
+		kproxy_unjoin(sock, &info);
+	}
+
+	mutex_lock(&knet->mutex);
+	list_del_rcu(&ksock->kproxy_list);
+	knet->count--;
+	mutex_unlock(&knet->mutex);
+
+	sock->sk = NULL;
+	sock_put(sk);
+
+out:
+	return 0;
+}
+
+static void kproxy_init_sock(struct kproxy_psock *psock,
+			     struct socket *sock,
+			     struct kproxy_psock *peer)
+{
+	skb_queue_head_init(&psock->rxqueue);
+	psock->sock = sock;
+	psock->peer = peer;
+	psock->queue_hiwat = 1000000;
+	psock->queue_lowat = 1000000;
+	INIT_WORK(&psock->tx_work, kproxy_tx_work);
+	INIT_WORK(&psock->rx_work, kproxy_rx_work);
+	sock_hold(sock->sk);
+}
+
+static void kproxy_start_sock(struct kproxy_psock *psock)
+{
+	struct sock *sk = psock->sock->sk;
+
+	/* Set up callbacks */
+	write_lock_bh(&sk->sk_callback_lock);
+	psock->save_data_ready = sk->sk_data_ready;
+	psock->save_write_space = sk->sk_write_space;
+	psock->save_state_change = sk->sk_state_change;
+	sk->sk_user_data = psock;
+	sk->sk_data_ready = kproxy_data_ready;
+	sk->sk_write_space = kproxy_write_space;
+	sk->sk_state_change = kproxy_state_change;
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	schedule_work(&psock->rx_work);
+}
+
+static int kproxy_join(struct socket *sock, struct kproxy_join *info)
+{
+	struct kproxy_sock *ksock = kproxy_sk(sock->sk);
+	struct socket *csock, *ssock;
+	int err;
+
+	csock = sockfd_lookup(info->client_fd, &err);
+	if (!csock)
+		return err;
+
+	ssock = sockfd_lookup(info->server_fd, &err);
+	if (!ssock) {
+		fput(csock->file);
+		return err;
+	}
+
+	err = 0;
+
+	lock_sock(sock->sk);
+
+	if (ksock->running) {
+		err = -EALREADY;
+		goto outerr;
+	}
+
+	kproxy_init_sock(&ksock->client_sock, csock,
+			 &ksock->server_sock);
+	kproxy_init_sock(&ksock->server_sock, ssock,
+			 &ksock->client_sock);
+
+	kproxy_start_sock(&ksock->client_sock);
+	kproxy_start_sock(&ksock->server_sock);
+
+	ksock->running = true;
+
+	release_sock(sock->sk);
+	return 0;
+
+outerr:
+	release_sock(sock->sk);
+	fput(csock->file);
+	fput(ssock->file);
+
+	return err;
+}
+
+static int kproxy_ioctl(struct socket *sock, unsigned int cmd,
+			unsigned long arg)
+{
+	int err;
+
+	switch (cmd) {
+	case SIOCKPROXYJOIN: {
+		struct kproxy_join info;
+
+		if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
+			return -EFAULT;
+
+		err = kproxy_join(sock, &info);
+
+		break;
+	}
+
+	case SIOCKPROXYUNJOIN: {
+		struct kproxy_unjoin info;
+
+		if (copy_from_user(&info, (void __user *)arg, sizeof(info)))
+			return -EFAULT;
+
+		err = kproxy_unjoin(sock, &info);
+
+		break;
+	}
+
+	default:
+		err = -ENOIOCTLCMD;
+		break;
+	}
+
+	return err;
+}
+
+static const struct proto_ops kproxy_dgram_ops = {
+	.family =	PF_KPROXY,
+	.owner =	THIS_MODULE,
+	.release =	kproxy_release,
+	.bind =		sock_no_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname,
+	.poll =		sock_no_poll,
+	.ioctl =	kproxy_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	sock_no_setsockopt,
+	.getsockopt =	sock_no_getsockopt,
+	.sendmsg =	sock_no_sendmsg,
+	.recvmsg =	sock_no_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+static struct proto kproxy_proto = {
+	.name   = "KPROXY",
+	.owner  = THIS_MODULE,
+	.obj_size = sizeof(struct kproxy_sock),
+};
+
+/* Create proto operations for kproxy sockets */
+static int kproxy_create(struct net *net, struct socket *sock,
+			 int protocol, int kern)
+{
+	struct kproxy_net *knet = net_generic(net, kproxy_net_id);
+	struct sock *sk;
+	struct kproxy_sock *ksock;
+
+	switch (sock->type) {
+	case SOCK_DGRAM:
+		sock->ops = &kproxy_dgram_ops;
+		break;
+	default:
+		return -ESOCKTNOSUPPORT;
+	}
+
+	sk = sk_alloc(net, PF_KPROXY, GFP_KERNEL, &kproxy_proto, kern);
+	if (!sk)
+		return -ENOMEM;
+
+	sock_init_data(sock, sk);
+
+	ksock = kproxy_sk(sk);
+
+	mutex_lock(&knet->mutex);
+	list_add_rcu(&ksock->kproxy_list, &knet->kproxy_list);
+	knet->count++;
+	mutex_unlock(&knet->mutex);
+
+	return 0;
+}
+
+static struct net_proto_family kproxy_family_ops = {
+	.family = PF_KPROXY,
+	.create = kproxy_create,
+	.owner  = THIS_MODULE,
+};
+
+static __net_init int kproxy_init_net(struct net *net)
+{
+	struct kproxy_net *knet = net_generic(net, kproxy_net_id);
+
+	INIT_LIST_HEAD_RCU(&knet->kproxy_list);
+	mutex_init(&knet->mutex);
+
+	return 0;
+}
+
+static __net_exit void kproxy_exit_net(struct net *net)
+{
+	struct kproxy_net *knet = net_generic(net, kproxy_net_id);
+
+	/* All kProxy sockets should be closed at this point, which should mean
+	 * that all proxies and psocks have been destroyed.
+	 */
+	WARN_ON(!list_empty(&knet->kproxy_list));
+}
+
+static struct pernet_operations kproxy_net_ops = {
+	.init = kproxy_init_net,
+	.exit = kproxy_exit_net,
+	.id   = &kproxy_net_id,
+	.size = sizeof(struct kproxy_net),
+};
+
+static int __init kproxy_init(void)
+{
+	int err = -ENOMEM;
+
+	err = proto_register(&kproxy_proto, 1);
+	if (err)
+		goto fail;
+
+	err = sock_register(&kproxy_family_ops);
+	if (err)
+		goto sock_register_fail;
+
+	err = register_pernet_device(&kproxy_net_ops);
+	if (err)
+		goto net_ops_fail;
+
+	err = kproxy_proc_init();
+	if (err)
+		goto proc_init_fail;
+
+	return 0;
+
+proc_init_fail:
+	unregister_pernet_device(&kproxy_net_ops);
+net_ops_fail:
+	sock_unregister(PF_KPROXY);
+sock_register_fail:
+	proto_unregister(&kproxy_proto);
+
+fail:
+
+	return err;
+}
+
+static void __exit kproxy_exit(void)
+{
+	kproxy_proc_exit();
+	unregister_pernet_device(&kproxy_net_ops);
+	sock_unregister(PF_KPROXY);
+	proto_unregister(&kproxy_proto);
+}
+
+module_init(kproxy_init);
+module_exit(kproxy_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_KPROXY);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 819fd68..7f4c164 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1404,7 +1404,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
 			return SECCLASS_QIPCRTR_SOCKET;
 		case PF_SMC:
 			return SECCLASS_SMC_SOCKET;
-#if PF_MAX > 44
+		case PF_KPROXY:
+			return SECCLASS_KPROXY_SOCKET;
+#if PF_MAX > 45
 #error New address family defined, please update this function.
 #endif
 		}
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 1e0cc9b..047831e 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -231,9 +231,11 @@ struct security_class_mapping secclass_map[] = {
 	  { COMMON_SOCK_PERMS, NULL } },
 	{ "smc_socket",
 	  { COMMON_SOCK_PERMS, NULL } },
+	{ "kproxy_socket",
+	  { COMMON_SOCK_PERMS, NULL } },
 	{ NULL }
   };
 
-#if PF_MAX > 44
+#if PF_MAX > 45
 #error New address family defined, please update secclass_map.
 #endif
-- 
2.7.4


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 18:27 [PATCH RFC 0/2] kproxy: Kernel Proxy Tom Herbert
  2017-06-29 18:27 ` [PATCH RFC 1/2] skbuff: Function to send an skbuf on a socket Tom Herbert
  2017-06-29 18:27 ` [PATCH RFC 2/2] kproxy: Kernel proxy Tom Herbert
@ 2017-06-29 19:54 ` Willy Tarreau
  2017-06-29 20:40   ` Tom Herbert
  2017-06-29 22:04 ` Thomas Graf
  3 siblings, 1 reply; 13+ messages in thread
From: Willy Tarreau @ 2017-06-29 19:54 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, davem

Hi Tom,

On Thu, Jun 29, 2017 at 11:27:03AM -0700, Tom Herbert wrote:
> Sidecar proxies are becoming quite popular on servers as a means to
> perform layer 7 processing on application data as it is sent. Such
> sidecars are used for SSL proxies, application firewalls, and L7
> load balancers. While these proxies provide nice functionality,
> their performance is obviously terrible since all the data needs
> to take an extra hop through userspace.
> 
> Consider transmitting data on a TCP socket that goes through a
> sidecar proxy. The application does a sendmsg in userspace, data
> goes into the kernel, back to userspace, and back to the kernel. That is two
> trips through TCP TX, one through TCP RX, potentially three copies, three
> sockets touched, and three context switches. Using a proxy in the
> receive path would have a similarly long path.
> 
> 	 +--------------+      +------------------+
> 	 |  Application |      | Proxy            |
> 	 |              |      |                  |
> 	 |  sendmsg     |      | recvmsg sendmsg  |
> 	 +--------------+      +------------------+
> 	       |                    |       |
>                |                    ^       |
> ---------------V--------------------|-------|--------------
> 	       |                    |       |
> 	       +---->--------->-----+       V
>             TCP TX              TCP RX    TCP TX
>   
> The "boomerang" model this employs is quite expensive. This is
> even worse in the case that the proxy is an SSL proxy (e.g.
> performing SSL inspection to implement an application firewall).

In fact that's not really what I observe in the field. In practice, large
data streams are cheaply relayed using splice(), I could achieve
60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
And when you use SSL, the cost of the copy to/from kernel is small
compared to all the crypto operations surrounding this.

Another point is that most HTTP requests are quite small (typically ~80%
are 20kB or less), and in this case the L7 processing and certain syscalls
significantly dominate the operations; data copies are comparatively
small. Simply parsing an HTTP header takes time (when you do it correctly).
You can hardly parse and index more than 800MB-1GB/s of HTTP headers
per core, which limits you to roughly 1-1.2 M req+resp per second for
a 400 byte request and a 400 byte response, and that's without any
processing at all. But when doing this, certain syscalls like connect(),
close() or epoll_ctl() start to be quite expensive. Even splice() is
expensive to forward small data chunks because you need two calls, and
recv+send is faster. In fact our TCP stack has been so much optimized
for realistic workloads over the years that it becomes hard to gain
more by cheating on it :-)
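
For reference, the two splice() calls mentioned here forward through an
intermediate pipe, roughly like this minimal sketch (error handling
omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Forward one chunk: socket -> pipe -> socket. Two syscalls per
     * chunk is why small transfers favor recv()+send().
     */
    static ssize_t splice_forward(int from_fd, int to_fd,
                                  int pipefd[2], size_t chunk)
    {
            ssize_t n;

            n = splice(from_fd, NULL, pipefd[1], NULL, chunk,
                       SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                    return n;

            return splice(pipefd[0], NULL, to_fd, NULL, n,
                          SPLICE_F_MOVE | SPLICE_F_MORE);
    }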

In the end in haproxy I'm seeing about 300k req+resp per second in
HTTP keep-alive and more like 100-130k with close, when disabling
TCP quick-ack during accept() and connect() to save one ACK on each
side (just doing this generally brings performance gains between 7
and 10%).

Regarding kernel-side protocol parsing, there's an unfortunate trend
of moving more and more protocols to userland due to these protocols
evolving very quickly. At least you'll want to find a way to provide
these parsers from userspace, which will inevitably come with its set
of problems or limitations :-/

All this to say that while I can definitely imagine the benefits of
having in-kernel sockets for in-kernel L7 processing or filtering,
I'm having strong doubts about the benefits that userland may receive
by using this (or maybe you already have any performance numbers
supporting this ?).

Just my two cents,
Willy


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 19:54 ` [PATCH RFC 0/2] kproxy: Kernel Proxy Willy Tarreau
@ 2017-06-29 20:40   ` Tom Herbert
  2017-06-29 20:58     ` Willy Tarreau
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Herbert @ 2017-06-29 20:40 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Linux Kernel Network Developers, David S. Miller

Hi Willy,

Thanks for the comments!

> In fact that's not really what I observe in the field. In practice, large
> data streams are cheaply relayed using splice(), I could achieve
> 60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
> And when you use SSL, the cost of the copy to/from kernel is small
> compared to all the crypto operations surrounding this.
>
Right, getting rid of the extra crypto operations and so called "SSL
inspection" is the ultimate goal this is going towards.

> Another point is that most HTTP requests are quite small (typically ~80%
> are 20kB or less), and in this case the L7 processing and certain syscalls
> significantly dominate the operations; data copies are comparatively
> small. Simply parsing an HTTP header takes time (when you do it correctly).
> You can hardly parse and index more than 800MB-1GB/s of HTTP headers
> per core, which limits you to roughly 1-1.2 M req+resp per second for
> a 400 byte request and a 400 byte response, and that's without any
> processing at all. But when doing this, certain syscalls like connect(),
> close() or epoll_ctl() start to be quite expensive. Even splice() is
> expensive to forward small data chunks because you need two calls, and
> recv+send is faster. In fact our TCP stack has been so much optimized
> for realistic workloads over the years that it becomes hard to gain
> more by cheating on it :-)
>
> In the end in haproxy I'm seeing about 300k req+resp per second in
> HTTP keep-alive and more like 100-130k with close, when disabling
> TCP quick-ack during accept() and connect() to save one ACK on each
> side (just doing this generally brings performance gains between 7
> and 10%).
>
HTTP is only one use case. There are other interesting use cases such as
those in container security where the application protocol might be
something like simple RPC. Performance is relevant because we
potentially want security applied to every message in every
communication in a containerized data center. Putting the userspace
hop in the datapath of every packet is known to be problematic, not
just for the performance hit but also because it increases the attack
surface on users' privacy.

> Regarding kernel-side protocol parsing, there's an unfortunate trend
> of moving more and more protocols to userland due to these protocols
> evolving very quickly. At least you'll want to find a way to provide
> these parsers from userspace, which will inevitably come with its set
> of problems or limitations :-/
>
That's why everything is going BPF now ;-)

> All this to say that while I can definitely imagine the benefits of
> having in-kernel sockets for in-kernel L7 processing or filtering,
> I'm having strong doubts about the benefits that userland may receive
> by using this (or maybe you already have any performance numbers
> supporting this ?).
>
Nope, no numbers yet.

> Just my two cents,
> Willy


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 20:40   ` Tom Herbert
@ 2017-06-29 20:58     ` Willy Tarreau
  2017-06-29 23:43       ` Tom Herbert
  0 siblings, 1 reply; 13+ messages in thread
From: Willy Tarreau @ 2017-06-29 20:58 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, David S. Miller

On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> > In fact that's not really what I observe in the field. In practice, large
> > data streams are cheaply relayed using splice(), I could achieve
> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
> > And when you use SSL, the cost of the copy to/from kernel is small
> > compared to all the crypto operations surrounding this.
> >
> Right, getting rid of the extra crypto operations and so called "SSL
> inspection" is the ultimate goal this is going towards.

Yep but in order to take decisions at L7 you need to decapsulate SSL.

> HTTP is only one use case. There are other interesting use cases such as
> those in container security where the application protocol might be
> something like simple RPC.

OK that indeed makes sense in such environments.

> Performance is relevant because we
> potentially want security applied to every message in every
> communication in a containerized data center. Putting the userspace
> hop in the datapath of every packet is known to be problematic, not
> just for the performance hit but also because it increases the attack
> surface on users' privacy.

While I totally agree on the performance hit when inspecting each packet,
I fail to see the relation with users' privacy. In fact under some
circumstances it can even be the opposite. For example, using something
like kTLS for a TCP/HTTP proxy can result in cleartext being observable
in strace while it's not visible when TLS is terminated in userland because
all you see are openssl's read()/write() operations. Maybe you have specific
attacks in mind ?

> > Regarding kernel-side protocol parsing, there's an unfortunate trend
> > of moving more and more protocols to userland due to these protocols
> > evolving very quickly. At least you'll want to find a way to provide
> > these parsers from userspace, which will inevitably come with its set
> > of problems or limitations :-/
> >
> That's why everything is going BPF now ;-)

Yes, I knew you were going to suggest this :-)  I'm still prudent on it
to be honest. I don't think it would be that easy to implement an HPACK
encoder/decoder using BPF. And even regarding just plain HTTP parsing,
certain very small operations in haproxy's parser can quickly result in
a 10% performance degradation when improperly optimized (ie: changing a
"likely", altering branch prediction, or cache walk patterns when using
arrays to evaluate character classes faster). But for general usage I
indeed think it should be OK.

> > All this to say that while I can definitely imagine the benefits of
> > having in-kernel sockets for in-kernel L7 processing or filtering,
> > I'm having strong doubts about the benefits that userland may receive
> > by using this (or maybe you already have any performance numbers
> > supporting this ?).
> >
> Nope, no numbers yet.

OK, no worries. Thanks for your explanations!

Willy


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 18:27 [PATCH RFC 0/2] kproxy: Kernel Proxy Tom Herbert
                   ` (2 preceding siblings ...)
  2017-06-29 19:54 ` [PATCH RFC 0/2] kproxy: Kernel Proxy Willy Tarreau
@ 2017-06-29 22:04 ` Thomas Graf
  2017-06-29 23:21   ` Tom Herbert
  3 siblings, 1 reply; 13+ messages in thread
From: Thomas Graf @ 2017-06-29 22:04 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, David S. Miller

Hi Tom

On 29 June 2017 at 11:27, Tom Herbert <tom@herbertland.com> wrote:
> This is raw, minimally tested, and error handling needs work. Posting
> as an RFC to get feedback on the design...
>
> Sidecar proxies are becoming quite popular on servers as a means to
> perform layer 7 processing on application data as it is sent. Such
> sidecars are used for SSL proxies, application firewalls, and L7
> load balancers. While these proxies provide nice functionality,
> their performance is obviously terrible since all the data needs
> to take an extra hop through userspace.

I really appreciate this work. It would have been nice to at least
attribute me in some way as this is exactly what I presented at
Netconf 2017 [0].

I'm also wondering why this is not built on top of KCM which you
suggested to use when we discussed this.

[0] https://docs.google.com/presentation/d/1dwSKSBGpUHD3WO5xxzZWj8awV_-xL-oYhvqQMOBhhtk/edit#slide=id.g203aae413f_0_0


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 22:04 ` Thomas Graf
@ 2017-06-29 23:21   ` Tom Herbert
  2017-06-30  0:49     ` Thomas Graf
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Herbert @ 2017-06-29 23:21 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, David S. Miller

On Thu, Jun 29, 2017 at 3:04 PM, Thomas Graf <tgraf@suug.ch> wrote:
> Hi Tom
>
> On 29 June 2017 at 11:27, Tom Herbert <tom@herbertland.com> wrote:
>> This is raw, minimally tested, and error handling needs work. Posting
>> as an RFC to get feedback on the design...
>>
>> Sidecar proxies are becoming quite popular on servers as a means to
>> perform layer 7 processing on application data as it is sent. Such
>> sidecars are used for SSL proxies, application firewalls, and L7
>> load balancers. While these proxies provide nice functionality,
>> their performance is obviously terrible since all the data needs
>> to take an extra hop through userspace.
>
Hi Thomas,

> I really appreciate this work. It would have been nice to at least
> attribute me in some way as this is exactly what I presented at
> Netconf 2017 [0].
>
Sure, will do that!

> I'm also wondering why this is not built on top of KCM which you
> suggested to use when we discussed this.
>

I think the main part of that discussion was around the stream parser,
which is needed for message delineation. For a 1:1 proxy, KCM is
probably overkill (the whole KCM data path and lock becomes
superfluous). Also, there's no concept of creating a whole message
before routing it; in the 1:1 case we should let the message pass
through once it's cleared by the filter (this is the strparser change
I referred to). As I mentioned, for L7 load balancing we would want a
multiplexor, probably also M:N, but the structure is different since
there are still no user-facing sockets; they're all TCP, for instance.
IMO, the 1:1 proxy case is compelling to solve in itself...

Tom



> [0] https://docs.google.com/presentation/d/1dwSKSBGpUHD3WO5xxzZWj8awV_-xL-oYhvqQMOBhhtk/edit#slide=id.g203aae413f_0_0


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 20:58     ` Willy Tarreau
@ 2017-06-29 23:43       ` Tom Herbert
  2017-06-30  4:30         ` Willy Tarreau
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Herbert @ 2017-06-29 23:43 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Linux Kernel Network Developers, David S. Miller

On Thu, Jun 29, 2017 at 1:58 PM, Willy Tarreau <w@1wt.eu> wrote:
> On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
>> > In fact that's not really what I observe in the field. In practice, large
>> > data streams are cheaply relayed using splice(), I could achieve
>> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
>> > And when you use SSL, the cost of the copy to/from kernel is small
>> > compared to all the crypto operations surrounding this.
>> >
>> Right, getting rid of the extra crypto operations and so called "SSL
>> inspection" is the ultimate goal this is going towards.
>
> Yep but in order to take decisions at L7 you need to decapsulate SSL.
>
Decapsulate or decrypt? There's a big difference... :-) I'm aiming
to just have to decapsulate.

>> HTTP is only one use case. There are other interesting use cases such as
>> those in container security where the application protocol might be
>> something like simple RPC.
>
> OK that indeed makes sense in such environments.
>
>> Performance is relevant because we
>> potentially want security applied to every message in every
>> communication in a containerized data center. Putting the userspace
>> hop in the datapath of every packet is known to be problematic, not
>> just for the performance hit but also because it increases the attack
>> surface on users' privacy.
>
> While I totally agree on the performance hit when inspecting each packet,
> I fail to see the relation with users' privacy. In fact under some
> circumstances it can even be the opposite. For example, using something
> like kTLS for a TCP/HTTP proxy can result in cleartext being observable
> in strace while it's not visible when TLS is terminated in userland because
> all you see are openssl's read()/write() operations. Maybe you have specific
> attacks in mind ?
>
No, just the normal problem of making yet one more tool systematically
have access to user data.

>> > Regarding kernel-side protocol parsing, there's an unfortunate trend
>> > of moving more and more protocols to userland due to these protocols
>> > evolving very quickly. At least you'll want to find a way to provide
>> > these parsers from userspace, which will inevitably come with its set
>> > of problems or limitations :-/
>> >
>> That's why everything is going BPF now ;-)
>
> Yes, I knew you were going to suggest this :-)  I'm still prudent on it
> to be honest. I don't think it would be that easy to implement an HPACK
> encoder/decoder using BPF. And even regarding just plain HTTP parsing,
> certain very small operations in haproxy's parser can quickly result in
> a 10% performance degradation when improperly optimized (ie: changing a
> "likely", altering branch prediction, or cache walk patterns when using
> arrays to evaluate character classes faster). But for general usage I
> indeed think it should be OK.
>
HTTP might qualify as a special case, and I believe there's already
been some work to put HTTP in the kernel by Alexander Krizhanovsky and
others. In this case maybe an HTTP parser could be a front end before
BPF. Although, it's pretty clear we'll need regex in BPF if we want to
use it with HTTP.

Tom


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 23:21   ` Tom Herbert
@ 2017-06-30  0:49     ` Thomas Graf
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Graf @ 2017-06-30  0:49 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, David S. Miller

On 29 June 2017 at 16:21, Tom Herbert <tom@herbertland.com> wrote:
> I think the main part of that discussion was around the stream parser,
> which is needed for message delineation. For a 1:1 proxy, KCM is
> probably overkill (the whole KCM data path and lock becomes
> superfluous). Also, there's no concept of creating a whole message
> before routing it; in the 1:1 case we should let the message pass
> through once it's cleared by the filter (this is the strparser change
> I referred to). As I mentioned, for L7 load balancing we would want a
> multiplexor, probably also M:N, but the structure is different since
> there are still no user-facing sockets; they're all TCP, for instance.
> IMO, the 1:1 proxy case is compelling to solve in itself...

I see. I was definitely thinking m:n. We should definitely evaluate
whether it makes sense to have a specific 1:1 implementation if we
need m:n anyway. For L7 LB, m:n seems obvious, as a particular L4
connection may act as a transport for multiple requests bidirectionally.
KCM looks like a good starting point for that.

When I talked about enqueueing entire messages, the main concern is to
buffer up the payload after the TLS handshake to the point where a
forwarding decision can be made. I would definitely not advocate
buffering entire messages before starting to forward.


* Re: [PATCH RFC 0/2] kproxy: Kernel Proxy
  2017-06-29 23:43       ` Tom Herbert
@ 2017-06-30  4:30         ` Willy Tarreau
  0 siblings, 0 replies; 13+ messages in thread
From: Willy Tarreau @ 2017-06-30  4:30 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, David S. Miller

On Thu, Jun 29, 2017 at 04:43:28PM -0700, Tom Herbert wrote:
> On Thu, Jun 29, 2017 at 1:58 PM, Willy Tarreau <w@1wt.eu> wrote:
> > On Thu, Jun 29, 2017 at 01:40:26PM -0700, Tom Herbert wrote:
> >> > In fact that's not really what I observe in the field. In practice, large
> >> > data streams are cheaply relayed using splice(), I could achieve
> >> > 60 Gbps of HTTP forwarding via HAProxy on a 4-core xeon 2 years ago.
> >> > And when you use SSL, the cost of the copy to/from kernel is small
> >> > compared to all the crypto operations surrounding this.
> >> >
> >> Right, getting rid of the extra crypto operations and so called "SSL
> >> inspection" is the ultimate goal this is going towards.
> >
> > Yep but in order to take decisions at L7 you need to decapsulate SSL.
> >
> Decapsulate or decrypt? There's a big difference... :-) I'm aiming
> to just have to decapsulate.

Sorry, but what distinction do you make? For me "decapsulate" means
"extract the next layer", and for SSL it means you need to decrypt.

> >
> >> Performance is relevant because we
> >> potentially want security applied to every message in every
> >> communication in a containerized data center. Putting the userspace
> >> hop in the datapath of every packet is known to be problematic, not
> >> just for the performance hit but also because it increases the attack
> >> surface on users' privacy.
> >
> > While I totally agree on the performance hit when inspecting each packet,
> > I fail to see the relation with users' privacy. In fact under some
> > circumstances it can even be the opposite. For example, using something
> > like kTLS for a TCP/HTTP proxy can result in cleartext being observable
> > in strace while it's not visible when TLS is terminated in userland because
> > all you see are openssl's read()/write() operations. Maybe you have specific
> > attacks in mind ?
> >
> No, just the normal problem of making yet one more tool systematically
> have access to user data.

OK.

> >> > Regarding kernel-side protocol parsing, there's an unfortunate trend
> >> > of moving more and more protocols to userland due to these protocols
> >> > evolving very quickly. At least you'll want to find a way to provide
> >> > these parsers from userspace, which will inevitably come with its set
> >> > of problems or limitations :-/
> >> >
> >> That's why everything is going BPF now ;-)
> >
> > Yes, I knew you were going to suggest this :-)  I'm still prudent on it
> > to be honest. I don't think it would be that easy to implement an HPACK
> > encoder/decoder using BPF. And even regarding just plain HTTP parsing,
> > certain very small operations in haproxy's parser can quickly result in
> > a 10% performance degradation when improperly optimized (ie: changing a
> > "likely", altering branch prediction, or cache walk patterns when using
> > arrays to evaluate character classes faster). But for general usage I
> > indeed think it should be OK.
> >
> HTTP might qualify as a special case, and I believe there's already
> been some work to put HTTP in the kernel by Alexander Krizhanovsky and
> others. In this case maybe an HTTP parser could be a front end before
> BPF.

It could indeed be an option. We've seen this with Tux in the past.

> Although, it's pretty clear we'll need regex in BPF if we want to
> use it with HTTP.

I think so as well. And some loop-like operations (foreach or stuff like
this) so that they remain bounded.

Willy


* Re: [PATCH RFC 1/2] skbuff: Function to send an skbuf on a socket
  2017-06-29 18:27 ` [PATCH RFC 1/2] skbuff: Function to send an skbuf on a socket Tom Herbert
@ 2017-07-03 13:00   ` David Miller
  0 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2017-07-03 13:00 UTC (permalink / raw)
  To: tom; +Cc: netdev

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 29 Jun 2017 11:27:04 -0700

> +int skb_send_sock(struct sk_buff *skb, struct socket *sock, unsigned int offset)
> +{
> +	unsigned int sent = 0;
> +	unsigned int ret;
> +	unsigned short fragidx;

Please use reverse christmas tree ordering for these local variables.

> +	/* Deal with head data */
> +	while (offset < skb_headlen(skb)) {
> +		size_t len = skb_headlen(skb) - offset;
> +		struct kvec kv;
> +		struct msghdr msg;

Likewise.
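
("Reverse christmas tree" is the netdev convention of ordering local
variable declarations from longest line to shortest, so the locals
quoted above would become, for example:

    unsigned short fragidx;
    unsigned int sent = 0;
    unsigned int ret;

)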


* Re: [PATCH RFC 2/2] kproxy: Kernel proxy
  2017-06-29 18:27 ` [PATCH RFC 2/2] kproxy: Kernel proxy Tom Herbert
@ 2017-07-03 13:01   ` David Miller
  0 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2017-07-03 13:01 UTC (permalink / raw)
  To: tom; +Cc: netdev

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 29 Jun 2017 11:27:05 -0700

> A proc file (/proc/net/kproxy) is created to list all the running
> kernel proxies and relevant statistics for them.

proc is deprecated for dumping information like this, please use
sock diag instead.
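
For reference, a sock diag based replacement would register a handler
for the new family and answer netlink dump requests instead of printing
text. A rough sketch of the registration side only (the dump callback
name and body here are hypothetical):

    #include <linux/sock_diag.h>
    #include <linux/netlink.h>

    /* Hypothetical: walk knet->kproxy_list and emit one netlink
     * message per proxy, replacing the /proc text output.
     */
    static int kproxy_diag_dump(struct sk_buff *skb, struct nlmsghdr *nlh)
    {
            return 0;
    }

    static const struct sock_diag_handler kproxy_diag_handler = {
            .family = AF_KPROXY,
            .dump   = kproxy_diag_dump,
    };

    static int __init kproxy_diag_init(void)
    {
            return sock_diag_register(&kproxy_diag_handler);
    }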

