* [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
@ 2015-11-20 21:21 Tom Herbert
  2015-11-20 21:21 ` [PATCH net-next 1/6] rcu: Add list_next_or_null_rcu Tom Herbert
                   ` (6 more replies)
  0 siblings, 7 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 21:21 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

Kernel Connection Multiplexor (KCM) is a facility that provides a
message-based interface over TCP for generic application protocols.
The motivation for this is the observation that although TCP is a
byte stream transport protocol with no concept of message
boundaries, a common use case is to implement a framed application
layer protocol running over TCP. To date, most TCP stacks offer a
byte stream API for applications, which places the burden of message
delineation, message I/O operation atomicity, and load balancing
on the application. With KCM an application can efficiently send
and receive application protocol messages over TCP using a
datagram interface.

In order to delineate messages in a TCP stream for receive in KCM,
the kernel implements a message parser. For this we chose to employ
BPF, which is applied to the TCP stream. BPF code parses application
layer messages and returns a message length. Nearly all binary
application protocols are parsable in this manner, so KCM should be
applicable across a wide range of applications. Other than message
length determination on receive, KCM does not require any other
application specific awareness. KCM does not implement any other
application protocol semantics -- these are provided in userspace or
could be implemented in a kernel module layered above KCM.
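
As an illustration, consider a hypothetical protocol whose messages
begin with a 2-byte big-endian length field covering the payload. A
minimal sketch of a BPF parser for it, written against the helpers
used by the kernel's BPF samples (not part of this series):

  /* Sketch: KCM message length parser for a hypothetical protocol
   * framed as | 2-byte BE payload length | payload |. Returns the
   * total message length to consume from the stream. load_half()
   * is the byte-order-converting loader from samples/bpf.
   */
  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"	/* from samples/bpf */

  int kcm_msg_parser(struct __sk_buff *skb)
  {
  	/* 2-byte header plus the payload it describes */
  	return load_half(skb, 0) + 2;
  }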

KCM implements an NxM multiplexor in the kernel as diagrammed below:

+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                 |               |                |
      +-----------+     |               |     +----------+
                  |     |               |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |  |
       +---------+   |           |           |  ------------+
       |             |           |           |              |
+----------+  +----------+  +----------+  +----------+ +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
+----------+  +----------+  +----------+  +----------+ +----------+
      |              |           |            |             |
+----------+  +----------+  +----------+  +----------+ +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
+----------+  +----------+  +----------+  +----------+ +----------+

The KCM sockets provide the datagram interface to applications;
Psocks hold the state for each attached TCP connection (i.e. where
message delineation is performed on receive).

A description of the APIs and design can be found in the included
Documentation/networking/kcm.txt.

In this patch set:

  - Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to
    indicate that more messages will be sent on the socket. The stack
    may batch messages up if it is beneficial for transmission.
  - In sendmmsg, set MSG_BATCH in all sub messages except for the last
    one.
  - In order to allow sendmmsg to contain multiple messages with
    SOCK_SEQPACKET, we allow each msg_hdr in the sendmmsg to set
    MSG_EOR.
  - Add KCM module
    - This supports SOCK_DGRAM and SOCK_SEQPACKET.
  - KCM documentation
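
To make the API concrete, here is a hedged sketch of the userspace
flow with this series applied (struct kcm_attach, SIOCKCMATTACH, and
KCMPROTO_CONNECTED come from the new uapi header; bpf_fd is assumed
to hold a socket filter already loaded via the bpf() syscall):

  /* Sketch only: create a KCM socket and attach an established
   * TCP socket to its multiplexor.
   */
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <linux/sockios.h>	/* SIOCPROTOPRIVATE */
  #include <linux/kcm.h>

  int kcm_attach_tcp(int tcp_fd, int bpf_fd)
  {
  	struct kcm_attach attach = {
  		.fd = tcp_fd,		/* connected TCP socket */
  		.bpf_fd = bpf_fd,	/* message length parser */
  	};
  	int kcm_fd;

  	kcm_fd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
  	if (kcm_fd < 0)
  		return -1;

  	if (ioctl(kcm_fd, SIOCKCMATTACH, &attach) < 0)
  		return -1;

  	return kcm_fd;
  }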

Testing:

Dave Watson has integrated KCM into Thrift and we intend to open
source these changes. An example of this is in:

https://github.com/djwatson/fbthrift/commit/dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc

Some initial KCM Thrift benchmark numbers (comment from Dave):

Thrift by default ties a single connection to a single thread.  KCM is
instead able to load balance multiple connections across multiple epoll
loops easily.

A test sending ~5k bytes of data to a kcm thrift server, dropping the
bytes on recv:

               QPS      Latency / std dev
  without KCM  70336    209 / 123
  with KCM     70353    191 / 124

A test sending a small request, then doing work in the epoll thread,
before serving more requests:

               QPS      Latency / std dev
  without KCM  14282    559 / 602
  with KCM     23192    344 / 234

At the high end, there's definitely some additional kernel overhead:

Cranking the pipelining way up, with lots of small requests:

               QPS      Latency / std dev
  without KCM  1863429  127 / 119
  with KCM     1337713  192 / 241

---

So for a "realistic" workload, KCM performs pretty well (second case).
Under extreme conditions of highest tps we still have some work to do.
In its nature a multiplexor will spread work between CPUs which is
logically good for load balancing but coan conflict with the goal
promoting affinity. Batching messages on both send and receive are
the means to recoup performance.

Future support:

 - Integration with TLS (TLS-in-kernel is a separate initiative).
 - Page operations/splice support
 - Unconnected KCM sockets. These will be able to attach sockets to
   different destinations; AF_KCM addresses will be used in sendmsg
   and recvmsg to indicate the destination
 - Explore more utility in performing BPF inline with a TCP data stream
   (setting SO_MARK, rxhash for messages being sent or received on
   KCM sockets).
 - Performance work
   - Diagnose performance issues under high message load

FAQ (Questions posted on LWN)

Q: Why do this in the kernel?

A: Because the kernel is good at scheduling threads and steering packets
   to threads. KCM fits well into this model since it allows the unit
   of work for scheduling and steering to be the application layer
   messages themselves. KCM should be thought of as generic application
   protocol acceleration. It adheres to the philosophy that the kernel
   provides generic and extensible interfaces.
   
Q: How can adding code in the path yield better performance?

A: It is true that for just sending or receiving a single message there
   would be some performance loss since the code path is longer (for
   instance comparing netperf to KCM). But for real production
   applications performance takes on many dynamics. Parallelism, context
   switching, affinity, granularity of locking, and load balancing are
   all relevant. The theory of KCM is that by providing an
   application-centric interface, the kernel can better support these
   performance characteristics.

Q: Why not use an existing message-oriented protocol such as RUDP,
   DCCP, SCTP, RDS, and others?

A: Because that would entail using a completely new transport protocol.
   Deploying a new protocol at scale is either a huge undertaking or
   fundamentally infeasible. This is true in both the Internet and the
   data center, due in large part to protocol ossification.
   Besides, we want KCM to work with existing, well deployed application
   protocols that we couldn't change even if we wanted to (e.g. http/2).

   KCM simply defines a new interface method; it does not redefine any
   aspect of the transport protocol nor the application protocol, nor
   set any new requirements on these. Neither does KCM attempt to
   implement any application protocol logic other than message
   delineation in the stream. These are fundamental requirements of KCM.

Q: How does this affect TCP?

A: It doesn't, not in the slightest. The use of KCM can be one-sided,
   KCM has no effect on the wire.

Q: Why force TCP into doing something it's not designed for?

A: TCP is defined as transport protocol and there is no standard that
   says the API into TCP must be stream based sockets, or for that
   matter sockets at all (or even that TCP needs to be implemented in a
   kernel). KCM is not inconsistent with the design of TCP just because
   to makes an message based interface over TCP, if it were then every
   application protocol sending messages over TCP would also be! :-)

Q: What about the problem of connections with a very slow rate of
   incoming data? As a result your application can get storms of very
   short reads. And it actually happens a lot with connections from
   mobile devices and it is a problem for servers handling a lot of
   connections.

A: The storm of short reads will occur regardless of whether KCM is used
   or not. KCM does have one advantage in this scenario, though: it will
   only wake up the application when a full message has been received,
   not for each packet that makes up part of a bigger message. If a
   bunch of small messages are received, the application can receive
   them in batches using recvmmsg.
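
   For example, a hedged sketch of draining a batch of complete KCM
   messages with a single recvmmsg() call (standard Linux API; buffer
   sizes chosen arbitrarily for illustration):

     /* Sketch: receive up to 8 complete messages from a KCM socket
      * in one system call; each mmsghdr yields one whole message.
      */
     #define _GNU_SOURCE
     #include <sys/socket.h>
     #include <sys/uio.h>

     #define BATCH  8
     #define MSGLEN 4096

     int recv_batch(int kcm_fd)
     {
     	static char bufs[BATCH][MSGLEN];
     	struct iovec iov[BATCH];
     	struct mmsghdr msgs[BATCH] = {};
     	int i;

     	for (i = 0; i < BATCH; i++) {
     		iov[i].iov_base = bufs[i];
     		iov[i].iov_len = MSGLEN;
     		msgs[i].msg_hdr.msg_iov = &iov[i];
     		msgs[i].msg_hdr.msg_iovlen = 1;
     	}

     	/* Number of messages received, or -1 on error */
     	return recvmmsg(kcm_fd, msgs, BATCH, MSG_DONTWAIT, NULL);
     }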

Q: Why not just use DPDK, or at least provide KCM-like functionality in
   DPDK?

A: DPDK, or more generally OS bypass presumably with a TCP stack in
   userland, presents a different model of load balancing than that of
   KCM (and the kernel). KCM implements load balancing of messages
   across the threads of an application, whereas DPDK load balances
   based on queues which are more static and coarse-grained since
   multiple connections are bound to queues. DPDK works best when
   processing of packets is silo'ed in a thread on the CPU processing
   a queue, and packet processing (for both the stack and application)
   is fairly uniform. KCM works well for applications where the amount
   of work to process messages varies and application work is commonly
   delegated to worker threads, often on different CPUs.
 
   The message based interface over TCP is something that could be
   provided by a DPDK or OS bypass library.

Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web
   sockets?

A: Yes. KCM is most appropriate for message based protocols over TCP
   where it is easy to deduce the message length (e.g. a length field)
   and the protocol implements its own message ordering semantics.
   Fortunately this encompasses many modern protocols.
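
   To make the HTTP/2 case concrete: an HTTP/2 frame carries a 24-bit
   big-endian payload length at the start of a fixed 9-byte header
   (per RFC 7540), so a length parser reduces to a few loads. A
   sketch in the style of the kernel's BPF samples (not part of this
   series; the non-framed connection preface is ignored here):

     /* Sketch: full HTTP/2 frame length = payload length + 9-byte
      * frame header. load_byte() as in samples/bpf.
      */
     int http2_frame_len(struct __sk_buff *skb)
     {
     	return (load_byte(skb, 0) << 16 |
     		load_byte(skb, 1) << 8  |
     		load_byte(skb, 2)) + 9;
     }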

Tom Herbert (6):
  rcu: Add list_next_or_null_rcu
  net: Make sock_alloc exportable
  net: Add MSG_BATCH flag
  kcm: Kernel Connection Multiplexor module
  kcm: Add statistics and proc interfaces
  kcm: Add description in Documentation

 Documentation/networking/kcm.txt |  273 +++++
 include/linux/net.h              |    1 +
 include/linux/rculist.h          |   21 +
 include/linux/socket.h           |    7 +-
 include/net/kcm.h                |  223 +++++
 include/uapi/linux/kcm.h         |   27 +
 net/Kconfig                      |    1 +
 net/Makefile                     |    1 +
 net/kcm/Kconfig                  |   10 +
 net/kcm/Makefile                 |    3 +
 net/kcm/kcmproc.c                |  422 ++++++++
 net/kcm/kcmsock.c                | 2040 ++++++++++++++++++++++++++++++++++++++
 net/socket.c                     |   20 +-
 13 files changed, 3043 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/networking/kcm.txt
 create mode 100644 include/net/kcm.h
 create mode 100644 include/uapi/linux/kcm.h
 create mode 100644 net/kcm/Kconfig
 create mode 100644 net/kcm/Makefile
 create mode 100644 net/kcm/kcmproc.c
 create mode 100644 net/kcm/kcmsock.c

-- 
2.4.6


* [PATCH net-next 1/6] rcu: Add list_next_or_null_rcu
  2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
@ 2015-11-20 21:21 ` Tom Herbert
  2015-11-20 21:21 ` [PATCH net-next 2/6] net: Make sock_alloc exportable Tom Herbert
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 21:21 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

This is a convenience function that returns the next entry in an RCU
list or NULL if at the end of the list.
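
A hedged usage sketch (hypothetical item type, shown only to
illustrate the calling convention under rcu_read_lock()):

	#include <linux/rculist.h>
	#include <linux/printk.h>

	struct item {
		struct list_head list;
		int val;
	};

	/* Peek at the entry following 'cur', or note the list end */
	static void peek_next(struct list_head *head, struct item *cur)
	{
		struct item *nxt;

		rcu_read_lock();
		nxt = list_next_or_null_rcu(head, &cur->list,
					    struct item, list);
		if (nxt)
			pr_info("next val=%d\n", nxt->val);
		else
			pr_info("'cur' was the last entry\n");
		rcu_read_unlock();
	}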

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/rculist.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 5ed5409..a9376fd 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -290,6 +290,27 @@ static inline void list_splice_init_rcu(struct list_head *list,
 })
 
 /**
+ * list_next_or_null_rcu - get the next element from a list
+ * @head:	the head for the list.
+ * @ptr:        the list head to take the next element from.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ *
+ * Note that if the ptr is at the end of the list, NULL is returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
+ */
+#define list_next_or_null_rcu(head, ptr, type, member) \
+({ \
+	struct list_head *__head = (head); \
+	struct list_head *__ptr = (ptr); \
+	struct list_head *__next = READ_ONCE(__ptr->next); \
+	likely(__next != __head) ? list_entry_rcu(__next, type, \
+						  member) : NULL; \
+})
+
+/**
  * list_for_each_entry_rcu	-	iterate over rcu list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
-- 
2.4.6


* [PATCH net-next 2/6] net: Make sock_alloc exportable
  2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
  2015-11-20 21:21 ` [PATCH net-next 1/6] rcu: Add list_next_or_null_rcu Tom Herbert
@ 2015-11-20 21:21 ` Tom Herbert
  2015-11-20 21:21 ` [PATCH net-next 3/6] net: Add MSG_BATCH flag Tom Herbert
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 21:21 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

Export it for cases where we want to create sockets by hand.
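
A hedged sketch of the intended in-kernel use (mirroring how a module
might hand-build a socket from an existing one; file attachment and
full error handling elided):

	#include <linux/err.h>
	#include <linux/net.h>

	/* Sketch: allocate a bare socket and clone type/ops */
	static struct socket *clone_sock_shell(struct socket *osock)
	{
		struct socket *newsock;

		newsock = sock_alloc();
		if (!newsock)
			return ERR_PTR(-ENOMEM);

		newsock->type = osock->type;
		newsock->ops = osock->ops;

		return newsock;
	}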

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/net.h | 1 +
 net/socket.c        | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 70ac5e2..f9e3d3a 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -210,6 +210,7 @@ int __sock_create(struct net *net, int family, int type, int proto,
 int sock_create(int family, int type, int proto, struct socket **res);
 int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res);
 int sock_create_lite(int family, int type, int proto, struct socket **res);
+struct socket *sock_alloc(void);
 void sock_release(struct socket *sock);
 int sock_sendmsg(struct socket *sock, struct msghdr *msg);
 int sock_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/net/socket.c b/net/socket.c
index dd2c247..21373f8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -532,7 +532,7 @@ static const struct inode_operations sockfs_inode_ops = {
  *	NULL is returned.
  */
 
-static struct socket *sock_alloc(void)
+struct socket *sock_alloc(void)
 {
 	struct inode *inode;
 	struct socket *sock;
@@ -553,6 +553,7 @@ static struct socket *sock_alloc(void)
 	this_cpu_add(sockets_in_use, 1);
 	return sock;
 }
+EXPORT_SYMBOL(sock_alloc);
 
 /**
  *	sock_release	-	close a socket
-- 
2.4.6


* [PATCH net-next 3/6] net: Add MSG_BATCH flag
  2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
  2015-11-20 21:21 ` [PATCH net-next 1/6] rcu: Add list_next_or_null_rcu Tom Herbert
  2015-11-20 21:21 ` [PATCH net-next 2/6] net: Make sock_alloc exportable Tom Herbert
@ 2015-11-20 21:21 ` Tom Herbert
  2015-11-23 10:02   ` Hannes Frederic Sowa
  2015-11-20 21:21 ` [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module Tom Herbert
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 21:21 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet; they are sent individually.

MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.

This patch also updates sendmmsg so that each contained message except
for the last one is marked as MSG_BATCH.
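
From the userspace side, a hedged sketch of what this enables (the
kernel adds MSG_BATCH to all but the last message of the batch, so a
protocol such as KCM can defer transmission until the batch ends):

	#define _GNU_SOURCE
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Sketch: send two messages in one call; with this patch the
	 * first is flagged MSG_BATCH internally, the last is not.
	 */
	int send_two(int fd, struct iovec *a, struct iovec *b)
	{
		struct mmsghdr msgs[2] = {};

		msgs[0].msg_hdr.msg_iov = a;
		msgs[0].msg_hdr.msg_iovlen = 1;
		msgs[1].msg_hdr.msg_iov = b;
		msgs[1].msg_hdr.msg_iovlen = 1;

		return sendmmsg(fd, msgs, 2, 0);
	}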

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/socket.h |  1 +
 net/socket.c           | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5bf59c8..d834af2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -274,6 +274,7 @@ struct ucred {
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
+#define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/net/socket.c b/net/socket.c
index 21373f8..ef64b72 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1880,7 +1880,7 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 
 static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 			 struct msghdr *msg_sys, unsigned int flags,
-			 struct used_address *used_address)
+			 struct used_address *used_address, bool doing_mmsg)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
@@ -1906,6 +1906,8 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 
 	if (msg_sys->msg_controllen > INT_MAX)
 		goto out_freeiov;
+	if (doing_mmsg)
+		flags |= (msg_sys->msg_flags & MSG_EOR);
 	ctl_len = msg_sys->msg_controllen;
 	if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
 		err =
@@ -1984,7 +1986,7 @@ long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned flags)
 	if (!sock)
 		goto out;
 
-	err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL);
+	err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, false);
 
 	fput_light(sock->file, fput_needed);
 out:
@@ -2011,6 +2013,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	struct compat_mmsghdr __user *compat_entry;
 	struct msghdr msg_sys;
 	struct used_address used_address;
+	unsigned int oflags = flags;
 
 	if (vlen > UIO_MAXIOV)
 		vlen = UIO_MAXIOV;
@@ -2025,11 +2028,16 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	entry = mmsg;
 	compat_entry = (struct compat_mmsghdr __user *)mmsg;
 	err = 0;
+	flags |= MSG_BATCH;
 
 	while (datagrams < vlen) {
+		if (datagrams == vlen - 1)
+			flags = oflags;
+
 		if (MSG_CMSG_COMPAT & flags) {
 			err = ___sys_sendmsg(sock, (struct user_msghdr __user *)compat_entry,
-					     &msg_sys, flags, &used_address);
+					     &msg_sys, flags, &used_address,
+					     true);
 			if (err < 0)
 				break;
 			err = __put_user(err, &compat_entry->msg_len);
@@ -2037,7 +2045,8 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		} else {
 			err = ___sys_sendmsg(sock,
 					     (struct user_msghdr __user *)entry,
-					     &msg_sys, flags, &used_address);
+					     &msg_sys, flags, &used_address,
+					     true);
 			if (err < 0)
 				break;
 			err = put_user(err, &entry->msg_len);
-- 
2.4.6


* [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
  2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
                   ` (2 preceding siblings ...)
  2015-11-20 21:21 ` [PATCH net-next 3/6] net: Add MSG_BATCH flag Tom Herbert
@ 2015-11-20 21:21 ` Tom Herbert
  2015-11-20 22:50   ` Sowmini Varadhan
                     ` (2 more replies)
  2015-11-20 21:21 ` [PATCH net-next 5/6] kcm: Add statistics and proc interfaces Tom Herbert
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 21:21 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

This module implements the Kernel Connection Multiplexor.

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.

For more information see the included Documentation/networking/kcm.txt

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/linux/socket.h   |    6 +-
 include/net/kcm.h        |  121 +++
 include/uapi/linux/kcm.h |   27 +
 net/Kconfig              |    1 +
 net/Makefile             |    1 +
 net/kcm/Kconfig          |   10 +
 net/kcm/Makefile         |    3 +
 net/kcm/kcmsock.c        | 1974 ++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 2142 insertions(+), 1 deletion(-)
 create mode 100644 include/net/kcm.h
 create mode 100644 include/uapi/linux/kcm.h
 create mode 100644 net/kcm/Kconfig
 create mode 100644 net/kcm/Makefile
 create mode 100644 net/kcm/kcmsock.c

diff --git a/include/linux/socket.h b/include/linux/socket.h
index d834af2..73bf6c6 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -200,7 +200,9 @@ struct ucred {
 #define AF_ALG		38	/* Algorithm sockets		*/
 #define AF_NFC		39	/* NFC sockets			*/
 #define AF_VSOCK	40	/* vSockets			*/
-#define AF_MAX		41	/* For now.. */
+#define AF_KCM		41	/* Kernel Connection Multiplexor*/
+
+#define AF_MAX		42	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -246,6 +248,7 @@ struct ucred {
 #define PF_ALG		AF_ALG
 #define PF_NFC		AF_NFC
 #define PF_VSOCK	AF_VSOCK
+#define PF_KCM		AF_KCM
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -323,6 +326,7 @@ struct ucred {
 #define SOL_CAIF	278
 #define SOL_ALG		279
 #define SOL_NFC		280
+#define SOL_KCM		281
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/include/net/kcm.h b/include/net/kcm.h
new file mode 100644
index 0000000..4f371fe
--- /dev/null
+++ b/include/net/kcm.h
@@ -0,0 +1,121 @@
+/* Kernel Connection Multiplexor */
+
+#ifndef __NET_KCM_H_
+#define __NET_KCM_H_
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <uapi/linux/kcm.h>
+
+#ifdef __KERNEL__
+
+extern unsigned int kcm_net_id;
+
+struct kcm_tx_msg {
+	unsigned int sent;
+	unsigned int fragidx;
+	unsigned int frag_offset;
+	unsigned int msg_flags;
+	struct sk_buff *frag_skb;
+	struct sk_buff *last_skb;
+};
+
+struct kcm_rx_msg {
+	int full_len;
+	int accum_len;
+	int offset;
+};
+
+/* Socket structure for KCM client sockets */
+struct kcm_sock {
+	struct sock sk;
+	struct kcm_mux *mux;
+	struct list_head kcm_sock_list;
+	int index;
+	u32 done : 1;
+	struct work_struct done_work;
+
+	/* Transmit */
+	struct kcm_psock *tx_psock;
+	struct work_struct tx_work;
+	struct list_head wait_psock_list;
+	struct sk_buff *seq_skb;
+
+	/* Don't use bit fields here, these are set under different locks */
+	bool tx_wait;
+	bool tx_wait_more;
+
+	/* Receive */
+	struct kcm_psock *rx_psock;
+	struct list_head wait_rx_list; /* KCMs waiting to receive */
+	bool rx_wait;
+	u32 rx_disabled : 1;
+};
+
+struct bpf_prog;
+
+/* Structure for an attached lower socket */
+struct kcm_psock {
+	struct sock *sk;
+	struct kcm_mux *mux;
+	int index;
+
+	u32 tx_stopped : 1;
+	u32 rx_stopped : 1;
+	u32 done : 1;
+	u32 unattaching : 1;
+
+	void (*save_state_change)(struct sock *sk);
+	void (*save_data_ready)(struct sock *sk);
+	void (*save_write_space)(struct sock *sk);
+
+	struct list_head psock_list;
+
+	/* Receive */
+	struct sk_buff *rx_skb_head;
+	struct sk_buff **rx_skb_nextp;
+	struct sk_buff *ready_rx_msg;
+	struct list_head psock_ready_list;
+	struct work_struct rx_work;
+	struct delayed_work rx_delayed_work;
+	struct bpf_prog *bpf_prog;
+	struct kcm_sock *rx_kcm;
+
+	/* Transmit */
+	struct kcm_sock *tx_kcm;
+	struct list_head psock_avail_list;
+};
+
+/* Per net MUX list */
+struct kcm_net {
+	struct mutex mutex;
+	struct list_head mux_list;
+	int count;
+};
+
+/* Structure for a MUX */
+struct kcm_mux {
+	struct list_head kcm_mux_list;
+	struct rcu_head rcu;
+	struct kcm_net *knet;
+
+	struct list_head kcm_socks;	/* All KCM sockets on MUX */
+	int kcm_socks_cnt;		/* Total KCM socket count for MUX */
+	struct list_head psocks;	/* List of all psocks on MUX */
+	int psocks_cnt;		/* Total attached sockets */
+
+	/* Receive */
+	spinlock_t rx_lock ____cacheline_aligned_in_smp;
+	struct list_head kcm_rx_waiters; /* KCMs waiting to receive */
+	struct list_head psocks_ready;	/* List of psocks with a msg ready */
+	struct sk_buff_head rx_hold_queue;
+
+	/* Transmit */
+	spinlock_t  lock ____cacheline_aligned_in_smp;	/* TX and mux locking */
+	struct list_head psocks_avail;	/* List of available psocks */
+	struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* __NET_KCM_H_ */
diff --git a/include/uapi/linux/kcm.h b/include/uapi/linux/kcm.h
new file mode 100644
index 0000000..0c8d3ba
--- /dev/null
+++ b/include/uapi/linux/kcm.h
@@ -0,0 +1,27 @@
+#ifndef KCM_KERNEL_H
+#define KCM_KERNEL_H
+
+struct kcm_attach {
+	int fd;
+	int bpf_fd;
+};
+
+struct kcm_unattach {
+	int fd;
+};
+
+struct kcm_clone {
+	int fd;
+};
+
+#define SIOCKCMATTACH	(SIOCPROTOPRIVATE + 0)
+#define SIOCKCMUNATTACH	(SIOCPROTOPRIVATE + 1)
+#define SIOCKCMCLONE	(SIOCPROTOPRIVATE + 2)
+
+#define KCMPROTO_CONNECTED	0
+
+/* Socket options */
+#define KCM_RECV_DISABLE	1
+
+#endif
+
diff --git a/net/Kconfig b/net/Kconfig
index 127da94..b8439e6 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -351,6 +351,7 @@ source "net/can/Kconfig"
 source "net/irda/Kconfig"
 source "net/bluetooth/Kconfig"
 source "net/rxrpc/Kconfig"
+source "net/kcm/Kconfig"
 
 config FIB_RULES
 	bool
diff --git a/net/Makefile b/net/Makefile
index a5d0409..81d1411 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_IRDA)		+= irda/
 obj-$(CONFIG_BT)		+= bluetooth/
 obj-$(CONFIG_SUNRPC)		+= sunrpc/
 obj-$(CONFIG_AF_RXRPC)		+= rxrpc/
+obj-$(CONFIG_AF_KCM)		+= kcm/
 obj-$(CONFIG_ATM)		+= atm/
 obj-$(CONFIG_L2TP)		+= l2tp/
 obj-$(CONFIG_DECNET)		+= decnet/
diff --git a/net/kcm/Kconfig b/net/kcm/Kconfig
new file mode 100644
index 0000000..8125ca7
--- /dev/null
+++ b/net/kcm/Kconfig
@@ -0,0 +1,10 @@
+
+config AF_KCM
+	tristate "KCM sockets"
+	depends on INET
+	select BPF_SYSCALL
+	---help---
+	  KCM (Kernel Connection Multiplexor) sockets provide a method
+	  for multiplexing a message based application protocol over
+	  kernel connections (e.g. TCP connections).
+
diff --git a/net/kcm/Makefile b/net/kcm/Makefile
new file mode 100644
index 0000000..cb525f7
--- /dev/null
+++ b/net/kcm/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_AF_KCM) += kcm.o
+
+kcm-y := kcmsock.o
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
new file mode 100644
index 0000000..061263a
--- /dev/null
+++ b/net/kcm/kcmsock.c
@@ -0,0 +1,1974 @@
+#include <linux/bpf.h>
+#include <linux/errno.h>
+#include <linux/errqueue.h>
+#include <linux/file.h>
+#include <linux/in.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/netdevice.h>
+#include <linux/poll.h>
+#include <linux/rculist.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/uaccess.h>
+#include <linux/workqueue.h>
+#include <net/kcm.h>
+#include <net/netns/generic.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <uapi/linux/kcm.h>
+
+unsigned int kcm_net_id;
+
+static struct kmem_cache *kcm_psockp __read_mostly;
+static struct kmem_cache *kcm_muxp __read_mostly;
+static struct workqueue_struct *kcm_wq;
+
+static inline struct kcm_sock *kcm_sk(const struct sock *sk)
+{
+	return (struct kcm_sock *)sk;
+}
+
+static inline struct kcm_tx_msg *kcm_tx_msg(struct sk_buff *skb)
+{
+	return (struct kcm_tx_msg *)skb->cb;
+}
+
+static inline struct kcm_rx_msg *kcm_rx_msg(struct sk_buff *skb)
+{
+	return (struct kcm_rx_msg *)skb->cb;
+}
+
+static void report_csk_error(struct sock *csk, int err)
+{
+	csk->sk_err = EPIPE;
+	csk->sk_error_report(csk);
+}
+
+/* Callback lock held */
+static void kcm_abort_rx_psock(struct kcm_psock *psock, int err,
+			       struct sk_buff *skb)
+{
+	struct sock *csk = psock->sk;
+
+	/* Unrecoverable error in receive */
+
+	if (psock->rx_stopped)
+		return;
+
+	psock->rx_stopped = 1;
+
+	/* Report an error on the lower socket */
+	report_csk_error(csk, err);
+}
+
+static void kcm_abort_tx_psock(struct kcm_psock *psock, int err,
+			       bool wakeup_kcm)
+{
+	struct sock *csk = psock->sk;
+	struct kcm_mux *mux = psock->mux;
+
+	/* Unrecoverable error in transmit */
+
+	spin_lock_bh(&mux->lock);
+
+	if (psock->tx_stopped) {
+		spin_unlock_bh(&mux->lock);
+		return;
+	}
+
+	psock->tx_stopped = 1;
+
+	if (!psock->tx_kcm) {
+		/* Take off psocks_avail list */
+		list_del(&psock->psock_avail_list);
+	} else if (wakeup_kcm) {
+		/* In this case psock is being aborted while outside of
+		 * write_msgs and psock is reserved. Schedule tx_work
+		 * to handle the failure there. Need to commit tx_stopped
+		 * before queuing work.
+		 */
+		smp_mb();
+
+		queue_work(kcm_wq, &psock->tx_kcm->tx_work);
+	}
+
+	spin_unlock_bh(&mux->lock);
+
+	/* Report error on lower socket */
+	report_csk_error(csk, err);
+}
+
+static int kcm_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+
+/* KCM is ready to receive messages on its queue-- either the KCM is new or
+ * has become unblocked after being blocked on full socket buffer. Queue any
+ * pending ready messages on a psock. RX mux lock held.
+ */
+static void kcm_rcv_ready(struct kcm_sock *kcm)
+{
+	struct kcm_mux *mux = kcm->mux;
+	struct kcm_psock *psock;
+	struct sk_buff *skb;
+
+	if (unlikely(kcm->rx_wait || kcm->rx_psock || kcm->rx_disabled))
+		return;
+
+	while (unlikely((skb = __skb_dequeue(&mux->rx_hold_queue)))) {
+		if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
+			/* Assuming buffer limit has been reached */
+			skb_queue_head(&mux->rx_hold_queue, skb);
+			return;
+		}
+	}
+
+	while (!list_empty(&mux->psocks_ready)) {
+		psock = list_first_entry(&mux->psocks_ready, struct kcm_psock,
+					 psock_ready_list);
+
+		if (kcm_queue_rcv_skb(&kcm->sk, psock->ready_rx_msg)) {
+			/* Assuming buffer limit has been reached */
+			return;
+		}
+
+		/* Consumed the ready message on the psock. Schedule rx_work to
+		 * get more messages.
+		 */
+		list_del(&psock->psock_ready_list);
+		psock->ready_rx_msg = NULL;
+
+		/* Commit clearing of ready_rx_msg for queuing work */
+		smp_mb();
+
+		queue_work(kcm_wq, &psock->rx_work);
+	}
+
+	/* Buffer limit is okay now, add to ready list */
+	list_add_tail(&kcm->wait_rx_list,
+		      &kcm->mux->kcm_rx_waiters);
+	kcm->rx_wait = true;
+}
+
+static void kcm_rfree(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	struct kcm_sock *kcm = kcm_sk(sk);
+	struct kcm_mux *mux = kcm->mux;
+	unsigned int len = skb->truesize;
+
+	sk_mem_uncharge(sk, len);
+	atomic_sub(len, &sk->sk_rmem_alloc);
+
+	/* For reading rx_wait and rx_psock without holding lock */
+	smp_mb__after_atomic();
+
+	if (!kcm->rx_wait && !kcm->rx_psock &&
+	    sk_rmem_alloc_get(sk) < sk->sk_rcvlowat) {
+		spin_lock_bh(&mux->rx_lock);
+		kcm_rcv_ready(kcm);
+		spin_unlock_bh(&mux->rx_lock);
+	}
+}
+
+static int kcm_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct sk_buff_head *list = &sk->sk_receive_queue;
+
+	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
+		return -ENOMEM;
+
+	if (!sk_rmem_schedule(sk, skb, skb->truesize))
+		return -ENOBUFS;
+
+	skb->dev = NULL;
+
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = kcm_rfree;
+	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
+	sk_mem_charge(sk, skb->truesize);
+
+	skb_queue_tail(list, skb);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_data_ready(sk);
+
+	return 0;
+}
+
+/* Requeue received messages for a kcm socket to other kcm sockets. This is
+ * called when a kcm socket is receive disabled.
+ * RX mux lock held.
+ */
+static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head)
+{
+	struct sk_buff *skb;
+	struct kcm_sock *kcm;
+
+	while ((skb = __skb_dequeue(head))) {
+		/* Reset destructor to avoid calling kcm_rcv_ready */
+		skb->destructor = sock_rfree;
+		skb_orphan(skb);
+try_again:
+		if (list_empty(&mux->kcm_rx_waiters)) {
+			skb_queue_tail(&mux->rx_hold_queue, skb);
+			continue;
+		}
+
+		kcm = list_first_entry(&mux->kcm_rx_waiters,
+				       struct kcm_sock, wait_rx_list);
+
+		if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
+			/* Should mean socket buffer full */
+			list_del(&kcm->wait_rx_list);
+			kcm->rx_wait = false;
+
+			/* Commit rx_wait to read in kcm_free */
+			smp_wmb();
+
+			goto try_again;
+		}
+	}
+}
+
+/* Lower sock lock held */
+static struct kcm_sock *reserve_rx_kcm(struct kcm_psock *psock,
+				       struct sk_buff *head)
+{
+	struct kcm_mux *mux = psock->mux;
+	struct kcm_sock *kcm;
+
+	WARN_ON(psock->ready_rx_msg);
+
+	if (psock->rx_kcm)
+		return psock->rx_kcm;
+
+	spin_lock_bh(&mux->rx_lock);
+
+	if (psock->rx_kcm) {
+		spin_unlock_bh(&mux->rx_lock);
+		return psock->rx_kcm;
+	}
+
+	if (list_empty(&mux->kcm_rx_waiters)) {
+		psock->ready_rx_msg = head;
+		list_add_tail(&psock->psock_ready_list,
+			      &mux->psocks_ready);
+		spin_unlock_bh(&mux->rx_lock);
+		return NULL;
+	}
+
+	kcm = list_first_entry(&mux->kcm_rx_waiters,
+			       struct kcm_sock, wait_rx_list);
+	list_del(&kcm->wait_rx_list);
+	kcm->rx_wait = false;
+
+	psock->rx_kcm = kcm;
+	kcm->rx_psock = psock;
+
+	spin_unlock_bh(&mux->rx_lock);
+
+	return kcm;
+}
+
+static void kcm_done(struct kcm_sock *kcm);
+
+static void kcm_done_work(struct work_struct *w)
+{
+	kcm_done(container_of(w, struct kcm_sock, done_work));
+}
+
+/* Lower sock held */
+static void unreserve_rx_kcm(struct kcm_psock *psock,
+			     bool rcv_ready)
+{
+	struct kcm_sock *kcm = psock->rx_kcm;
+	struct kcm_mux *mux = psock->mux;
+
+	if (!kcm)
+		return;
+
+	spin_lock_bh(&mux->rx_lock);
+
+	psock->rx_kcm = NULL;
+	kcm->rx_psock = NULL;
+
+	if (unlikely(kcm->done)) {
+		spin_unlock_bh(&mux->rx_lock);
+
+		/* Need to run kcm_done in a task since we need to acquire
+		 * callback locks which may already be held here.
+		 */
+		INIT_WORK(&kcm->done_work, kcm_done_work);
+		schedule_work(&kcm->done_work);
+		return;
+	}
+
+	if (unlikely(kcm->rx_disabled)) {
+		requeue_rx_msgs(mux, &kcm->sk.sk_receive_queue);
+	} else if (rcv_ready || unlikely(!sk_rmem_alloc_get(&kcm->sk))) {
+		/* Check for degenerative race with rx_wait that all
+		 * data was dequeued (accounted for in kcm_rfree).
+		 */
+		kcm_rcv_ready(kcm);
+	}
+	spin_unlock_bh(&mux->rx_lock);
+}
+
+/* Macro to invoke filter function. */
+#define KCM_RUN_FILTER(prog, ctx) \
+	(*prog->bpf_func)(ctx, prog->insnsi)
+
+/* Lower socket lock held */
+static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
+			unsigned int offset, size_t orig_len)
+{
+	struct kcm_psock *psock = (struct kcm_psock *)desc->arg.data;
+	struct kcm_rx_msg *rxm;
+	struct kcm_sock *kcm;
+	struct sk_buff *head, *skb;
+	size_t eaten = 0;
+	ssize_t extra;
+	int err;
+
+	if (psock->ready_rx_msg)
+		return 0;
+
+	head = psock->rx_skb_head;
+	if (head && !psock->rx_skb_nextp) {
+		int err;
+
+		/* We are going to append to the frags_list of head. Need to
+		 * unshare the skbuff data as well as all the skbs on the
+		 * frag_list (if there are any). We deferred this work in hopes
+		 * that the original skbuff was consumed by the stack so that
+		 * there is less work needed here.
+		 */
+		if (unlikely(skb_shinfo(head)->frag_list)) {
+			if (WARN_ON(head->next)) {
+				desc->error = -EINVAL;
+				return 0;
+			}
+
+			skb = alloc_skb(0, GFP_ATOMIC);
+			if (!skb) {
+				desc->error = -ENOMEM;
+				return 0;
+			}
+			skb->len = head->len;
+			skb->data_len = head->len;
+			skb->truesize = head->truesize;
+			*kcm_rx_msg(skb) = *kcm_rx_msg(head);
+			psock->rx_skb_nextp = &head->next;
+			skb_shinfo(skb)->frag_list = head;
+			psock->rx_skb_head = skb;
+			head = skb;
+		} else {
+			err = skb_unclone(head, GFP_ATOMIC);
+			if (err) {
+				desc->error = err;
+				return 0;
+			}
+			psock->rx_skb_nextp =
+			    &skb_shinfo(head)->frag_list;
+		}
+	}
+
+	while (eaten < orig_len) {
+		/* Always clone since we will consume something */
+		skb = skb_clone(orig_skb, GFP_ATOMIC);
+		if (!skb) {
+			desc->error = -ENOMEM;
+			break;
+		}
+
+		if (!pskb_pull(skb, offset + eaten)) {
+			kfree_skb(skb);
+			desc->error = -ENOMEM;
+			break;
+		}
+
+		if (WARN_ON(skb->len < orig_len - eaten)) {
+			kfree_skb(skb);
+			desc->error = -EINVAL;
+			break;
+		}
+
+		/* Need to trim, should be rare? */
+		err = pskb_trim(skb, orig_len - eaten);
+		if (err) {
+			kfree_skb(skb);
+			desc->error = err;
+			break;
+		}
+
+		/* Preliminary */
+		eaten += skb->len;
+
+		head = psock->rx_skb_head;
+		if (!head) {
+			head = skb;
+			psock->rx_skb_head = head;
+			/* Will set rx_skb_nextp on next packet if needed */
+			psock->rx_skb_nextp = NULL;
+			rxm = kcm_rx_msg(head);
+			memset(rxm, 0, sizeof(*rxm));
+			rxm->accum_len = head->len;
+		} else {
+			rxm = kcm_rx_msg(head);
+			*psock->rx_skb_nextp = skb;
+			psock->rx_skb_nextp = &skb->next;
+			rxm->accum_len += skb->len;
+			head->data_len += skb->len;
+			head->len += skb->len;
+			head->truesize += skb->truesize;
+		}
+
+		if (!rxm->full_len) {
+			ssize_t len;
+
+			len = KCM_RUN_FILTER(psock->bpf_prog, head);
+
+			if (!len) {
+				/* Need more header to determine length */
+				break;
+			} else if (len <= head->len - skb->len) {
+				/* Length must be into new skb (and also
+				 * greater than zero)
+				 */
+				desc->error = -EPROTO;
+				psock->rx_skb_head = NULL;
+				kcm_abort_rx_psock(psock, EPROTO, head);
+				break;
+			}
+
+			rxm->full_len = len;
+		}
+
+		extra = (ssize_t)rxm->accum_len - rxm->full_len;
+
+		if (extra < 0) {
+			/* Message not complete yet. */
+			break;
+		} else if (extra > 0) {
+			/* More bytes than needed for the message */
+
+			WARN_ON(extra > skb->len);
+
+			/* We don't bother calling pskb_trim here. The skbuff
+			 * holds the full message size which is used to
+			 * copy data out.
+			 */
+
+			eaten -= extra;
+		}
+
+		/* Hurray, we have a new message! */
+		psock->rx_skb_head = NULL;
+
+try_queue:
+		kcm = reserve_rx_kcm(psock, head);
+		if (!kcm) {
+			/* Unable to reserve a KCM, message is held in psock. */
+			break;
+		}
+
+		if (kcm_queue_rcv_skb(&kcm->sk, head)) {
+			/* Should mean socket buffer full */
+			unreserve_rx_kcm(psock, false);
+			goto try_queue;
+		}
+	}
+
+	return eaten;
+}
+
+/* Called with lock held on lower socket */
+static int psock_tcp_read_sock(struct kcm_psock *psock)
+{
+	read_descriptor_t desc;
+
+	desc.arg.data = psock;
+	desc.error = 0;
+	desc.count = 1; /* give more than one skb per call */
+
+	/* sk should be locked here, so okay to do tcp_read_sock */
+	tcp_read_sock(psock->sk, &desc, kcm_tcp_recv);
+
+	unreserve_rx_kcm(psock, true);
+
+	return desc.error;
+}
+
+/* Lower sock lock held */
+static void psock_tcp_data_ready(struct sock *sk)
+{
+	struct kcm_psock *psock;
+
+	read_lock_bh(&sk->sk_callback_lock);
+
+	psock = (struct kcm_psock *)sk->sk_user_data;
+	if (unlikely(!psock || psock->rx_stopped))
+		goto out;
+
+	if (psock->ready_rx_msg)
+		goto out;
+
+	if (psock_tcp_read_sock(psock) == -ENOMEM)
+		queue_delayed_work(kcm_wq, &psock->rx_delayed_work, 0);
+
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void do_psock_rx_work(struct kcm_psock *psock)
+{
+	read_descriptor_t rd_desc;
+	struct sock *csk = psock->sk;
+
+	/* We need the read lock to synchronize with psock_tcp_data_ready. We
+	 * need the socket lock for calling tcp_read_sock.
+	 */
+	lock_sock(csk);
+	read_lock_bh(&csk->sk_callback_lock);
+
+	if (unlikely(csk->sk_user_data != psock))
+		goto out;
+
+	if (unlikely(psock->rx_stopped))
+		goto out;
+
+	if (psock->ready_rx_msg)
+		goto out;
+
+	rd_desc.arg.data = psock;
+
+	if (psock_tcp_read_sock(psock) == -ENOMEM)
+		queue_delayed_work(kcm_wq, &psock->rx_delayed_work, 0);
+
+out:
+	read_unlock_bh(&csk->sk_callback_lock);
+	release_sock(csk);
+}
+
+static void psock_rx_work(struct work_struct *w)
+{
+	do_psock_rx_work(container_of(w, struct kcm_psock, rx_work));
+}
+
+static void psock_rx_delayed_work(struct work_struct *w)
+{
+	do_psock_rx_work(container_of(w, struct kcm_psock,
+				      rx_delayed_work.work));
+}
+
+static void psock_tcp_state_change(struct sock *sk)
+{
+	/* TCP only does a POLLIN for a half close. Do a POLLHUP here
+	 * since application will normally not poll with POLLIN
+	 * on the TCP sockets.
+	 */
+
+	report_csk_error(sk, EPIPE);
+}
+
+static void psock_tcp_write_space(struct sock *sk)
+{
+	struct kcm_psock *psock;
+	struct kcm_mux *mux;
+	struct kcm_sock *kcm;
+
+	read_lock_bh(&sk->sk_callback_lock);
+
+	psock = (struct kcm_psock *)sk->sk_user_data;
+	if (unlikely(!psock))
+		goto out;
+
+	mux = psock->mux;
+
+	spin_lock_bh(&mux->lock);
+
+	/* If the socket is reserved, a KCM socket is waiting to send on it */
+	kcm = psock->tx_kcm;
+	if (kcm)
+		queue_work(kcm_wq, &kcm->tx_work);
+
+	spin_unlock_bh(&mux->lock);
+out:
+	read_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void unreserve_psock(struct kcm_sock *kcm);
+
+/* kcm sock is locked. */
+static struct kcm_psock *reserve_psock(struct kcm_sock *kcm)
+{
+	struct kcm_mux *mux = kcm->mux;
+	struct kcm_psock *psock;
+
+	psock = kcm->tx_psock;
+
+	smp_rmb(); /* Must read tx_psock before tx_wait */
+
+	if (psock) {
+		WARN_ON(kcm->tx_wait);
+		if (unlikely(psock->tx_stopped))
+			unreserve_psock(kcm);
+		else
+			return kcm->tx_psock;
+	}
+
+	spin_lock_bh(&mux->lock);
+
+	/* Check again under lock to see if psock was reserved for this
+	 * psock via psock_unreserve.
+	 */
+	psock = kcm->tx_psock;
+	if (unlikely(psock)) {
+		WARN_ON(kcm->tx_wait);
+		spin_unlock_bh(&mux->lock);
+		return kcm->tx_psock;
+	}
+
+	if (!list_empty(&mux->psocks_avail)) {
+		psock = list_first_entry(&mux->psocks_avail,
+					 struct kcm_psock,
+					 psock_avail_list);
+		list_del(&psock->psock_avail_list);
+		if (kcm->tx_wait) {
+			list_del(&kcm->wait_psock_list);
+			kcm->tx_wait = false;
+		}
+		kcm->tx_psock = psock;
+		psock->tx_kcm = kcm;
+	} else if (!kcm->tx_wait) {
+		list_add_tail(&kcm->wait_psock_list,
+			      &mux->kcm_tx_waiters);
+		kcm->tx_wait = true;
+	}
+
+	spin_unlock_bh(&mux->lock);
+
+	return psock;
+}
+
+/* mux lock held */
+static void psock_now_avail(struct kcm_psock *psock)
+{
+	struct kcm_mux *mux = psock->mux;
+	struct kcm_sock *kcm;
+
+	if (list_empty(&mux->kcm_tx_waiters)) {
+		list_add_tail(&psock->psock_avail_list,
+			      &mux->psocks_avail);
+	} else {
+		kcm = list_first_entry(&mux->kcm_tx_waiters,
+				       struct kcm_sock,
+				       wait_psock_list);
+		list_del(&kcm->wait_psock_list);
+		kcm->tx_wait = false;
+		psock->tx_kcm = kcm;
+
+		/* Commit before changing tx_psock since that is read in
+		 * reserve_psock before queuing work.
+		 */
+		smp_mb();
+
+		kcm->tx_psock = psock;
+		queue_work(kcm_wq, &kcm->tx_work);
+	}
+}
+
+/* kcm sock is locked. */
+static void unreserve_psock(struct kcm_sock *kcm)
+{
+	struct kcm_psock *psock;
+	struct kcm_mux *mux = kcm->mux;
+
+	spin_lock_bh(&mux->lock);
+
+	psock = kcm->tx_psock;
+
+	if (WARN_ON(!psock)) {
+		spin_unlock_bh(&mux->lock);
+		return;
+	}
+
+	smp_rmb(); /* Read tx_psock before tx_wait */
+
+	WARN_ON(kcm->tx_wait);
+
+	kcm->tx_psock = NULL;
+	psock->tx_kcm = NULL;
+
+	if (unlikely(psock->tx_stopped)) {
+		if (psock->done) {
+			/* Deferred free */
+			list_del(&psock->psock_list);
+			mux->psocks_cnt--;
+			sock_put(psock->sk);
+			fput(psock->sk->sk_socket->file);
+			kmem_cache_free(kcm_psockp, psock);
+		}
+
+		/* Don't put back on available list */
+
+		spin_unlock_bh(&mux->lock);
+
+		return;
+	}
+
+	psock_now_avail(psock);
+
+	spin_unlock_bh(&mux->lock);
+}
+
+/* Write any messages ready on the kcm socket.  Called with kcm sock lock
+ * held.  Return bytes actually sent or error.
+ */
+static int kcm_write_msgs(struct kcm_sock *kcm)
+{
+	struct sock *sk = &kcm->sk;
+	struct kcm_psock *psock;
+	struct sk_buff *skb, *head;
+	struct kcm_tx_msg *txm;
+	unsigned short fragidx, frag_offset;
+	unsigned int sent, total_sent = 0;
+	int ret = 0;
+
+	kcm->tx_wait_more = false;
+	psock = kcm->tx_psock;
+	if (unlikely(psock && psock->tx_stopped)) {
+		/* A reserved psock was aborted asynchronously. Unreserve
+		 * it and we'll retry the message.
+		 */
+		unreserve_psock(kcm);
+		if (skb_queue_empty(&sk->sk_write_queue))
+			return 0;
+
+		kcm_tx_msg(skb_peek(&sk->sk_write_queue))->sent = 0;
+
+	} else if (skb_queue_empty(&sk->sk_write_queue)) {
+		return 0;
+	}
+
+	head = skb_peek(&sk->sk_write_queue);
+	txm = kcm_tx_msg(head);
+
+	if (txm->sent) {
+		/* Send of first skbuff in queue already in progress */
+		if (WARN_ON(!psock)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		sent = txm->sent;
+		frag_offset = txm->frag_offset;
+		fragidx = txm->fragidx;
+		skb = txm->frag_skb;
+
+		goto do_frag;
+	}
+
+try_again:
+	psock = reserve_psock(kcm);
+	if (!psock)
+		goto out;
+
+	do {
+		skb = head;
+		txm = kcm_tx_msg(head);
+		sent = 0;
+
+do_frag_list:
+		if (WARN_ON(!skb_shinfo(skb)->nr_frags)) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		for (fragidx = 0; fragidx < skb_shinfo(skb)->nr_frags;
+		     fragidx++) {
+			skb_frag_t *frag;
+
+			frag_offset = 0;
+do_frag:
+			frag = &skb_shinfo(skb)->frags[fragidx];
+			if (WARN_ON(!frag->size)) {
+				ret = -EINVAL;
+				goto out;
+			}
+
+			ret = kernel_sendpage(psock->sk->sk_socket,
+					      frag->page.p,
+					      frag->page_offset + frag_offset,
+					      frag->size - frag_offset,
+					      MSG_DONTWAIT);
+			if (ret <= 0) {
+				if (ret == -EAGAIN) {
+					/* Save state to try again when there's
+					 * write space on the socket
+					 */
+					txm->sent = sent;
+					txm->frag_offset = frag_offset;
+					txm->fragidx = fragidx;
+					txm->frag_skb = skb;
+
+					ret = 0;
+					goto out;
+				}
+
+				/* Hard failure in sending message, abort this
+				 * psock since it has lost framing
+				 * synchronization and retry sending the
+				 * message from the beginning.
+				 */
+				kcm_abort_tx_psock(psock, ret ? -ret : EPIPE,
+						   true);
+				unreserve_psock(kcm);
+
+				txm->sent = 0;
+				ret = 0;
+
+				goto try_again;
+			}
+
+			sent += ret;
+			frag_offset += ret;
+			if (frag_offset < frag->size) {
+				/* Not finished with this frag */
+				goto do_frag;
+			}
+		}
+
+		if (skb == head) {
+			if (skb_has_frag_list(skb)) {
+				skb = skb_shinfo(skb)->frag_list;
+				goto do_frag_list;
+			}
+		} else if (skb->next) {
+			skb = skb->next;
+			goto do_frag_list;
+		}
+
+		/* Successfully sent the whole packet, account for it. */
+		skb_dequeue(&sk->sk_write_queue);
+		kfree_skb(head);
+		sk->sk_wmem_queued -= sent;
+		total_sent += sent;
+	} while ((head = skb_peek(&sk->sk_write_queue)));
+out:
+	if (!head) {
+		/* Done with all queued messages. */
+		WARN_ON(!skb_queue_empty(&sk->sk_write_queue));
+		unreserve_psock(kcm);
+	}
+
+	/* Check if write space is available */
+	sk->sk_write_space(sk);
+
+	return total_sent ? : ret;
+}
+
+static void kcm_tx_work(struct work_struct *w)
+{
+	struct kcm_sock *kcm = container_of(w, struct kcm_sock, tx_work);
+	struct sock *sk = &kcm->sk;
+	int err;
+
+	lock_sock(sk);
+
+	/* Primarily for SOCK_DGRAM sockets, also handle asynchronous tx
+	 * aborts
+	 */
+	err = kcm_write_msgs(kcm);
+	if (err < 0) {
+		/* Hard failure in write, report error on KCM socket */
+		pr_warn("KCM: Hard failure on kcm_write_msgs %d\n", err);
+		report_csk_error(&kcm->sk, -err);
+		goto out;
+	}
+
+	/* Primarily for SOCK_SEQPACKET sockets */
+	if (likely(sk->sk_socket) &&
+	    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
+		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		sk->sk_write_space(sk);
+	}
+
+out:
+	release_sock(sk);
+}
+
+static void kcm_push(struct kcm_sock *kcm)
+{
+	if (kcm->tx_wait_more)
+		kcm_write_msgs(kcm);
+}
+
+static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
+{
+	struct sock *sk = sock->sk;
+	struct kcm_sock *kcm = kcm_sk(sk);
+	struct sk_buff *skb = NULL, *head = NULL;
+	size_t copy, copied = 0;
+	long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+	bool eor =  (sock->type == SOCK_DGRAM || !!(msg->msg_flags & MSG_EOR));
+	int err = 0;
+
+	lock_sock(sk);
+
+	/* Per tcp_sendmsg this should be in poll */
+	clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+
+	if (sk->sk_err)
+		goto out_error;
+
+	if (kcm->seq_skb) {
+		/* Previously opened message for SEQPACKET */
+		head = kcm->seq_skb;
+		skb = kcm_tx_msg(head)->last_skb;
+		goto start;
+	}
+
+	/* Call the sk_stream functions to manage the sndbuf mem. */
+	if (!sk_stream_memory_free(sk)) {
+		kcm_push(kcm);
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+		err = sk_stream_wait_memory(sk, &timeo);
+		if (err)
+			goto out_error;
+	}
+
+	/* New message, alloc head skb */
+	head = alloc_skb(0, sk->sk_allocation);
+	while (!head) {
+		kcm_push(kcm);
+		err = sk_stream_wait_memory(sk, &timeo);
+		if (err)
+			goto out_error;
+
+		head = alloc_skb(0, sk->sk_allocation);
+	}
+
+	skb = head;
+
+	/* Set ip_summed to CHECKSUM_UNNECESSARY to avoid calling
+	 * csum_and_copy_from_iter from skb_do_copy_data_nocache.
+	 */
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+start:
+	while (msg_data_left(msg)) {
+		bool merge = true;
+		int i = skb_shinfo(skb)->nr_frags;
+		struct page_frag *pfrag = sk_page_frag(sk);
+
+		if (!sk_page_frag_refill(sk, pfrag))
+			goto wait_for_memory;
+
+		if (!skb_can_coalesce(skb, i, pfrag->page,
+				      pfrag->offset)) {
+			if (i == MAX_SKB_FRAGS) {
+				struct sk_buff *tskb;
+
+				tskb = alloc_skb(0, sk->sk_allocation);
+				if (!tskb)
+					goto wait_for_memory;
+
+				if (head == skb)
+					skb_shinfo(head)->frag_list = tskb;
+				else
+					skb->next = tskb;
+
+				skb = tskb;
+				skb->ip_summed = CHECKSUM_UNNECESSARY;
+				continue;
+			}
+			merge = false;
+		}
+
+		copy = min_t(int, msg_data_left(msg),
+			     pfrag->size - pfrag->offset);
+
+		if (!sk_wmem_schedule(sk, copy))
+			goto wait_for_memory;
+
+		err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
+					       pfrag->page,
+					       pfrag->offset,
+					       copy);
+		if (err)
+			goto out_error;
+
+		/* Update the skb. */
+		if (merge) {
+			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+		} else {
+			skb_fill_page_desc(skb, i, pfrag->page,
+					   pfrag->offset, copy);
+			get_page(pfrag->page);
+		}
+
+		pfrag->offset += copy;
+		copied += copy;
+		if (head != skb) {
+			head->len += copy;
+			head->data_len += copy;
+		}
+
+		continue;
+
+wait_for_memory:
+		kcm_push(kcm);
+		err = sk_stream_wait_memory(sk, &timeo);
+		if (err)
+			goto out_error;
+	}
+
+	if (eor) {
+		bool not_busy = skb_queue_empty(&sk->sk_write_queue);
+
+		/* Message complete, queue it on send buffer */
+		__skb_queue_tail(&sk->sk_write_queue, head);
+		kcm->seq_skb = NULL;
+
+		if (msg->msg_flags & MSG_BATCH) {
+			kcm->tx_wait_more = true;
+		} else if (kcm->tx_wait_more || not_busy) {
+			err = kcm_write_msgs(kcm);
+			if (err < 0) {
+				/* We got a hard error in write_msgs but have
+				 * already queued this message. Report an error
+				 * in the socket, but don't affect return value
+				 * from sendmsg
+				 */
+				pr_warn("KCM: Hard failure on kcm_write_msgs\n");
+				report_csk_error(&kcm->sk, -err);
+			}
+		}
+	} else {
+		/* Message not complete, save state */
+partial_message:
+		kcm->seq_skb = head;
+		kcm_tx_msg(head)->last_skb = skb;
+	}
+
+	release_sock(sk);
+	return copied;
+
+out_error:
+	kcm_push(kcm);
+
+	if (copied && sock->type == SOCK_SEQPACKET) {
+		/* Wrote some bytes before encountering an
+		 * error, return partial success.
+		 */
+		goto partial_message;
+	}
+
+	if (head != kcm->seq_skb)
+		kfree_skb(head);
+
+	err = sk_stream_error(sk, msg->msg_flags, err);
+
+	/* make sure we wake any epoll edge trigger waiter */
+	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN))
+		sk->sk_write_space(sk);
+
+	release_sock(sk);
+	return err;
+}
+
+static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
+		       size_t len, int flags)
+{
+	struct sock *sk = sock->sk;
+	int err = 0;
+	long timeo;
+	struct kcm_rx_msg *rxm;
+	int copied = 0;
+	struct sk_buff *skb;
+
+	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+	lock_sock(sk);
+
+	while (!(skb = skb_peek(&sk->sk_receive_queue))) {
+		if (sk->sk_err) {
+			err = sock_error(sk);
+			goto out;
+		}
+
+		if (sock_flag(sk, SOCK_DONE))
+			goto out;
+
+		if ((flags & MSG_DONTWAIT) || !timeo) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		sk_wait_data(sk, &timeo, NULL);
+
+		/* Handle signals */
+		if (signal_pending(current)) {
+			err = sock_intr_errno(timeo);
+			goto out;
+		}
+	}
+
+	/* Okay, have a message on the receive queue */
+
+	rxm = kcm_rx_msg(skb);
+
+	if (len > rxm->full_len)
+		len = rxm->full_len;
+
+	err = skb_copy_datagram_msg(skb, rxm->offset, msg, len);
+	if (err < 0)
+		goto out;
+
+	copied = len;
+	if (likely(!(flags & MSG_PEEK))) {
+		if (copied < rxm->full_len) {
+			if (sock->type == SOCK_DGRAM) {
+				/* Truncated message */
+				msg->msg_flags |= MSG_TRUNC;
+				goto msg_finished;
+			}
+			rxm->offset += copied;
+			rxm->full_len -= copied;
+		} else {
+msg_finished:
+			/* Finished with message */
+			msg->msg_flags |= MSG_EOR;
+			skb_unlink(skb, &sk->sk_receive_queue);
+			kfree_skb(skb);
+		}
+	}
+
+out:
+	release_sock(sk);
+
+	return copied ? : err;
+}
+
+/* kcm sock lock held */
+static void kcm_recv_disable(struct kcm_sock *kcm)
+{
+	struct kcm_mux *mux = kcm->mux;
+
+	if (kcm->rx_disabled)
+		return;
+
+	spin_lock_bh(&mux->rx_lock);
+
+	kcm->rx_disabled = 1;
+
+	/* If a psock is reserved we'll do cleanup in unreserve */
+	if (!kcm->rx_psock) {
+		if (kcm->rx_wait) {
+			list_del(&kcm->wait_rx_list);
+			kcm->rx_wait = false;
+		}
+
+		requeue_rx_msgs(mux, &kcm->sk.sk_receive_queue);
+	}
+
+	spin_unlock_bh(&mux->rx_lock);
+}
+
+/* kcm sock lock held */
+static void kcm_recv_enable(struct kcm_sock *kcm)
+{
+	struct kcm_mux *mux = kcm->mux;
+
+	if (!kcm->rx_disabled)
+		return;
+
+	spin_lock_bh(&mux->rx_lock);
+
+	kcm->rx_disabled = 0;
+	kcm_rcv_ready(kcm);
+
+	spin_unlock_bh(&mux->rx_lock);
+}
+
+static int kcm_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	struct kcm_sock *kcm = kcm_sk(sock->sk);
+	int val, valbool;
+	int err = 0;
+
+	if (level != SOL_KCM)
+		return -ENOPROTOOPT;
+
+	if (optlen < sizeof(int))
+		return -EINVAL;
+
+	if (get_user(val, (int __user *)optval))
+		return -EINVAL;
+
+	valbool = val ? 1 : 0;
+
+	switch (optname) {
+	case KCM_RECV_DISABLE:
+		lock_sock(&kcm->sk);
+		if (valbool)
+			kcm_recv_disable(kcm);
+		else
+			kcm_recv_enable(kcm);
+		release_sock(&kcm->sk);
+		break;
+	default:
+		err = -ENOPROTOOPT;
+	}
+
+	return err;
+}
+
+static int kcm_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct kcm_sock *kcm = kcm_sk(sock->sk);
+	int val, len;
+
+	if (level != SOL_KCM)
+		return -ENOPROTOOPT;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+
+	len = min_t(unsigned int, len, sizeof(int));
+	if (len < 0)
+		return -EINVAL;
+
+	switch (optname) {
+	case KCM_RECV_DISABLE:
+		val = kcm->rx_disabled;
+		break;
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	if (put_user(len, optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, &val, len))
+		return -EFAULT;
+	return 0;
+}
+
+static void init_kcm_sock(struct kcm_sock *kcm, struct kcm_mux *mux)
+{
+	struct kcm_sock *tkcm;
+	struct list_head *head;
+	int index = 0;
+
+	/* For SOCK_SEQPACKET sock type, datagram_poll checks the sk_state, so
+	 * we set sk_state, otherwise epoll_wait always returns right away with
+	 * POLLHUP
+	 */
+	kcm->sk.sk_state = TCP_ESTABLISHED;
+
+	/* Add to mux's kcm sockets list */
+	kcm->mux = mux;
+	spin_lock_bh(&mux->lock);
+
+	head = &mux->kcm_socks;
+	list_for_each_entry(tkcm, &mux->kcm_socks, kcm_sock_list) {
+		if (tkcm->index != index)
+			break;
+		head = &tkcm->kcm_sock_list;
+		index++;
+	}
+
+	list_add(&kcm->kcm_sock_list, head);
+	kcm->index = index;
+
+	mux->kcm_socks_cnt++;
+	spin_unlock_bh(&mux->lock);
+
+	INIT_WORK(&kcm->tx_work, kcm_tx_work);
+
+	spin_lock_bh(&mux->rx_lock);
+	kcm_rcv_ready(kcm);
+	spin_unlock_bh(&mux->rx_lock);
+}
+
+static int kcm_attach(struct socket *sock, struct socket *csock,
+		      struct bpf_prog *prog)
+{
+	struct kcm_sock *kcm = kcm_sk(sock->sk);
+	struct kcm_mux *mux = kcm->mux;
+	struct sock *csk;
+	struct kcm_psock *psock = NULL, *tpsock;
+	struct list_head *head;
+	int index = 0;
+
+	if (csock->ops->family != PF_INET &&
+	    csock->ops->family != PF_INET6)
+		return -EINVAL;
+
+	csk = csock->sk;
+	if (!csk)
+		return -EINVAL;
+
+	/* Only support TCP for now */
+	if (csk->sk_protocol != IPPROTO_TCP)
+		return -EINVAL;
+
+	psock = kmem_cache_zalloc(kcm_psockp, GFP_KERNEL);
+	if (!psock)
+		return -ENOMEM;
+
+	psock->mux = mux;
+	psock->sk = csk;
+	psock->bpf_prog = prog;
+	INIT_WORK(&psock->rx_work, psock_rx_work);
+	INIT_DELAYED_WORK(&psock->rx_delayed_work, psock_rx_delayed_work);
+
+	sock_hold(csk);
+
+	write_lock_bh(&csk->sk_callback_lock);
+	psock->save_data_ready = csk->sk_data_ready;
+	psock->save_write_space = csk->sk_write_space;
+	psock->save_state_change = csk->sk_state_change;
+	csk->sk_user_data = psock;
+	csk->sk_data_ready = psock_tcp_data_ready;
+	csk->sk_write_space = psock_tcp_write_space;
+	csk->sk_state_change = psock_tcp_state_change;
+	write_unlock_bh(&csk->sk_callback_lock);
+
+	/* Finished initialization, now add the psock to the MUX. */
+	spin_lock_bh(&mux->lock);
+	head = &mux->psocks;
+	list_for_each_entry(tpsock, &mux->psocks, psock_list) {
+		if (tpsock->index != index)
+			break;
+		head = &tpsock->psock_list;
+		index++;
+	}
+
+	list_add(&psock->psock_list, head);
+	psock->index = index;
+
+	mux->psocks_cnt++;
+	psock_now_avail(psock);
+	spin_unlock_bh(&mux->lock);
+
+	/* Schedule RX work in case there are already bytes queued */
+	queue_work(kcm_wq, &psock->rx_work);
+
+	return 0;
+}
+
+static int kcm_attach_ioctl(struct socket *sock, struct kcm_attach *info)
+{
+	struct socket *csock;
+	struct bpf_prog *prog;
+	int err;
+
+	csock = sockfd_lookup(info->fd, &err);
+	if (!csock)
+		return -ENOENT;
+
+	prog = bpf_prog_get(info->bpf_fd);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto out;
+	}
+
+	if (prog->type != BPF_PROG_TYPE_SOCKET_FILTER) {
+		bpf_prog_put(prog);
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = kcm_attach(sock, csock, prog);
+	if (err) {
+		bpf_prog_put(prog);
+		goto out;
+	}
+
+	/* Keep the file reference taken by sockfd_lookup; it is
+	 * dropped in kcm_unattach.
+	 */
+
+	return 0;
+out:
+	fput(csock->file);
+	return err;
+}
+
+static void kcm_unattach(struct kcm_psock *psock)
+{
+	struct sock *csk = psock->sk;
+	struct kcm_mux *mux = psock->mux;
+
+	/* Stop getting callbacks from TCP socket. After this there should
+	 * be no way to reserve a kcm for this psock.
+	 */
+	write_lock_bh(&csk->sk_callback_lock);
+	csk->sk_user_data = NULL;
+	csk->sk_data_ready = psock->save_data_ready;
+	csk->sk_write_space = psock->save_write_space;
+	csk->sk_state_change = psock->save_state_change;
+	psock->rx_stopped = 1;
+
+	if (WARN_ON(psock->rx_kcm)) {
+		write_unlock_bh(&csk->sk_callback_lock);
+		return;
+	}
+
+	spin_lock_bh(&mux->rx_lock);
+
+	/* Stop receiver activities. After this point psock should not be
+	 * able to get onto ready list either through callbacks or work.
+	 */
+	if (psock->ready_rx_msg) {
+		list_del(&psock->psock_ready_list);
+		kfree_skb(psock->ready_rx_msg);
+		psock->ready_rx_msg = NULL;
+	}
+
+	spin_unlock_bh(&mux->rx_lock);
+
+	write_unlock_bh(&csk->sk_callback_lock);
+
+	cancel_work_sync(&psock->rx_work);
+	cancel_delayed_work_sync(&psock->rx_delayed_work);
+
+	bpf_prog_put(psock->bpf_prog);
+
+	kfree_skb(psock->rx_skb_head);
+	psock->rx_skb_head = NULL;
+
+	spin_lock_bh(&mux->lock);
+
+	if (psock->tx_kcm) {
+		/* psock was reserved. Just mark it finished and we will
+		 * clean up in the kcm paths; we need the kcm lock, which
+		 * cannot be acquired here.
+		 */
+		spin_unlock_bh(&mux->lock);
+
+		/* We are unattaching a socket that is reserved. Abort the
+		 * socket since we may be out of sync in sending on it. We need
+		 * to do this without the mux lock.
+		 */
+		kcm_abort_tx_psock(psock, EPIPE, false);
+
+		spin_lock_bh(&mux->lock);
+		if (!psock->tx_kcm) {
+			/* psock was unreserved in the window the mux was
+			 * unlocked
+			 */
+			goto no_reserved;
+		}
+		psock->done = 1;
+
+		/* Commit done before queuing work to process it */
+		smp_mb();
+
+		/* Queue tx work to make sure psock->done is handled */
+		queue_work(kcm_wq, &psock->tx_kcm->tx_work);
+		spin_unlock_bh(&mux->lock);
+	} else {
+no_reserved:
+		if (!psock->tx_stopped)
+			list_del(&psock->psock_avail_list);
+		list_del(&psock->psock_list);
+		mux->psocks_cnt--;
+		spin_unlock_bh(&mux->lock);
+
+		sock_put(csk);
+		fput(csk->sk_socket->file);
+		kmem_cache_free(kcm_psockp, psock);
+	}
+}
+
+static int kcm_unattach_ioctl(struct socket *sock, struct kcm_unattach *info)
+{
+	struct kcm_sock *kcm = kcm_sk(sock->sk);
+	struct kcm_mux *mux = kcm->mux;
+	struct kcm_psock *psock;
+	struct socket *csock;
+	struct sock *csk;
+	int err;
+
+	csock = sockfd_lookup(info->fd, &err);
+	if (!csock)
+		return -ENOENT;
+
+	csk = csock->sk;
+	if (!csk) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = -ENOENT;
+
+	spin_lock_bh(&mux->lock);
+
+	list_for_each_entry(psock, &mux->psocks, psock_list) {
+		if (psock->sk != csk)
+			continue;
+
+		/* Found the matching psock */
+
+		if (psock->unattaching || WARN_ON(psock->done)) {
+			err = -EALREADY;
+			break;
+		}
+
+		psock->unattaching = 1;
+
+		spin_unlock_bh(&mux->lock);
+
+		kcm_unattach(psock);
+
+		err = 0;
+		goto out;
+	}
+
+	spin_unlock_bh(&mux->lock);
+
+out:
+	fput(csock->file);
+	return err;
+}
+
+static struct proto kcm_proto = {
+	.name	= "KCM",
+	.owner	= THIS_MODULE,
+	.obj_size = sizeof(struct kcm_sock),
+};
+
+/* Clone a kcm socket. */
+static int kcm_clone(struct socket *osock, struct kcm_clone *info,
+		     struct socket **newsockp)
+{
+	struct socket *newsock;
+	struct sock *newsk;
+	struct file *newfile;
+	int err, newfd;
+
+	err = -ENFILE;
+	newsock = sock_alloc();
+	if (!newsock)
+		goto out;
+
+	newsock->type = osock->type;
+	newsock->ops = osock->ops;
+
+	__module_get(newsock->ops->owner);
+
+	newfd = get_unused_fd_flags(0);
+	if (unlikely(newfd < 0)) {
+		err = newfd;
+		goto out_fd_fail;
+	}
+
+	newfile = sock_alloc_file(newsock, 0, osock->sk->sk_prot_creator->name);
+	if (unlikely(IS_ERR(newfile))) {
+		err = PTR_ERR(newfile);
+		goto out_sock_alloc_fail;
+	}
+
+	newsk = sk_alloc(sock_net(osock->sk), PF_KCM, GFP_KERNEL,
+			 &kcm_proto, true);
+	if (!newsk) {
+		err = -ENOMEM;
+		goto out_sk_alloc_fail;
+	}
+
+	sock_init_data(newsock, newsk);
+	init_kcm_sock(kcm_sk(newsk), kcm_sk(osock->sk)->mux);
+
+	fd_install(newfd, newfile);
+	*newsockp = newsock;
+	info->fd = newfd;
+
+	return 0;
+
+out_sk_alloc_fail:
+	/* fput releases the socket through the file's release op, so
+	 * do not call sock_release again on this path.
+	 */
+	fput(newfile);
+	put_unused_fd(newfd);
+	goto out;
+out_sock_alloc_fail:
+	put_unused_fd(newfd);
+out_fd_fail:
+	sock_release(newsock);
+out:
+	return err;
+}
+
+static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
+{
+	int err;
+
+	switch (cmd) {
+	case SIOCKCMATTACH: {
+		struct kcm_attach info;
+
+		if (copy_from_user(&info, (void __user *)arg, sizeof(info))) {
+			err = -EFAULT;
+			break;
+		}
+
+		err = kcm_attach_ioctl(sock, &info);
+
+		break;
+	}
+	case SIOCKCMUNATTACH: {
+		struct kcm_unattach info;
+
+		if (copy_from_user(&info, (void __user *)arg, sizeof(info))) {
+			err = -EFAULT;
+			break;
+		}
+
+		err = kcm_unattach_ioctl(sock, &info);
+
+		break;
+	}
+	case SIOCKCMCLONE: {
+		struct kcm_clone info;
+		struct socket *newsock = NULL;
+
+		if (copy_from_user(&info, (void __user *)arg, sizeof(info))) {
+			err = -EFAULT;
+			break;
+		}
+
+		err = kcm_clone(sock, &info, &newsock);
+
+		if (!err) {
+			if (copy_to_user((void __user *)arg, &info,
+					 sizeof(info))) {
+				err = -EFAULT;
+				sock_release(newsock);
+			}
+		}
+
+		break;
+	}
+	default:
+		err = -ENOIOCTLCMD;
+		break;
+	}
+
+	return err;
+}
+
+static void free_mux(struct rcu_head *rcu)
+{
+	struct kcm_mux *mux = container_of(rcu,
+	    struct kcm_mux, rcu);
+
+	kmem_cache_free(kcm_muxp, mux);
+}
+
+static void release_mux(struct kcm_mux *mux)
+{
+	struct kcm_net *knet = mux->knet;
+	struct kcm_psock *psock, *tmp_psock;
+
+	/* Release psocks */
+	list_for_each_entry_safe(psock, tmp_psock,
+				 &mux->psocks, psock_list) {
+		if (!WARN_ON(psock->unattaching))
+			kcm_unattach(psock);
+	}
+
+	if (WARN_ON(mux->psocks_cnt))
+		return;
+
+	__skb_queue_purge(&mux->rx_hold_queue);
+
+	mutex_lock(&knet->mutex);
+	list_del_rcu(&mux->kcm_mux_list);
+	knet->count--;
+	mutex_unlock(&knet->mutex);
+
+	call_rcu(&mux->rcu, free_mux);
+}
+
+static void kcm_done(struct kcm_sock *kcm)
+{
+	struct kcm_mux *mux = kcm->mux;
+	struct sock *sk = &kcm->sk;
+	int socks_cnt;
+
+	spin_lock_bh(&mux->rx_lock);
+	if (kcm->rx_psock) {
+		/* Cleanup in unreserve_rx_kcm */
+		WARN_ON(kcm->done);
+		kcm->rx_disabled = 1;
+		kcm->done = 1;
+		spin_unlock_bh(&mux->rx_lock);
+		return;
+	}
+
+	if (kcm->rx_wait) {
+		list_del(&kcm->wait_rx_list);
+		kcm->rx_wait = false;
+	}
+	/* Move any pending receive messages to other kcm sockets */
+	requeue_rx_msgs(mux, &sk->sk_receive_queue);
+
+	spin_unlock_bh(&mux->rx_lock);
+
+	if (WARN_ON(sk_rmem_alloc_get(sk)))
+		return;
+
+	/* Detach from MUX */
+	spin_lock_bh(&mux->lock);
+
+	list_del(&kcm->kcm_sock_list);
+	mux->kcm_socks_cnt--;
+	socks_cnt = mux->kcm_socks_cnt;
+
+	spin_unlock_bh(&mux->lock);
+
+	if (!socks_cnt) {
+		/* We are done with the mux now. */
+		release_mux(mux);
+	}
+
+	WARN_ON(kcm->rx_wait);
+
+	sock_put(&kcm->sk);
+}
+
+/* Release a KCM socket. If this is the last KCM socket on the MUX,
+ * destroy the MUX.
+ */
+static int kcm_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct kcm_sock *kcm;
+	struct kcm_mux *mux;
+	struct kcm_psock *psock;
+
+	if (!sk)
+		return 0;
+
+	kcm = kcm_sk(sk);
+	mux = kcm->mux;
+
+	sock_orphan(sk);
+	kfree_skb(kcm->seq_skb);
+
+	lock_sock(sk);
+	/* Purge queue under lock to avoid race condition with tx_work trying
+	 * to act when queue is nonempty. If tx_work runs after this point
+	 * it will just return.
+	 */
+	__skb_queue_purge(&sk->sk_write_queue);
+	release_sock(sk);
+
+	spin_lock_bh(&mux->lock);
+	if (kcm->tx_wait) {
+		/* Take off the tx_wait list. After this point there should
+		 * be no way that a psock will be assigned to this kcm.
+		 */
+		list_del(&kcm->wait_psock_list);
+		kcm->tx_wait = false;
+	}
+	spin_unlock_bh(&mux->lock);
+
+	/* Cancel work. After this point there should be no outside references
+	 * to the kcm socket.
+	 */
+	cancel_work_sync(&kcm->tx_work);
+
+	lock_sock(sk);
+	psock = kcm->tx_psock;
+	if (psock) {
+		/* A psock was reserved, so we need to kill it since it
+		 * may already have some bytes queued from a message. We
+		 * need to do this after removing kcm from tx_wait list.
+		 */
+		kcm_abort_tx_psock(psock, EPIPE, false);
+		unreserve_psock(kcm);
+	}
+	release_sock(sk);
+
+	WARN_ON(kcm->tx_wait);
+	WARN_ON(kcm->tx_psock);
+
+	sock->sk = NULL;
+
+	kcm_done(kcm);
+
+	return 0;
+}
+
+static const struct proto_ops kcm_ops = {
+	.family =	PF_KCM,
+	.owner =	THIS_MODULE,
+	.release =	kcm_release,
+	.bind =		sock_no_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname,
+	.poll =		datagram_poll,
+	.ioctl =	kcm_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	kcm_setsockopt,
+	.getsockopt =	kcm_getsockopt,
+	.sendmsg =	kcm_sendmsg,
+	.recvmsg =	kcm_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+/* Create proto operation for kcm sockets */
+static int kcm_create(struct net *net, struct socket *sock,
+		      int protocol, int kern)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+	struct sock *sk;
+	struct kcm_mux *mux;
+
+	switch (sock->type) {
+	case SOCK_DGRAM:
+	case SOCK_SEQPACKET:
+		sock->ops = &kcm_ops;
+		break;
+	default:
+		return -ESOCKTNOSUPPORT;
+	}
+
+	if (protocol != KCMPROTO_CONNECTED)
+		return -EPROTONOSUPPORT;
+
+	sk = sk_alloc(net, PF_KCM, GFP_KERNEL, &kcm_proto, kern);
+	if (!sk)
+		return -ENOMEM;
+
+	/* Allocate a kcm mux, shared between KCM sockets */
+	mux = kmem_cache_zalloc(kcm_muxp, GFP_KERNEL);
+	if (!mux) {
+		sk_free(sk);
+		return -ENOMEM;
+	}
+
+	spin_lock_init(&mux->lock);
+	spin_lock_init(&mux->rx_lock);
+	INIT_LIST_HEAD(&mux->kcm_socks);
+	INIT_LIST_HEAD(&mux->kcm_rx_waiters);
+	INIT_LIST_HEAD(&mux->kcm_tx_waiters);
+
+	INIT_LIST_HEAD(&mux->psocks);
+	INIT_LIST_HEAD(&mux->psocks_ready);
+	INIT_LIST_HEAD(&mux->psocks_avail);
+
+	mux->knet = knet;
+
+	/* Add new MUX to list */
+	mutex_lock(&knet->mutex);
+	list_add_rcu(&mux->kcm_mux_list, &knet->mux_list);
+	knet->count++;
+	mutex_unlock(&knet->mutex);
+
+	skb_queue_head_init(&mux->rx_hold_queue);
+
+	/* Init KCM socket */
+	sock_init_data(sock, sk);
+	init_kcm_sock(kcm_sk(sk), mux);
+
+	return 0;
+}
+
+static struct net_proto_family kcm_family_ops = {
+	.family = PF_KCM,
+	.create = kcm_create,
+	.owner  = THIS_MODULE,
+};
+
+static __net_init int kcm_init_net(struct net *net)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	INIT_LIST_HEAD_RCU(&knet->mux_list);
+	mutex_init(&knet->mutex);
+
+	return 0;
+}
+
+static __net_exit void kcm_exit_net(struct net *net)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	/* All KCM sockets should be closed at this point, which should mean
+	 * that all multiplexors and psocks have been destroyed.
+	 */
+	WARN_ON(!list_empty(&knet->mux_list));
+}
+
+static struct pernet_operations kcm_net_ops = {
+	.init = kcm_init_net,
+	.exit = kcm_exit_net,
+	.id   = &kcm_net_id,
+	.size = sizeof(struct kcm_net),
+};
+
+static int __init kcm_init(void)
+{
+	int err = -ENOMEM;
+
+	kcm_muxp = kmem_cache_create("kcm_mux_cache",
+				     sizeof(struct kcm_mux), 0,
+				     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	if (!kcm_muxp)
+		goto fail;
+
+	kcm_psockp = kmem_cache_create("kcm_psock_cache",
+				       sizeof(struct kcm_psock), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	if (!kcm_psockp)
+		goto fail;
+
+	kcm_wq = create_singlethread_workqueue("kkcmd");
+	if (!kcm_wq)
+		goto fail;
+
+	err = proto_register(&kcm_proto, 1);
+	if (err)
+		goto fail;
+
+	err = sock_register(&kcm_family_ops);
+	if (err)
+		goto sock_register_fail;
+
+	err = register_pernet_device(&kcm_net_ops);
+	if (err)
+		goto net_ops_fail;
+
+	return 0;
+
+net_ops_fail:
+	sock_unregister(PF_KCM);
+
+sock_register_fail:
+	proto_unregister(&kcm_proto);
+
+fail:
+	kmem_cache_destroy(kcm_muxp);
+	kmem_cache_destroy(kcm_psockp);
+
+	if (kcm_wq)
+		destroy_workqueue(kcm_wq);
+
+	return err;
+}
+
+static void __exit kcm_exit(void)
+{
+	unregister_pernet_device(&kcm_net_ops);
+	sock_unregister(PF_KCM);
+	proto_unregister(&kcm_proto);
+	destroy_workqueue(kcm_wq);
+
+	kmem_cache_destroy(kcm_muxp);
+	kmem_cache_destroy(kcm_psockp);
+}
+
+module_init(kcm_init);
+module_exit(kcm_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_KCM);
+
-- 
2.4.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH net-next 5/6] kcm: Add statistics and proc interfaces
  2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
                   ` (3 preceding siblings ...)
  2015-11-20 21:21 ` [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module Tom Herbert
@ 2015-11-20 21:21 ` Tom Herbert
  2015-11-20 21:22 ` [PATCH net-next 6/6] kcm: Add description in Documentation Tom Herbert
  2015-11-23  9:53 ` [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Hannes Frederic Sowa
  6 siblings, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 21:21 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, counters for the number of psock
attach and unattach events, and counters for various error or edge events.

The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 include/net/kcm.h | 102 +++++++++++++
 net/kcm/Makefile  |   2 +-
 net/kcm/kcmproc.c | 422 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/kcm/kcmsock.c |  66 +++++++++
 4 files changed, 591 insertions(+), 1 deletion(-)
 create mode 100644 net/kcm/kcmproc.c

diff --git a/include/net/kcm.h b/include/net/kcm.h
index 4f371fe..83b4f91 100644
--- a/include/net/kcm.h
+++ b/include/net/kcm.h
@@ -11,6 +11,45 @@
 
 extern unsigned int kcm_net_id;
 
+#define KCM_STATS_ADD(stat, count)			\
+	((stat) += (count))
+
+#define KCM_STATS_INCR(stat)				\
+	((stat)++)
+
+struct kcm_psock_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+	unsigned int rx_aborts;
+	unsigned int rx_mem_fail;
+	unsigned int rx_need_more_hdr;
+	unsigned int rx_bad_hdr_len;
+	unsigned long long reserved;
+	unsigned long long unreserved;
+	unsigned int tx_aborts;
+};
+
+struct kcm_mux_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+	unsigned int rx_ready_drops;
+	unsigned int tx_retries;
+	unsigned int psock_attach;
+	unsigned int psock_unattach_rsvd;
+	unsigned int psock_unattach;
+};
+
+struct kcm_stats {
+	unsigned long long rx_msgs;
+	unsigned long long rx_bytes;
+	unsigned long long tx_msgs;
+	unsigned long long tx_bytes;
+};
+
 struct kcm_tx_msg {
 	unsigned int sent;
 	unsigned int fragidx;
@@ -35,6 +74,8 @@ struct kcm_sock {
 	u32 done : 1;
 	struct work_struct done_work;
 
+	struct kcm_stats stats;
+
 	/* Transmit */
 	struct kcm_psock *tx_psock;
 	struct work_struct tx_work;
@@ -71,6 +112,8 @@ struct kcm_psock {
 
 	struct list_head psock_list;
 
+	struct kcm_psock_stats stats;
+
 	/* Receive */
 	struct sk_buff *rx_skb_head;
 	struct sk_buff **rx_skb_nextp;
@@ -80,15 +123,21 @@ struct kcm_psock {
 	struct delayed_work rx_delayed_work;
 	struct bpf_prog *bpf_prog;
 	struct kcm_sock *rx_kcm;
+	unsigned long long saved_rx_bytes;
+	unsigned long long saved_rx_msgs;
 
 	/* Transmit */
 	struct kcm_sock *tx_kcm;
 	struct list_head psock_avail_list;
+	unsigned long long saved_tx_bytes;
+	unsigned long long saved_tx_msgs;
 };
 
 /* Per net MUX list */
 struct kcm_net {
 	struct mutex mutex;
+	struct kcm_psock_stats aggregate_psock_stats;
+	struct kcm_mux_stats aggregate_mux_stats;
 	struct list_head mux_list;
 	int count;
 };
@@ -104,6 +153,9 @@ struct kcm_mux {
 	struct list_head psocks;	/* List of all psocks on MUX */
 	int psocks_cnt;		/* Total attached sockets */
 
+	struct kcm_mux_stats stats;
+	struct kcm_psock_stats aggregate_psock_stats;
+
 	/* Receive */
 	spinlock_t rx_lock ____cacheline_aligned_in_smp;
 	struct list_head kcm_rx_waiters; /* KCMs waiting for receiving */
@@ -116,6 +168,56 @@ struct kcm_mux {
 	struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */
 };
 
+#ifdef CONFIG_PROC_FS
+int kcm_proc_init(void);
+void kcm_proc_exit(void);
+#else
+static inline int kcm_proc_init(void) { return 0; }
+static inline void kcm_proc_exit(void) { }
+#endif
+
+static inline void aggregate_psock_stats(struct kcm_psock_stats *stats,
+					 struct kcm_psock_stats *agg_stats)
+{
+	/* Save psock statistics in the mux when psock is being unattached. */
+
+#define SAVE_PSOCK_STATS(_stat) (agg_stats->_stat += stats->_stat)
+
+	SAVE_PSOCK_STATS(rx_msgs);
+	SAVE_PSOCK_STATS(rx_bytes);
+	SAVE_PSOCK_STATS(rx_aborts);
+	SAVE_PSOCK_STATS(rx_mem_fail);
+	SAVE_PSOCK_STATS(rx_need_more_hdr);
+	SAVE_PSOCK_STATS(rx_bad_hdr_len);
+	SAVE_PSOCK_STATS(tx_msgs);
+	SAVE_PSOCK_STATS(tx_bytes);
+	SAVE_PSOCK_STATS(reserved);
+	SAVE_PSOCK_STATS(unreserved);
+	SAVE_PSOCK_STATS(tx_aborts);
+
+#undef SAVE_PSOCK_STATS
+}
+
+static inline void aggregate_mux_stats(struct kcm_mux_stats *stats,
+				       struct kcm_mux_stats *agg_stats)
+{
+	/* Save MUX statistics in the net when the MUX is being destroyed. */
+
+#define SAVE_MUX_STATS(_stat) (agg_stats->_stat += stats->_stat)
+
+	SAVE_MUX_STATS(rx_msgs);
+	SAVE_MUX_STATS(rx_bytes);
+	SAVE_MUX_STATS(tx_msgs);
+	SAVE_MUX_STATS(tx_bytes);
+	SAVE_MUX_STATS(rx_ready_drops);
+	SAVE_MUX_STATS(tx_retries);
+	SAVE_MUX_STATS(psock_attach);
+	SAVE_MUX_STATS(psock_unattach_rsvd);
+	SAVE_MUX_STATS(psock_unattach);
+
+#undef SAVE_MUX_STATS
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* __NET_KCM_H_ */
diff --git a/net/kcm/Makefile b/net/kcm/Makefile
index cb525f7..7125613 100644
--- a/net/kcm/Makefile
+++ b/net/kcm/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_AF_KCM) += kcm.o
 
-kcm-y := kcmsock.o
+kcm-y := kcmsock.o kcmproc.o
diff --git a/net/kcm/kcmproc.c b/net/kcm/kcmproc.c
new file mode 100644
index 0000000..5eb9809
--- /dev/null
+++ b/net/kcm/kcmproc.c
@@ -0,0 +1,422 @@
+#include <linux/in.h>
+#include <linux/inet.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/proc_fs.h>
+#include <linux/rculist.h>
+#include <linux/seq_file.h>
+#include <linux/socket.h>
+#include <net/inet_sock.h>
+#include <net/kcm.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/tcp.h>
+
+#ifdef CONFIG_PROC_FS
+struct kcm_seq_muxinfo {
+	char				*name;
+	const struct file_operations	*seq_fops;
+	const struct seq_operations	seq_ops;
+};
+
+static struct kcm_mux *kcm_get_first(struct seq_file *seq)
+{
+	struct net *net = seq_file_net(seq);
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	return list_first_or_null_rcu(&knet->mux_list,
+				      struct kcm_mux, kcm_mux_list);
+}
+
+static struct kcm_mux *kcm_get_next(struct kcm_mux *mux)
+{
+	struct kcm_net *knet = mux->knet;
+
+	return list_next_or_null_rcu(&knet->mux_list, &mux->kcm_mux_list,
+				     struct kcm_mux, kcm_mux_list);
+}
+
+static struct kcm_mux *kcm_get_idx(struct seq_file *seq, loff_t pos)
+{
+	struct net *net = seq_file_net(seq);
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+	struct kcm_mux *m;
+
+	list_for_each_entry_rcu(m, &knet->mux_list, kcm_mux_list) {
+		if (!pos)
+			return m;
+		--pos;
+	}
+	return NULL;
+}
+
+static void *kcm_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	void *p;
+
+	if (v == SEQ_START_TOKEN)
+		p = kcm_get_first(seq);
+	else
+		p = kcm_get_next(v);
+	++*pos;
+	return p;
+}
+
+static void *kcm_seq_start(struct seq_file *seq, loff_t *pos)
+	__acquires(rcu)
+{
+	rcu_read_lock();
+
+	if (!*pos)
+		return SEQ_START_TOKEN;
+	else
+		return kcm_get_idx(seq, *pos - 1);
+}
+
+static void kcm_seq_stop(struct seq_file *seq, void *v)
+	__releases(rcu)
+{
+	rcu_read_unlock();
+}
+
+struct kcm_proc_mux_state {
+	struct seq_net_private p;
+	int idx;
+};
+
+static int kcm_seq_open(struct inode *inode, struct file *file)
+{
+	struct kcm_seq_muxinfo *muxinfo = PDE_DATA(inode);
+
+	return seq_open_net(inode, file, &muxinfo->seq_ops,
+			    sizeof(struct kcm_proc_mux_state));
+}
+
+static void kcm_format_mux_header(struct seq_file *seq)
+{
+	struct net *net = seq_file_net(seq);
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	seq_printf(seq,
+		   "*** KCM statistics (%d MUX) ****\n",
+		   knet->count);
+
+	seq_printf(seq,
+		   "%-14s %-10s %-16s %-10s %-16s %-8s %-8s %-8s %-8s %s",
+		   "Object",
+		   "RX-Msgs",
+		   "RX-Bytes",
+		   "TX-Msgs",
+		   "TX-Bytes",
+		   "Recv-Q",
+		   "Rmem",
+		   "Send-Q",
+		   "Smem",
+		   "Status");
+
+	/* XXX: pdsts header stuff here */
+	seq_puts(seq, "\n");
+}
+
+static void kcm_format_sock(struct kcm_sock *kcm, struct seq_file *seq,
+			    int i, int *len)
+{
+	seq_printf(seq,
+		   "   kcm-%-7u %-10llu %-16llu %-10llu %-16llu %-8d %-8d %-8d %-8s ",
+		   kcm->index,
+		   kcm->stats.rx_msgs,
+		   kcm->stats.rx_bytes,
+		   kcm->stats.tx_msgs,
+		   kcm->stats.tx_bytes,
+		   kcm->sk.sk_receive_queue.qlen,
+		   sk_rmem_alloc_get(&kcm->sk),
+		   kcm->sk.sk_write_queue.qlen,
+		   "-");
+
+	if (kcm->tx_psock)
+		seq_printf(seq, "Psck-%u ", kcm->tx_psock->index);
+
+	if (kcm->tx_wait)
+		seq_puts(seq, "TxWait ");
+
+	if (kcm->tx_wait_more)
+		seq_puts(seq, "WMore ");
+
+	if (kcm->rx_wait)
+		seq_puts(seq, "RxWait ");
+
+	seq_puts(seq, "\n");
+}
+
+static void kcm_format_psock(struct kcm_psock *psock, struct seq_file *seq,
+			     int i, int *len)
+{
+	seq_printf(seq,
+		   "   psock-%-5u %-10llu %-16llu %-10llu %-16llu %-8d %-8d %-8d %-8d ",
+		   psock->index,
+		   psock->stats.rx_msgs,
+		   psock->stats.rx_bytes,
+		   psock->stats.tx_msgs,
+		   psock->stats.tx_bytes,
+		   psock->sk->sk_receive_queue.qlen,
+		   atomic_read(&psock->sk->sk_rmem_alloc),
+		   psock->sk->sk_write_queue.qlen,
+		   atomic_read(&psock->sk->sk_wmem_alloc));
+
+	if (psock->done)
+		seq_puts(seq, "Done ");
+
+	if (psock->tx_stopped)
+		seq_puts(seq, "TxStop ");
+
+	if (psock->rx_stopped)
+		seq_puts(seq, "RxStop ");
+
+	if (psock->tx_kcm)
+		seq_printf(seq, "Rsvd-%d ", psock->tx_kcm->index);
+
+	if (psock->ready_rx_msg)
+		seq_puts(seq, "RdyRx ");
+
+	seq_puts(seq, "\n");
+}
+
+static void
+kcm_format_mux(struct kcm_mux *mux, loff_t idx, struct seq_file *seq)
+{
+	int i, len;
+	struct kcm_sock *kcm;
+	struct kcm_psock *psock;
+
+	/* mux information */
+	seq_printf(seq,
+		   "%-6s%-8s %-10llu %-16llu %-10llu %-16llu %-8s %-8s %-8s %-8s ",
+		   "mux", "",
+		   mux->stats.rx_msgs,
+		   mux->stats.rx_bytes,
+		   mux->stats.tx_msgs,
+		   mux->stats.tx_bytes,
+		   "-", "-", "-", "-");
+
+	seq_printf(seq, "KCMs: %d, Psocks: %d\n",
+		   mux->kcm_socks_cnt, mux->psocks_cnt);
+
+	/* kcm sock information */
+	i = 0;
+	spin_lock_bh(&mux->lock);
+	list_for_each_entry(kcm, &mux->kcm_socks, kcm_sock_list) {
+		kcm_format_sock(kcm, seq, i, &len);
+		i++;
+	}
+	i = 0;
+	list_for_each_entry(psock, &mux->psocks, psock_list) {
+		kcm_format_psock(psock, seq, i, &len);
+		i++;
+	}
+	spin_unlock_bh(&mux->lock);
+}
+
+static int kcm_seq_show(struct seq_file *seq, void *v)
+{
+	struct kcm_proc_mux_state *mux_state;
+
+	mux_state = seq->private;
+	if (v == SEQ_START_TOKEN) {
+		mux_state->idx = 0;
+		kcm_format_mux_header(seq);
+	} else {
+		kcm_format_mux(v, mux_state->idx, seq);
+		mux_state->idx++;
+	}
+	return 0;
+}
+
+static const struct file_operations kcm_seq_fops = {
+	.owner		= THIS_MODULE,
+	.open		= kcm_seq_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release_net,
+};
+
+static struct kcm_seq_muxinfo kcm_seq_muxinfo = {
+	.name		= "kcm",
+	.seq_fops	= &kcm_seq_fops,
+	.seq_ops	= {
+		.show	= kcm_seq_show,
+		.start	= kcm_seq_start,
+		.next	= kcm_seq_next,
+		.stop	= kcm_seq_stop,
+	}
+};
+
+static int kcm_proc_register(struct net *net, struct kcm_seq_muxinfo *muxinfo)
+{
+	struct proc_dir_entry *p;
+	int rc = 0;
+
+	p = proc_create_data(muxinfo->name, S_IRUGO, net->proc_net,
+			     muxinfo->seq_fops, muxinfo);
+	if (!p)
+		rc = -ENOMEM;
+	return rc;
+}
+
+static void kcm_proc_unregister(struct net *net,
+				struct kcm_seq_muxinfo *muxinfo)
+{
+	remove_proc_entry(muxinfo->name, net->proc_net);
+}
+
+static int kcm_stats_seq_show(struct seq_file *seq, void *v)
+{
+	struct kcm_psock_stats psock_stats;
+	struct kcm_mux_stats mux_stats;
+	struct kcm_mux *mux;
+	struct kcm_psock *psock;
+	struct net *net = seq->private;
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	memset(&mux_stats, 0, sizeof(mux_stats));
+	memset(&psock_stats, 0, sizeof(psock_stats));
+
+	mutex_lock(&knet->mutex);
+
+	aggregate_mux_stats(&knet->aggregate_mux_stats, &mux_stats);
+	aggregate_psock_stats(&knet->aggregate_psock_stats,
+			      &psock_stats);
+
+	list_for_each_entry_rcu(mux, &knet->mux_list, kcm_mux_list) {
+		spin_lock_bh(&mux->lock);
+		aggregate_mux_stats(&mux->stats, &mux_stats);
+		aggregate_psock_stats(&mux->aggregate_psock_stats,
+				      &psock_stats);
+		list_for_each_entry(psock, &mux->psocks, psock_list)
+			aggregate_psock_stats(&psock->stats, &psock_stats);
+		spin_unlock_bh(&mux->lock);
+	}
+
+	mutex_unlock(&knet->mutex);
+
+	seq_printf(seq,
+		   "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s\n",
+		   "MUX",
+		   "RX-Msgs",
+		   "RX-Bytes",
+		   "TX-Msgs",
+		   "TX-Bytes",
+		   "TX-Retries",
+		   "Attach",
+		   "Unattach",
+		   "UnattchRsvd",
+		   "RX-RdyDrops");
+
+	seq_printf(seq,
+		   "%-8s %-10llu %-16llu %-10llu %-16llu %-10u %-10u %-10u %-10u %-10u\n",
+		   "",
+		   mux_stats.rx_msgs,
+		   mux_stats.rx_bytes,
+		   mux_stats.tx_msgs,
+		   mux_stats.tx_bytes,
+		   mux_stats.tx_retries,
+		   mux_stats.psock_attach,
+		   mux_stats.psock_unattach_rsvd,
+		   mux_stats.psock_unattach,
+		   mux_stats.rx_ready_drops);
+
+	seq_printf(seq,
+		   "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n",
+		   "Psock",
+		   "RX-Msgs",
+		   "RX-Bytes",
+		   "TX-Msgs",
+		   "TX-Bytes",
+		   "Reserved",
+		   "Unreserved",
+		   "RX-Aborts",
+		   "RX-MemFail",
+		   "RX-NeedMor",
+		   "RX-BadLen",
+		   "TX-Aborts");
+
+	seq_printf(seq,
+		   "%-8s %-10llu %-16llu %-10llu %-16llu %-10llu %-10llu %-10u %-10u %-10u %-10u %-10u\n",
+		   "",
+		   psock_stats.rx_msgs,
+		   psock_stats.rx_bytes,
+		   psock_stats.tx_msgs,
+		   psock_stats.tx_bytes,
+		   psock_stats.reserved,
+		   psock_stats.unreserved,
+		   psock_stats.rx_aborts,
+		   psock_stats.rx_mem_fail,
+		   psock_stats.rx_need_more_hdr,
+		   psock_stats.rx_bad_hdr_len,
+		   psock_stats.tx_aborts);
+
+	return 0;
+}
+
+static int kcm_stats_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open_net(inode, file, kcm_stats_seq_show);
+}
+
+static const struct file_operations kcm_stats_seq_fops = {
+	.owner   = THIS_MODULE,
+	.open    = kcm_stats_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release_net,
+};
+
+static int kcm_proc_init_net(struct net *net)
+{
+	int err;
+
+	if (!proc_create("kcm_stats", S_IRUGO, net->proc_net,
+			 &kcm_stats_seq_fops)) {
+		err = -ENOMEM;
+		goto out_kcm_stats;
+	}
+
+	err = kcm_proc_register(net, &kcm_seq_muxinfo);
+	if (err)
+		goto out_kcm;
+
+	return 0;
+
+out_kcm:
+	remove_proc_entry("kcm_stats", net->proc_net);
+out_kcm_stats:
+	return err;
+}
+
+static void kcm_proc_exit_net(struct net *net)
+{
+	kcm_proc_unregister(net, &kcm_seq_muxinfo);
+	remove_proc_entry("kcm_stats", net->proc_net);
+}
+
+static struct pernet_operations kcm_net_ops = {
+	.init = kcm_proc_init_net,
+	.exit = kcm_proc_exit_net,
+};
+
+int __init kcm_proc_init(void)
+{
+	return register_pernet_subsys(&kcm_net_ops);
+}
+
+void __exit kcm_proc_exit(void)
+{
+	unregister_pernet_subsys(&kcm_net_ops);
+}
+
+#endif /* CONFIG_PROC_FS */
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 061263a..451b504 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -58,6 +58,7 @@ static void kcm_abort_rx_psock(struct kcm_psock *psock, int err,
 		return;
 
 	psock->rx_stopped = 1;
+	KCM_STATS_INCR(psock->stats.rx_aborts);
 
 	/* Report an error on the lower socket */
 	report_csk_error(csk, err);
@@ -79,6 +80,7 @@ static void kcm_abort_tx_psock(struct kcm_psock *psock, int err,
 	}
 
 	psock->tx_stopped = 1;
+	KCM_STATS_INCR(psock->stats.tx_aborts);
 
 	if (!psock->tx_kcm) {
 		/* Take off psocks_avail list */
@@ -291,6 +293,13 @@ static void unreserve_rx_kcm(struct kcm_psock *psock,
 
 	spin_lock_bh(&mux->rx_lock);
 
+	KCM_STATS_ADD(mux->stats.rx_bytes,
+		      psock->stats.rx_bytes - psock->saved_rx_bytes);
+	KCM_STATS_ADD(mux->stats.rx_msgs,
+		      psock->stats.rx_msgs - psock->saved_rx_msgs);
+	psock->saved_rx_msgs = psock->stats.rx_msgs;
+	psock->saved_rx_bytes = psock->stats.rx_bytes;
+
 	psock->rx_kcm = NULL;
 	kcm->rx_psock = NULL;
 
@@ -353,6 +362,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 
 			skb = alloc_skb(0, GFP_ATOMIC);
 			if (!skb) {
+				KCM_STATS_INCR(psock->stats.rx_mem_fail);
 				desc->error = -ENOMEM;
 				return 0;
 			}
@@ -367,6 +377,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		} else {
 			err = skb_unclone(head, GFP_ATOMIC);
 			if (err) {
+				KCM_STATS_INCR(psock->stats.rx_mem_fail);
 				desc->error = err;
 				return 0;
 			}
@@ -379,11 +390,13 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		/* Always clone since we will consume something */
 		skb = skb_clone(orig_skb, GFP_ATOMIC);
 		if (!skb) {
+			KCM_STATS_INCR(psock->stats.rx_mem_fail);
 			desc->error = -ENOMEM;
 			break;
 		}
 
 		if (!pskb_pull(skb, offset + eaten)) {
+			KCM_STATS_INCR(psock->stats.rx_mem_fail);
 			kfree_skb(skb);
 			desc->error = -ENOMEM;
 			break;
@@ -398,6 +411,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		/* Need to trim, should be rare? */
 		err = pskb_trim(skb, orig_len - eaten);
 		if (err) {
+			KCM_STATS_INCR(psock->stats.rx_mem_fail);
 			kfree_skb(skb);
 			desc->error = err;
 			break;
@@ -432,11 +446,13 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 
 			if (!len) {
 				/* Need more header to determine length */
+				KCM_STATS_INCR(psock->stats.rx_need_more_hdr);
 				break;
 			} else if (len <= head->len - skb->len) {
 				/* Length must be into new skb (and also
 				 * greater than zero)
 				 */
+				KCM_STATS_INCR(psock->stats.rx_bad_hdr_len);
 				desc->error = -EPROTO;
 				psock->rx_skb_head = NULL;
 				kcm_abort_rx_psock(psock, EPROTO, head);
@@ -466,6 +482,7 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 
 		/* Hurray, we have a new message! */
 		psock->rx_skb_head = NULL;
+		KCM_STATS_INCR(psock->stats.rx_msgs);
 
 try_queue:
 		kcm = reserve_rx_kcm(psock, head);
@@ -481,6 +498,8 @@ try_queue:
 		}
 	}
 
+	KCM_STATS_ADD(psock->stats.rx_bytes, eaten);
+
 	return eaten;
 }
 
@@ -642,6 +661,7 @@ static struct kcm_psock *reserve_psock(struct kcm_sock *kcm)
 		}
 		kcm->tx_psock = psock;
 		psock->tx_kcm = kcm;
+		KCM_STATS_INCR(psock->stats.reserved);
 	} else if (!kcm->tx_wait) {
 		list_add_tail(&kcm->wait_psock_list,
 			      &mux->kcm_tx_waiters);
@@ -676,6 +696,7 @@ static void psock_now_avail(struct kcm_psock *psock)
 		smp_mb();
 
 		kcm->tx_psock = psock;
+		KCM_STATS_INCR(psock->stats.reserved);
 		queue_work(kcm_wq, &kcm->tx_work);
 	}
 }
@@ -697,10 +718,18 @@ static void unreserve_psock(struct kcm_sock *kcm)
 
 	smp_rmb(); /* Read tx_psock before tx_wait */
 
+	KCM_STATS_ADD(mux->stats.tx_bytes,
+		      psock->stats.tx_bytes - psock->saved_tx_bytes);
+	KCM_STATS_ADD(mux->stats.tx_msgs,
+		      psock->stats.tx_msgs - psock->saved_tx_msgs);
+	psock->saved_tx_msgs = psock->stats.tx_msgs;
+	psock->saved_tx_bytes = psock->stats.tx_bytes;
+
 	WARN_ON(kcm->tx_wait);
 
 	kcm->tx_psock = NULL;
 	psock->tx_kcm = NULL;
+	KCM_STATS_INCR(psock->stats.unreserved);
 
 	if (unlikely(psock->tx_stopped)) {
 		if (psock->done) {
@@ -724,6 +753,15 @@ static void unreserve_psock(struct kcm_sock *kcm)
 	spin_unlock_bh(&mux->lock);
 }
 
+static void kcm_report_tx_retry(struct kcm_sock *kcm)
+{
+	struct kcm_mux *mux = kcm->mux;
+
+	spin_lock_bh(&mux->lock);
+	KCM_STATS_INCR(mux->stats.tx_retries);
+	spin_unlock_bh(&mux->lock);
+}
+
 /* Write any messages ready on the kcm socket.  Called with kcm sock lock
  * held.  Return bytes actually sent or error.
  */
@@ -744,6 +782,7 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
 		 * it and we'll retry the message.
 		 */
 		unreserve_psock(kcm);
+		kcm_report_tx_retry(kcm);
 		if (skb_queue_empty(&sk->sk_write_queue))
 			return 0;
 
@@ -827,6 +866,7 @@ do_frag:
 				unreserve_psock(kcm);
 
 				txm->sent = 0;
+				kcm_report_tx_retry(kcm);
 				ret = 0;
 
 				goto try_again;
@@ -834,6 +874,7 @@ do_frag:
 
 			sent += ret;
 			frag_offset += ret;
+			KCM_STATS_ADD(psock->stats.tx_bytes, ret);
 			if (frag_offset < frag->size) {
 				/* Not finished with this frag */
 				goto do_frag;
@@ -855,6 +896,7 @@ do_frag:
 		kfree_skb(head);
 		sk->sk_wmem_queued -= sent;
 		total_sent += sent;
+		KCM_STATS_INCR(psock->stats.tx_msgs);
 	} while ((head = skb_peek(&sk->sk_write_queue)));
 out:
 	if (!head) {
@@ -1031,6 +1073,7 @@ wait_for_memory:
 		/* Message complete, queue it on send buffer */
 		__skb_queue_tail(&sk->sk_write_queue, head);
 		kcm->seq_skb = NULL;
+		KCM_STATS_INCR(kcm->stats.tx_msgs);
 
 		if (msg->msg_flags & MSG_BATCH) {
 			kcm->tx_wait_more = true;
@@ -1053,6 +1096,8 @@ partial_message:
 		kcm_tx_msg(head)->last_skb = skb;
 	}
 
+	KCM_STATS_ADD(kcm->stats.tx_bytes, copied);
+
 	release_sock(sk);
 	return copied;
 
@@ -1083,6 +1128,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 		       size_t len, int flags)
 {
 	struct sock *sk = sock->sk;
+	struct kcm_sock *kcm = kcm_sk(sk);
 	int err = 0;
 	long timeo;
 	struct kcm_rx_msg *rxm;
@@ -1129,6 +1175,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 
 	copied = len;
 	if (likely(!(flags & MSG_PEEK))) {
+		KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
 		if (copied < rxm->full_len) {
 			if (sock->type == SOCK_DGRAM) {
 				/* Truncated message */
@@ -1141,6 +1188,7 @@ static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
 msg_finished:
 			/* Finished with message */
 			msg->msg_flags |= MSG_EOR;
+			KCM_STATS_INCR(kcm->stats.rx_msgs);
 			skb_unlink(skb, &sk->sk_receive_queue);
 			kfree_skb(skb);
 		}
@@ -1352,6 +1400,7 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
 	list_add(&psock->psock_list, head);
 	psock->index = index;
 
+	KCM_STATS_INCR(mux->stats.psock_attach);
 	mux->psocks_cnt++;
 	psock_now_avail(psock);
 	spin_unlock_bh(&mux->lock);
@@ -1427,6 +1476,7 @@ static void kcm_unattach(struct kcm_psock *psock)
 		list_del(&psock->psock_ready_list);
 		kfree_skb(psock->ready_rx_msg);
 		psock->ready_rx_msg = NULL;
+		KCM_STATS_INCR(mux->stats.rx_ready_drops);
 	}
 
 	spin_unlock_bh(&mux->rx_lock);
@@ -1443,11 +1493,16 @@ static void kcm_unattach(struct kcm_psock *psock)
 
 	spin_lock_bh(&mux->lock);
 
+	aggregate_psock_stats(&psock->stats, &mux->aggregate_psock_stats);
+
+	KCM_STATS_INCR(mux->stats.psock_unattach);
+
 	if (psock->tx_kcm) {
 		/* psock was reserved. Just mark it finished and we will
 		 * clean up in the kcm paths; we need the kcm lock, which
 		 * cannot be acquired here.
 		 */
+		KCM_STATS_INCR(mux->stats.psock_unattach_rsvd);
 		spin_unlock_bh(&mux->lock);
 
 		/* We are unattaching a socket that is reserved. Abort the
@@ -1675,6 +1730,9 @@ static void release_mux(struct kcm_mux *mux)
 	__skb_queue_purge(&mux->rx_hold_queue);
 
 	mutex_lock(&knet->mutex);
+	aggregate_mux_stats(&mux->stats, &knet->aggregate_mux_stats);
+	aggregate_psock_stats(&mux->aggregate_psock_stats,
+			      &knet->aggregate_psock_stats);
 	list_del_rcu(&mux->kcm_mux_list);
 	knet->count--;
 	mutex_unlock(&knet->mutex);
@@ -1937,8 +1995,15 @@ static int __init kcm_init(void)
 	if (err)
 		goto net_ops_fail;
 
+	err = kcm_proc_init();
+	if (err)
+		goto proc_init_fail;
+
 	return 0;
 
+proc_init_fail:
+	unregister_pernet_device(&kcm_net_ops);
+
 net_ops_fail:
 	sock_unregister(PF_KCM);
 
@@ -1957,6 +2022,7 @@ fail:
 
 static void __exit kcm_exit(void)
 {
+	kcm_proc_exit();
 	unregister_pernet_device(&kcm_net_ops);
 	sock_unregister(PF_KCM);
 	proto_unregister(&kcm_proto);
-- 
2.4.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH net-next 6/6] kcm: Add description in Documentation
  2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
                   ` (4 preceding siblings ...)
  2015-11-20 21:21 ` [PATCH net-next 5/6] kcm: Add statistics and proc interfaces Tom Herbert
@ 2015-11-20 21:22 ` Tom Herbert
  2015-11-23  9:53 ` [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Hannes Frederic Sowa
  6 siblings, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 21:22 UTC (permalink / raw)
  To: davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

Add kcm.txt to describe KCM and its interfaces.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 Documentation/networking/kcm.txt | 273 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 273 insertions(+)
 create mode 100644 Documentation/networking/kcm.txt

diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt
new file mode 100644
index 0000000..5432090
--- /dev/null
+++ b/Documentation/networking/kcm.txt
@@ -0,0 +1,273 @@
+Kernel Connection Multiplexor
+-----------------------------
+
+Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
+interface over TCP for generic application protocols. With KCM an application
+can efficiently send and receive application protocol messages over TCP using
+datagram sockets.
+
+KCM implements an NxM multiplexor in the kernel as diagrammed below:
+
++------------+   +------------+   +------------+   +------------+
+| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
++------------+   +------------+   +------------+   +------------+
+      |                 |               |                |
+      +-----------+     |               |     +----------+
+                  |     |               |     |
+               +----------------------------------+
+               |           Multiplexor            |
+               +----------------------------------+
+                 |   |           |           |  |
+       +---------+   |           |           |  ------------+
+       |             |           |           |              |
++----------+  +----------+  +----------+  +----------+ +----------+
+|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
++----------+  +----------+  +----------+  +----------+ +----------+
+      |              |           |            |             |
++----------+  +----------+  +----------+  +----------+ +----------+
+| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
++----------+  +----------+  +----------+  +----------+ +----------+
+
+KCM sockets
+-----------
+
+The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
+bound to a multiplexor are considered to have equivalent function, and I/O
+operations in different sockets may be done in parallel without the need for
+synchronization between threads in userspace.
+
+Multiplexor
+-----------
+
+The multiplexor provides the message steering. In the transmit path, messages
+written on a KCM socket are sent atomically on an appropriate TCP socket.
+Similarly, in the receive path, messages are constructed on each TCP socket
+(Psock) and complete messages are steered to a KCM socket.
+
+TCP sockets & Psocks
+--------------------
+
+TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
+for each bound TCP socket; this structure holds the state for constructing
+messages on receive as well as other connection specific information for KCM.
+
+Connected mode semantics
+------------------------
+
+Each multiplexor assumes that all attached TCP connections are to the same
+destination and can use the different connections for load balancing when
+transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
+can be used to send and receive messages from the KCM socket.
+
+Socket types
+------------
+
+KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
+
+Message delineation
+-------------------
+
+Messages are sent over a TCP stream with some application protocol message
+format that typically includes a header which frames the messages. The length
+of a received message can be deduced from the application protocol header
+(often just a simple length field).
+
+A TCP stream must be parsed to determine message boundaries. Berkeley Packet
+Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
+BPF program must be specified. The program is called at the start of receiving
+a new message and is given an skbuff that contains the bytes received so far.
+It parses the message header and returns the length of the message. Given this
+information, KCM will construct the message of the stated length and deliver it
+to a KCM socket.
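+(A sample BPF program is shown in the "BPF programs for message
+delineation" section below.)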
+
+TCP socket management
+---------------------
+
+When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
+write space available (POLLOUT) events are handled by the multiplexor. If there
+is a state change (disconnection) or other error on a TCP socket, an error is
+posted on the TCP socket so that a POLLERR event happens and KCM discontinues
+using the socket. When the application gets the error notification for a
+TCP socket, it should unattach the socket from KCM and then handle the error
+condition (the typical response is to close the socket and create a new
+connection if necessary).
+
+User interface
+==============
+
+Creating a multiplexor
+----------------------
+
+A new multiplexor and initial KCM socket is created by a socket call:
+
+  socket(AF_KCM, type, protocol)
+
+  - type is either SOCK_DGRAM or SOCK_SEQPACKET
+  - protocol is KCMPROTO_CONNECTED
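+
+For example, a minimal sketch (error handling elided):
+
+  int kcmfd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
+
+  if (kcmfd < 0)
+    /* handle error */ ;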
+
+Cloning KCM sockets
+-------------------
+
+After the first KCM socket is created using the socket call as described
+above, additional sockets for the multiplexor can be created by cloning
+a KCM socket. This is accomplished by an ioctl on a KCM socket:
+
+  /* From linux/kcm.h */
+  struct kcm_clone {
+        int fd;
+  };
+
+  struct kcm_clone info;
+
+  memset(&info, 0, sizeof(info));
+
+  err = ioctl(kcmfd, SIOCKCMCLONE, &info);
+
+  if (!err)
+    newkcmfd = info.fd;
+
+Attach transport sockets
+------------------------
+
+Attaching of transport sockets to a multiplexor is performed by calling an
+ioctl on a KCM socket for the multiplexor, e.g.:
+
+  /* From linux/kcm.h */
+  struct kcm_attach {
+        int fd;
+	int bpf_fd;
+  };
+
+  struct kcm_attach info;
+
+  memset(&info, 0, sizeof(info));
+
+  info.fd = tcpfd;
+  info.bpf_fd = bpf_prog_fd;
+
+  ioctl(kcmfd, SIOCKCMATTACH, &info);
+
+The kcm_attach structure contains:
+  fd: file descriptor for TCP socket being attached
+  bpf_fd: file descriptor for the compiled BPF program being downloaded
+
+Unattach transport sockets
+--------------------------
+
+Unattaching a transport socket from a multiplexor is straightforward. An
+"unattach" ioctl is done with the kcm_unattach structure as the argument:
+
+  /* From linux/kcm.h */
+  struct kcm_unattach {
+        int fd;
+  };
+
+  struct kcm_unattach info;
+
+  memset(&info, 0, sizeof(info));
+
+  info.fd = cfd;
+
+  ioctl(fd, SIOCKCMUNATTACH, &info);
+
+Disabling receive on KCM socket
+-------------------------------
+
+A setsockopt is used to disable or enable receiving on a KCM socket.
+When receive is disabled, any pending messages in the socket's
+receive buffer are moved to other sockets. This feature is useful
+if an application thread knows that it will be doing a lot of
+work on a request and won't be able to service new messages for a
+while. Example use:
+
+  int val = 1;
+
+  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val))
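+
+Setting val to zero with the same call re-enables receive on the socket.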
+
+BPF programs for message delineation
+------------------------------------
+
+BPF programs can be compiled using the BPF LLVM backend. For example,
+the BPF program for parsing Thrift is:
+
+  #include "bpf.h" /* for __sk_buff */
+  #include "bpf_helpers.h" /* for load_word intrinsic */
+
+  SEC("socket_kcm")
+  int bpf_prog1(struct __sk_buff *skb)
+  {
+       return load_word(skb, 0) + 4;
+  }
+
+  char _license[] SEC("license") = "GPL";
+
+Use in applications
+===================
+
+KCM accelerates application layer protocols. Specifically, it allows
+applications to use a message based interface for sending and receiving
+messages. The kernel provides necessary assurances that messages are sent
+and received atomically. This relieves much of the burden applications have
+in mapping a message based protocol onto the TCP stream. KCM also makes
+application layer messages a unit of work in the kernel for the purposes of
+steering and scheduling, which in turn allows a simpler networking model in
+multithreaded applications.
+
+Configurations
+--------------
+
+In an Nx1 configuration, KCM logically provides multiple socket handles
+to the same TCP connection. This allows parallelism in I/O operations on
+the TCP socket (for instance, copyin and copyout of data are
+parallelized). In an application, a KCM socket can be opened for each
+processing thread and inserted into an epoll set (similar to how
+SO_REUSEPORT is used to allow multiple listener sockets on the same
+port).
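+
+A minimal sketch of the per-thread setup (epfd is assumed to be the
+worker thread's epoll instance):
+
+  struct kcm_clone info;
+  struct epoll_event ev = { .events = EPOLLIN };
+
+  memset(&info, 0, sizeof(info));
+
+  /* Clone a KCM socket for this thread and watch it for input */
+  if (ioctl(kcmfd, SIOCKCMCLONE, &info) == 0) {
+    ev.data.fd = info.fd;
+    epoll_ctl(epfd, EPOLL_CTL_ADD, info.fd, &ev);
+  }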
+
+In an MxN configuration, multiple connections are established to the
+same destination. These are used for simple load balancing.
+
+Message batching
+----------------
+
+The primary purpose of KCM is load balancing between KCM sockets and hence
+threads in a nominal use case. Perfect load balancing, that is steering
+each received message to a different KCM socket or steering each sent
+message to a different TCP socket, can negatively impact performance
+since this doesn't allow for affinities to be established. Balancing
+based on groups, or batches of messages, can be beneficial for performance.
+
+On transmit, there are three ways an application can batch (pipeline)
+messages on a KCM socket.
+  1) Send multiple messages in a single sendmmsg.
+  2) Send a group of messages each with a sendmsg call, where all messages
+     except the last have MSG_BATCH in the flags of the sendmsg call (see
+     the sketch below).
+  3) Create "super message" composed of multiple messages and send this
+     with a single sendmsg.
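+
+A minimal sketch of the second method (bufs[], lens[], and the count n
+are assumed to be set up by the caller):
+
+  int i;
+
+  for (i = 0; i < n; i++) {
+    struct iovec iov = { .iov_base = bufs[i], .iov_len = lens[i] };
+    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
+
+    /* Flag all but the last message so the stack may batch them up */
+    sendmsg(kcmfd, &msg, (i == n - 1) ? 0 : MSG_BATCH);
+  }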
+
+On receive, the KCM module attempts to queue messages received on the
+same KCM socket during each TCP ready callback. The targeted KCM socket
+changes at each receive ready callback on the KCM socket. The application
+does not need to configure this.
+
+Error handling
+--------------
+
+An application should include a thread to monitor errors raised on
+the TCP connection. Normally, this will be done by placing each
+TCP socket attached to a KCM multiplexor in an epoll set and waiting
+for POLLERR events. If an error occurs on an attached TCP socket, KCM
+sets an EPIPE on the socket, thus waking up the application thread.
+When the application sees the error (which may just be a disconnect) it
+should unattach the socket from KCM and then close it. It is assumed
+that once an error is posted on the TCP socket the data stream is
+unrecoverable (i.e. an error may have occurred in the middle of
+receiving a message).
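+
+A minimal sketch of such a monitor (the attached TCP fds are assumed to
+already be registered in the epoll set epfd):
+
+  struct epoll_event evs[16];
+  int i, n;
+
+  n = epoll_wait(epfd, evs, 16, -1);
+  for (i = 0; i < n; i++) {
+    if (evs[i].events & EPOLLERR) {
+      /* Unattach the failed TCP socket from KCM, then close it */
+      struct kcm_unattach info = { .fd = evs[i].data.fd };
+
+      ioctl(kcmfd, SIOCKCMUNATTACH, &info);
+      close(evs[i].data.fd);
+    }
+  }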
+
+TCP connection monitoring
+-------------------------
+
+In KCM there is no means to correlate a message to the TCP socket that
+was used to send or receive the message (except in the case there is
+only one attached TCP socket). However, the application does retain
+an open file descriptor to the socket so it will be able to get statistics
+from the socket which can be used in detecting issues (such as high
+retransmissions on the socket).
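+
+For example, retransmissions can be sampled with TCP_INFO (a sketch;
+tcpfd is the retained descriptor of an attached socket):
+
+  struct tcp_info ti;
+  socklen_t olen = sizeof(ti);
+
+  if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &ti, &olen) == 0)
+    /* inspect ti.tcpi_total_retrans, ti.tcpi_rtt, ... */ ;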
-- 
2.4.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
  2015-11-20 21:21 ` [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module Tom Herbert
@ 2015-11-20 22:50   ` Sowmini Varadhan
  2015-11-20 23:19     ` Tom Herbert
  2015-11-20 23:10   ` Alexei Starovoitov
  2015-11-23  9:42   ` Daniel Borkmann
  2 siblings, 1 reply; 43+ messages in thread
From: Sowmini Varadhan @ 2015-11-20 22:50 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, kernel-team, davewatson, alexei.starovoitov

On (11/20/15 13:21), Tom Herbert wrote:
> +static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
   :
> +
> +		if (msg->msg_flags & MSG_BATCH) {
> +			kcm->tx_wait_more = true;
> +		} else if (kcm->tx_wait_more || not_busy) {
> +			err = kcm_write_msgs(kcm);
> +			if (err < 0) {
> +				/* We got a hard error in write_msgs but have
> +				 * already queued this message. Report an error
> +				 * in the socket, but don't affect return value
> +				 * from sendmsg
> +				 */
> +				pr_warn("KCM: Hard failure on kcm_write_msgs\n");
> +				report_csk_error(&kcm->sk, -err);
> +			}
> +		}

It's interesting that kcm copies the user data to a skb and
then invokes kernel_sendpage on the frag_list in that skb- was this 
specifically done with some perf goals in mind? If yes, do you happen
to have some estimate of how much this approach buys you, as opposed
to just setting up a sglist and calling tcp_sendpage later? (RDS uses
the latter approach, and I've tried to use the changes introduced
by Eric's commit in 5640f76, it helps slightly but I think there may
be other bottlenecks to overcome first for the specific req-resp
patterns that are common in DB workloads)

The other question I had when reading this code is: what if the
application never sends that last MSG_BATCH-less message, e.g.,
it lies about how its going send more messages? will something eventually
time-out and send the data? Any estimates for a good batch size?

--Sowmini

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
  2015-11-20 21:21 ` [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module Tom Herbert
  2015-11-20 22:50   ` Sowmini Varadhan
@ 2015-11-20 23:10   ` Alexei Starovoitov
  2015-11-20 23:20     ` Tom Herbert
  2015-11-23  9:42   ` Daniel Borkmann
  2 siblings, 1 reply; 43+ messages in thread
From: Alexei Starovoitov @ 2015-11-20 23:10 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev, kernel-team, davewatson

On Fri, Nov 20, 2015 at 01:21:58PM -0800, Tom Herbert wrote:
> +
> +	while (eaten < orig_len) {
> +		/* Always clone since we will consume something */
> +		skb = skb_clone(orig_skb, GFP_ATOMIC);
> +		if (!skb) {
> +			desc->error = -ENOMEM;
> +			break;
> +		}
...
> +		if (!pskb_pull(skb, offset + eaten)) {

pskb_pull after clone == pskb_expand_head
meaning that we'll be linearizing things all the time.
can we try skb_share_check() first ?
the packet shouldn't be shared at this point, so it might save a lot.

> +			kfree_skb(skb);
> +			desc->error = -ENOMEM;
> +			break;
> +		}
...
> +		/* Need to trim, should be rare? */
> +		err = pskb_trim(skb, orig_len - eaten);
...
> +		if (!rxm->full_len) {
> +			ssize_t len;
> +
> +			len = KCM_RUN_FILTER(psock->bpf_prog, head);

bpf can read packet beyond linear header just fine.
It's slower than reads from header, since extra calls invovled,
but if that becomes a bottle neck we can optimize the way we do JIT there.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
  2015-11-20 22:50   ` Sowmini Varadhan
@ 2015-11-20 23:19     ` Tom Herbert
  2015-11-20 23:27       ` Sowmini Varadhan
  0 siblings, 1 reply; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 23:19 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	davewatson, Alexei Starovoitov

On Fri, Nov 20, 2015 at 2:50 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (11/20/15 13:21), Tom Herbert wrote:
>> +static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
>    :
>> +
>> +             if (msg->msg_flags & MSG_BATCH) {
>> +                     kcm->tx_wait_more = true;
>> +             } else if (kcm->tx_wait_more || not_busy) {
>> +                     err = kcm_write_msgs(kcm);
>> +                     if (err < 0) {
>> +                             /* We got a hard error in write_msgs but have
>> +                              * already queued this message. Report an error
>> +                              * in the socket, but don't affect return value
>> +                              * from sendmsg
>> +                              */
>> +                             pr_warn("KCM: Hard failure on kcm_write_msgs\n");
>> +                             report_csk_error(&kcm->sk, -err);
>> +                     }
>> +             }
>
> It's interesting that kcm copies the user data to a skb and
> then invokes kernel_sendpage on the frag_list in that skb - was this
> specifically done with some perf goals in mind? If yes, do you happen
> to have some estimate of how much this approach buys you, as opposed
> to just setting up a sglist and calling tcp_sendpage later? (RDS uses
> the latter approach, and I've tried to use the changes introduced
> by Eric's commit in 5640f76, it helps slightly but I think there may
> be other bottlenecks to overcome first for the specific req-resp
> patterns that are common in DB workloads)
>
Hi Sowmini,

I did notice that RDS is just creating sglist, but I also noticed that
this requires allocating "struct rds_message" which holds pointers to
the sglist, list pointers for a queue, etc. This looks to me like it's
emulating skbuffs anyway. I haven't looked at whether there are performance
issues otherwise in using the fraglist. It might be interesting if
there was an interface to send skbufs on a kernel socket.

> The other question I had when reading this code is: what if the
> application never sends that last MSG_BATCH-less message, e.g.,
> it lies about how it's going to send more messages? Will something
> eventually time out and send the data? Any estimates for a good batch size?
>
No timeout; sending will block. I don't think this behavior needs to
be any different from what happens if an application forgets to
complete a MSG_MORE sequence.
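
For illustration, a sender batching n messages might look roughly like
the sketch below (hypothetical userspace code; only the MSG_BATCH
semantics come from this patch set):

	#include <sys/socket.h>

	/* Send n pre-built messages as one batch on a KCM socket. */
	static int kcm_send_batch(int kcm_fd, struct msghdr *msgs, int n)
	{
		int i;

		for (i = 0; i < n; i++) {
			/* MSG_BATCH on all but the last message says more
			 * will follow; the final sendmsg() without it
			 * flushes the batch.
			 */
			int flags = (i < n - 1) ? MSG_BATCH : 0;

			if (sendmsg(kcm_fd, &msgs[i], flags) < 0)
				return -1;
		}
		return 0;
	}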

Thanks,
Tom

> --Sowmini

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
  2015-11-20 23:10   ` Alexei Starovoitov
@ 2015-11-20 23:20     ` Tom Herbert
  0 siblings, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-20 23:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	davewatson

On Fri, Nov 20, 2015 at 3:10 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Nov 20, 2015 at 01:21:58PM -0800, Tom Herbert wrote:
>> +
>> +     while (eaten < orig_len) {
>> +             /* Always clone since we will consume something */
>> +             skb = skb_clone(orig_skb, GFP_ATOMIC);
>> +             if (!skb) {
>> +                     desc->error = -ENOMEM;
>> +                     break;
>> +             }
> ...
>> +             if (!pskb_pull(skb, offset + eaten)) {
>
> pskb_pull after clone == pskb_expand_head
> meaning that we'll be linearizing things all the time.
> can we try skb_share_check() first ?
> the packet shouldn't be shared at this point, so it might save a lot.
>
Yes it might. This pull does seem to be expensive.

>> +                     kfree_skb(skb);
>> +                     desc->error = -ENOMEM;
>> +                     break;
>> +             }
> ...
>> +             /* Need to trim, should be rare? */
>> +             err = pskb_trim(skb, orig_len - eaten);
> ...
>> +             if (!rxm->full_len) {
>> +                     ssize_t len;
>> +
>> +                     len = KCM_RUN_FILTER(psock->bpf_prog, head);
>
> bpf can read the packet beyond the linear header just fine.
> It's slower than reads from the header, since extra calls are involved,
> but if that becomes a bottleneck we can optimize the way we do JIT there.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
  2015-11-20 23:19     ` Tom Herbert
@ 2015-11-20 23:27       ` Sowmini Varadhan
  0 siblings, 0 replies; 43+ messages in thread
From: Sowmini Varadhan @ 2015-11-20 23:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	davewatson, Alexei Starovoitov

On (11/20/15 15:19), Tom Herbert wrote:
> 
> I did notice that RDS is just creating an sglist, but I also noticed that
> this requires allocating a "struct rds_message" which holds pointers to
> the sglist, list pointers for a queue, etc. This looks to me like it's
> emulating skbuffs anyway. I haven't looked at whether there are performance

Yes, some of that might come from the IB origins of the code,
so that it could be shared with an IB transport. I was just wondering
if directly using sk_buffs would be a significant perf win for rds-tcp.

> issues otherwise in using the frag_list. It might be interesting if
> there was an interface to send skbuffs on a kernel socket.

Yes. Or if there was a way to factor out the non-zero page-order
enhancements in skb_page_frag_refill in a way that they could be
shared with RDS.

--Sowmini

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module
  2015-11-20 21:21 ` [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module Tom Herbert
  2015-11-20 22:50   ` Sowmini Varadhan
  2015-11-20 23:10   ` Alexei Starovoitov
@ 2015-11-23  9:42   ` Daniel Borkmann
  2 siblings, 0 replies; 43+ messages in thread
From: Daniel Borkmann @ 2015-11-23  9:42 UTC (permalink / raw)
  To: Tom Herbert, davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

On 11/20/2015 10:21 PM, Tom Herbert wrote:
[...]
> +
> +/* Macro to invoke filter function. */
> +#define KCM_RUN_FILTER(prog, ctx) \
> +	(*prog->bpf_func)(ctx, prog->insnsi)

Any reason to redefine this macro?

We already have the same one as:

#define BPF_PROG_RUN(filter, ctx)  (*filter->bpf_func)(ctx, filter->insnsi)
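
The call site in the parser could then simply read (same semantics, no
new macro):

	len = BPF_PROG_RUN(psock->bpf_prog, head);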

[...]
> +static int kcm_attach_ioctl(struct socket *sock, struct kcm_attach *info)
> +{
> +	struct socket *csock;
> +	struct bpf_prog *prog;
> +	int err;
> +
> +	csock = sockfd_lookup(info->fd, &err);
> +	if (!csock)
> +		return -ENOENT;
> +
> +	prog = bpf_prog_get(info->bpf_fd);
> +	if (IS_ERR(prog)) {
> +		err = PTR_ERR(prog);
> +		goto out;
> +	}
> +
> +	if (prog->type != BPF_PROG_TYPE_SOCKET_FILTER) {
> +		bpf_prog_put(prog);

I'd move this and the below bpf_prog_put() under out_put label, too.

> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	err = kcm_attach(sock, csock, prog);
> +	if (err) {
> +		bpf_prog_put(prog);

                 ^^^

> +		goto out;
> +	}
> +
> +	/* Keep reference on file also */
> +
> +	return 0;
> +out:
> +	fput(csock->file);
> +	return err;
> +}
[...]
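
I.e., a sketch of the suggested restructuring, so every failure after
bpf_prog_get() flows through a single label (names follow the quoted
code; not a tested patch):

	if (prog->type != BPF_PROG_TYPE_SOCKET_FILTER) {
		err = -EINVAL;
		goto out_put;
	}

	err = kcm_attach(sock, csock, prog);
	if (err)
		goto out_put;

	/* Keep reference on file also */
	return 0;
out_put:
	bpf_prog_put(prog);
out:
	fput(csock->file);
	return err;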

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
                   ` (5 preceding siblings ...)
  2015-11-20 21:22 ` [PATCH net-next 6/6] kcm: Add description in Documentation Tom Herbert
@ 2015-11-23  9:53 ` Hannes Frederic Sowa
  2015-11-23 12:43   ` Sowmini Varadhan
  2015-11-23 17:33   ` Tom Herbert
  6 siblings, 2 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-23  9:53 UTC (permalink / raw)
  To: Tom Herbert, davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

Hello,

On Fri, Nov 20, 2015, at 22:21, Tom Herbert wrote:
> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> The motivation for this is based on the observation that although
> TCP is byte stream transport protocol with no concept of message
> boundaries, a common use case is to implement a framed application
> layer protocol running over TCP. To date, most TCP stacks offer
> byte stream API for applications, which places the burden of message
> delineation, message I/O operation atomicity, and load balancing
> in the application. With KCM an application can efficiently send
> and receive application protocol messages over TCP using a
> datagram interface.

I am struggling a bit to see a real need to come up with a new socket
type and subsystem for this. It looks like you want to solve the same
problem that PACKET_FANOUT does? TCP has the PSH flag, which could help
delimit messages, and a way to improve fanout along the lines of
PACKET_FANOUT would solve this same problem, too? A proper fallback has
to be in user space anyway, but messages could maybe simply be flagged
with an skb->mark and fanout could push them to the correct
FANOUT sub-socket.

> In order to delineate message in a TCP stream for receive in KCM, the
> kernel implements a message parser. For this we chose to employ BPF
> which is applied to the TCP stream. BPF code parses application layer
> messages and returns a message length. Nearly all binary application
> protocols are parsable in this manner, so KCM should be applicable
> across a wide range of applications. Other than message length
> determination in receive, KCM does not require any other application
> specific awareness. KCM does not implement any other application
> protocol semantics-- these are provided in userspace or could be
> implemented in a kernel module layered above KCM.

For me this still looks a little bit like messages could be delimited by
the TCP PSH flag, where we might need some more fine-grained control over
it, and besides that this is just adding better fanout semantics to TCP, no?

Do kcm sockets still allow streaming unlimited amounts of data? E.g. if
you want to pass a data stream attached to an RPC message? I think not
allowing streaming would be a major shortcoming then (even though this
will induce head-of-line blocking).

> Future support:
> 
>  - Integration with TLS (TLS-in-kernel is a separate initiative).

This is interesting:

Regarding the last week's discussion about better OOB support in TCP
e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
and do CHANGE_CIPHER on the socket synchronously?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 3/6] net: Add MSG_BATCH flag
  2015-11-20 21:21 ` [PATCH net-next 3/6] net: Add MSG_BATCH flag Tom Herbert
@ 2015-11-23 10:02   ` Hannes Frederic Sowa
  0 siblings, 0 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-23 10:02 UTC (permalink / raw)
  To: Tom Herbert, davem, netdev; +Cc: kernel-team, davewatson, alexei.starovoitov

Hello,

On Fri, Nov 20, 2015, at 22:21, Tom Herbert wrote:
> Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
> indicate that more messages will follow (i.e. a batch of messages is
> being sent). This is similar to MSG_MORE except that the following
> messages are not merged into one packet, they are sent individually.
> 
> MSG_BATCH is a performance optimization in cases where a socket
> implementation can benefit by transmitting packets in a batch.
> 
> This patch also updates sendmmsg so that each contained message except
> for the last one is marked as MSG_BATCH.

This flag is only used for KCM because it does not make sense to expose
it to user space? As such, could this be made more clear? I don't see
such an optimization being needed for UDP or TCP.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-23  9:53 ` [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Hannes Frederic Sowa
@ 2015-11-23 12:43   ` Sowmini Varadhan
  2015-11-23 17:33   ` Tom Herbert
  1 sibling, 0 replies; 43+ messages in thread
From: Sowmini Varadhan @ 2015-11-23 12:43 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Tom Herbert, davem, netdev, kernel-team, davewatson, alexei.starovoitov

On (11/23/15 10:53), Hannes Frederic Sowa wrote:
> > 
> >  - Integration with TLS (TLS-in-kernel is a separate initiative).
> 
> This is interesting:
> 
> Regarding the last week's discussion about better OOB support in TCP
> e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
> and do CHANGE_CIPHER on the socket synchronously?

I have had that same question too. In fact I pointed this out already
in the thread at http://permalink.gmane.org/gmane.linux.network/382278

In addition to CCS, TLS does other complex things such as mid-session
regeneration of new session keys based on the master-secret. If you
move TLS to the kernel, there may be a lot of 
synchronicity/security/inter-op issues to resolve.

Perhaps it's not a good idea to use "TLS" on the TCP socket, but let
each kcm application negotiate a crypto key (in any manner that it wants) 
and set it on the PF_KCM socket, then use that key to encrypt application
data just before passing it off to tcp. (Of course, then you have to deal 
with the fact that BPF still needs to get to the clear data somehow)

--Sowmini

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-23  9:53 ` [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Hannes Frederic Sowa
  2015-11-23 12:43   ` Sowmini Varadhan
@ 2015-11-23 17:33   ` Tom Herbert
  2015-11-23 19:35     ` Hannes Frederic Sowa
  2015-11-23 19:54     ` David Miller
  1 sibling, 2 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-23 17:33 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	davewatson, Alexei Starovoitov

> For me this still looks a little bit like messages could be delimited by
> the TCP PSH flag, where we might need some more fine-grained control over
> it, and besides that this is just adding better fanout semantics to TCP, no?
>
The TCP PSH flag is not defined for message delineation (neither is
urgent pointer). We can't change that (many people have tried to add
message semantics to TCP protocol but have always failed miserably).
The fact is TCP is always going to be a stream based protocol. Period!
:-) It is up to the application to interpret the stream and extract
messages. Even if we could somehow apply the PSH bit to "help" in
message delineation, we would need to change senders to use the PSH
bit in that fashion for it to be of benefit to receivers.

> Do kcm sockets still allow streaming unlimited amounts of data? E.g. if
> you want to pass a data stream attached to an RPC message? I think not
> allowing streaming would be a major shortcoming then (even though this
> will induce head-of-line blocking).
>
RPC messages can be of arbitrary size and with SOCK_SEQPACKET,
messages can be sent or received in multiple calls. No HOL blocking
since messages are constructed on KCM sockets before starting to send
on TCP sockets. Socket buffer limits are respected. KCM does not
enforce a maximum message size; if an application does have a maximum
then that can be checked in the BPF code.

>> Future support:
>>
>>  - Integration with TLS (TLS-in-kernel is a separate initiative).
>
> This is interesting:
>
> Regarding the last week's discussion about better OOB support in TCP
> e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
> and do CHANGE_CIPHER on the socket synchronously?
>
Dave should be posting the basic TLS-in-the-kernel patches shortly;
those will be a better context for discussion.

Tom

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-23 17:33   ` Tom Herbert
@ 2015-11-23 19:35     ` Hannes Frederic Sowa
  2015-11-23 19:54     ` David Miller
  1 sibling, 0 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-23 19:35 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, Kernel Team,
	davewatson, Alexei Starovoitov

Hello Tom,

On Mon, Nov 23, 2015, at 18:33, Tom Herbert wrote:
> > For me this still looks a little bit like messages could be delimited by
> > the TCP PSH flag, where we might need some more fine-grained control over
> > it, and besides that this is just adding better fanout semantics to TCP, no?
> >
> The TCP PSH flag is not defined for message delineation (neither is
> urgent pointer). We can't change that (many people have tried to add
> message semantics to TCP protocol but have always failed miserably).
> The fact is TCP is always going to be a stream based protocol. Period!
> :-) It is up to the application to interpret the stream and extract
> messages. Even if we could somehow apply the PSH bit to "help" in
> message delineation, we would need to change senders to use the PSH
> bit in that fashion for it to be of benefit to receivers.

I see TCP PSH flags as an optimization and I agree it is hard to
properly make use of them in the internet. But in a datacenter where
everything is under control, this could be done?

Anyway, decoding arbitrary messages in the kernel with potentially huge
lengths could result in starvation problems if you adhere to the socket
receive buffer limits at all times. So I wonder whether a forward-progress
guarantee can be achieved here, agnostic of the eBPF program? I really
see this becoming a problem as soon as people use it for privilege
separation. Will there be central error handling?

Also, would it make sense to add a TCP option here instead of using the
TCP PSH flag? Not sure yet...

> > Do kcm sockets still allow streaming unlimited amounts of data? E.g. if
> > you want to pass a data stream attached to an RPC message? I think not
> > allowing streaming would be a major shortcoming then (even though this
> > will induce head-of-line blocking).
> >
> RPC messages can be of arbitrary size and with SOCK_SEQPACKET,
> messages can be sent or received in multiple calls. No HOL blocking
> since messages are constructed on KCM sockets before starting to send
> on TCP sockets. Socket buffer limits are respected. KCM does not
> enforce a maximum message size; if an application does have a maximum
> then that can be checked in the BPF code.

I was referring to HOL blocking at the receiver's end, the same as in
user-space TCP, where one data stream (or huge message) keeps the byte
stream busy so no other datagrams in it can be delivered. For low latency
I would actually use multiple streams or switch to UDP with user-space
based retry.

I think this problem more and more comes down to improving the epoll
interface with somewhat better CPU-steered wake-up capabilities to make
it more agnostic. Some programs, e.g., also want to be woken up only once
an HTTP header has been received completely; SO_RCVLOWAT was made for
this, and FreeBSD has accept_filter for this kind of thing.
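
For reference, that existing knob is a plain setsockopt (fragment;
expected_header_len is hypothetical):

	int lowat = expected_header_len;

	/* Only wake the reader once at least this much data is queued. */
	setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));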

You want to use this in Thrift, which is mainly Java-based, and reuse
the existing NIO infrastructure?

> >> Future support:
> >>
> >>  - Integration with TLS (TLS-in-kernel is a separate initiative).
> >
> > This is interesting:
> >
> > Regarding the last week's discussion about better OOB support in TCP
> > e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
> > and do CHANGE_CIPHER on the socket synchronously?
> >
> Dave should be posting the basic TLS-in-the-kernel patches shortly;
> those will be a better context for discussion.

Thanks, I am looking at them right now. :)

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-23 17:33   ` Tom Herbert
  2015-11-23 19:35     ` Hannes Frederic Sowa
@ 2015-11-23 19:54     ` David Miller
  2015-11-23 20:02       ` Tom Herbert
                         ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: David Miller @ 2015-11-23 19:54 UTC (permalink / raw)
  To: tom; +Cc: hannes, netdev, kernel-team, davewatson, alexei.starovoitov

From: Tom Herbert <tom@herbertland.com>
Date: Mon, 23 Nov 2015 09:33:44 -0800

> The TCP PSH flag is not defined for message delineation (neither is
> urgent pointer). We can't change that (many people have tried to add
> message semantics to TCP protocol but have always failed miserably).

Agreed.

My only gripe with kcm right now is a lack of a native sendpage.
We should be able to zero copy data through KCM streams without
any problems whatsoever.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-23 19:54     ` David Miller
@ 2015-11-23 20:02       ` Tom Herbert
  2015-11-24 11:25       ` Hannes Frederic Sowa
  2015-11-24 15:27       ` Florian Westphal
  2 siblings, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2015-11-23 20:02 UTC (permalink / raw)
  To: David Miller
  Cc: Hannes Frederic Sowa, Linux Kernel Network Developers,
	Kernel Team, davewatson, Alexei Starovoitov

On Mon, Nov 23, 2015 at 11:54 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Mon, 23 Nov 2015 09:33:44 -0800
>
>> The TCP PSH flag is not defined for message delineation (neither is
>> urgent pointer). We can't change that (many people have tried to add
>> message semantics to TCP protocol but have always failed miserably).
>
> Agreed.
>
> My only gripe with kcm right now is a lack of a native sendpage.
> We should be able to zero copy data through KCM streams without
> any problems whatsoever.

Right, there is no reason zero copy won't work here. I was just trying
to keep the initial implementation small for reviewability.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-23 19:54     ` David Miller
  2015-11-23 20:02       ` Tom Herbert
@ 2015-11-24 11:25       ` Hannes Frederic Sowa
  2015-11-24 15:49         ` David Miller
  2015-11-24 15:27       ` Florian Westphal
  2 siblings, 1 reply; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-24 11:25 UTC (permalink / raw)
  To: David Miller, tom; +Cc: netdev, kernel-team, davewatson, alexei.starovoitov

Hello,

David Miller <davem@davemloft.net> writes:

> From: Tom Herbert <tom@herbertland.com>
> Date: Mon, 23 Nov 2015 09:33:44 -0800
>
>> The TCP PSH flag is not defined for message delineation (neither is
>> urgent pointer). We can't change that (many people have tried to add
>> message semantics to TCP protocol but have always failed miserably).
>
> Agreed.
>
> My only gripe with kcm right now is a lack of a native sendpage.
> We should be able to zero copy data through KCM streams without
> any problems whatsoever.

I understood from Tom's last mail that the messages are being
constructed *in kernel memory* before being sent out of the tcp
socket. What advantage does sendpage give? The message construction must
actually happen first, to fill in the necessary headers with the length
so the receiver can dissect it again. Streaming semantics don't really
fit here?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-23 19:54     ` David Miller
  2015-11-23 20:02       ` Tom Herbert
  2015-11-24 11:25       ` Hannes Frederic Sowa
@ 2015-11-24 15:27       ` Florian Westphal
  2015-11-24 15:49         ` Eric Dumazet
  2015-11-24 15:55         ` David Miller
  2 siblings, 2 replies; 43+ messages in thread
From: Florian Westphal @ 2015-11-24 15:27 UTC (permalink / raw)
  To: David Miller
  Cc: tom, hannes, netdev, kernel-team, davewatson, alexei.starovoitov

David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Mon, 23 Nov 2015 09:33:44 -0800
> 
> > The TCP PSH flag is not defined for message delineation (neither is
> > urgent pointer). We can't change that (many people have tried to add
> > message semantics to TCP protocol but have always failed miserably).
>
> Agreed.
>
> My only gripe with kcm right now is a lack of a native sendpage.

Aside from Hannes' comment -- KCM seems to be tied to the TLS work, i.e.
I have the impression that KCM without the ability to do TLS in the kernel
is pretty much useless for whatever use case Tom has in mind.

And that ktls thing just gives me the creeps.

For KCM itself I don't even get the use case -- it's in the 'yeah, you
can do that, but... why?' category 8-/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 11:25       ` Hannes Frederic Sowa
@ 2015-11-24 15:49         ` David Miller
  0 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2015-11-24 15:49 UTC (permalink / raw)
  To: hannes; +Cc: tom, netdev, kernel-team, davewatson, alexei.starovoitov

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Tue, 24 Nov 2015 12:25:39 +0100

> Hello,
> 
> David Miller <davem@davemloft.net> writes:
> 
>> From: Tom Herbert <tom@herbertland.com>
>> Date: Mon, 23 Nov 2015 09:33:44 -0800
>>
>>> The TCP PSH flag is not defined for message delineation (neither is
>>> urgent pointer). We can't change that (many people have tried to add
>>> message semantics to TCP protocol but have always failed miserably).
>>
>> Agreed.
>>
>> My only gripe with kcm right now is a lack of a native sendpage.
>> We should be able to zero copy data through KCM streams without
>> any problems whatsoever.
> 
> I understood from Tom's last mail that the messages are being
> constructed *in kernel memory* before being sent out of the tcp
> socket. What advantage does sendpage give? The message construction must
> actually happen first, to fill in the necessary headers with the length
> so the receiver can dissect it again. Streaming semantics don't really
> fit here?

You can sendpage part of a filesystem file, and push the headers in
front (the "KCM stuff") without disturbing the page frags.
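
For instance, something along these lines in the kernel (illustrative
fragment; the msghdr/kvec setup for the header is omitted):

	/* Push the framing header first... */
	kernel_sendmsg(sock, &msg, &hdr_iov, 1, hdr_len);
	/* ...then splice the file pages through without copying them. */
	kernel_sendpage(sock, page, offset, bytes, MSG_MORE);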

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 15:27       ` Florian Westphal
@ 2015-11-24 15:49         ` Eric Dumazet
  2015-11-24 18:09           ` Rick Jones
  2015-11-24 15:55         ` David Miller
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2015-11-24 15:49 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, tom, hannes, netdev, kernel-team, davewatson,
	alexei.starovoitov

On Tue, 2015-11-24 at 16:27 +0100, Florian Westphal wrote:
> David Miller <davem@davemloft.net> wrote:
> > From: Tom Herbert <tom@herbertland.com>
> > Date: Mon, 23 Nov 2015 09:33:44 -0800
> > 
> > > The TCP PSH flag is not defined for message delineation (neither is
> > > urgent pointer). We can't change that (many people have tried to add
> > > message semantics to TCP protocol but have always failed miserably).
> >
> > Agreed.
> >
> > My only gripe with kcm right now is a lack of a native sendpage.
> 
> Aside from Hannes' comment -- KCM seems to be tied to the TLS work, i.e.
> I have the impression that KCM without the ability to do TLS in the kernel
> is pretty much useless for whatever use case Tom has in mind.
> 
> And that ktls thing just gives me the creeps.
> 
> For KCM itself I don't even get the use case -- it's in the 'yeah, you
> can do that, but... why?' category 8-/

Note that I also played with a similar idea in the TCP stack, trying to
wake up the receiver only when a full RPC was present in the receive queue.

(When dealing with our internal Google RPC format, we can easily delimit
RPC boundaries)

But in the end, latencies were bigger, because the application had to
copy from kernel to user (read()) the full message in one go. Whereas if
you wake up the application for every incoming GRO message, we prefill cpu
caches, and the last read() only has to copy the remaining part and
benefits from hot caches (up-to-date RFS state, TCP socket structure, but
also data in the application).

I focused on making TSO/GRO more effective, and had better results.

One nice idea was to mark the PSH flag for every TSO packet we send.
(One line patch in TCP senders)
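
Illustratively, something like this in the sender's transmit path (a
sketch of the idea, not the actual patch):

	/* Flag every TSO packet with PSH so the receiver's GRO engine
	 * flushes it to the upper stacks immediately.
	 */
	TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;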

This had the nice effect of keeping the number of flows in the receiver
GRO engine as small as possible, avoiding the evictions that we do when
we reach 8 flows per RX queue (This is because the PSH flag tells GRO to
immediately complete the GRO packet to upper stacks)

-> Less cpu spent in GRO engine, as the gro_list is kept small,
better aggregation efficiency.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 15:27       ` Florian Westphal
  2015-11-24 15:49         ` Eric Dumazet
@ 2015-11-24 15:55         ` David Miller
  2015-11-24 16:25           ` Florian Westphal
  1 sibling, 1 reply; 43+ messages in thread
From: David Miller @ 2015-11-24 15:55 UTC (permalink / raw)
  To: fw; +Cc: tom, hannes, netdev, kernel-team, davewatson, alexei.starovoitov

From: Florian Westphal <fw@strlen.de>
Date: Tue, 24 Nov 2015 16:27:44 +0100

> Aside from Hannes' comment -- KCM seems to be tied to the TLS work, i.e.
> I have the impression that KCM without the ability to do TLS in the kernel
> is pretty much useless for whatever use case Tom has in mind.

I do not get this impression at all.

Tom's design document in the final patch looks legitimately what the
core use case is.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 15:55         ` David Miller
@ 2015-11-24 16:25           ` Florian Westphal
  2015-11-24 17:00             ` Tom Herbert
                               ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Florian Westphal @ 2015-11-24 16:25 UTC (permalink / raw)
  To: David Miller
  Cc: fw, tom, hannes, netdev, kernel-team, davejwatson, alexei.starovoitov

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Tue, 24 Nov 2015 16:27:44 +0100
> 
> > Aside from Hannes' comment -- KCM seems to be tied to the TLS work, i.e.
> > I have the impression that KCM without the ability to do TLS in the kernel
> > is pretty much useless for whatever use case Tom has in mind.
> 
> I do not get this impression at all.
> 
> Tom's design document in the final patch looks legitimately what the
> core use case is.

You mean
https://patchwork.ozlabs.org/patch/547054/ ?

It's a well-written document, but I don't see how moving the burden of
locking a single logical tcp connection (to prevent threads from
reading a partial record) from userspace to kernel is an improvement.

If you really have 100 threads and must use a single tcp connection
to multiplex some arbitrarily complex record-format in atomic fashion,
then your requirements suck.

Now, arguably, maybe the requirements of Tom's use case are restricted
/cannot be avoided.

But that still begs the question: Why should mainline care?

Once it's in, the next step will be 'my single tcp connection that I use
for multiplexing via KCM now has a requirement to use TLS'.

How far are you willing to take the KCM concept?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 16:25           ` Florian Westphal
@ 2015-11-24 17:00             ` Tom Herbert
  2015-11-24 17:16               ` Florian Westphal
  2015-11-24 18:23             ` Hannes Frederic Sowa
  2015-11-25 16:26             ` Sowmini Varadhan
  2 siblings, 1 reply; 43+ messages in thread
From: Tom Herbert @ 2015-11-24 17:00 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, Hannes Frederic Sowa,
	Linux Kernel Network Developers, Kernel Team, davejwatson,
	Alexei Starovoitov

On Tue, Nov 24, 2015 at 8:25 AM, Florian Westphal <fw@strlen.de> wrote:
> David Miller <davem@davemloft.net> wrote:
>> From: Florian Westphal <fw@strlen.de>
>> Date: Tue, 24 Nov 2015 16:27:44 +0100
>>
>> > Aside from Hannes' comment -- KCM seems to be tied to the TLS work, i.e.
>> > I have the impression that KCM without the ability to do TLS in the kernel
>> > is pretty much useless for whatever use case Tom has in mind.
>>
>> I do not get this impression at all.
>>
>> Tom's design document in the final patch looks legitimately what the
>> core use case is.
>
> You mean
> https://patchwork.ozlabs.org/patch/547054/ ?
>
> Its a well-written document, but I don't see how moving the burden of
> locking a single logical tcp connection (to prevent threads from
> reading a partial record) from userspace to kernel is an improvement.
>
> If you really have 100 threads and must use a single tcp connection
> to multiplex some arbitrarily complex record-format in atomic fashion,
> then your requirements suck.
>
Well, this is the sort of thing that multi-threaded applications do.

> Now, arguably, maybe the requirements of Toms use case are restricted
> /cannot be avoided.
>
> But that still begs the question: Why should mainline care?
>
I have no idea. I guess it's the same reason that mainline would care
about RDS, iSCSI, FCoE, RDMA, or anything of that nature. No one is
being forced to use any of this.

> Once its in, next step will be 'my single tcp connection that I use
> for multiplexing via KCM now has requirement to use TLS'.
>
> How far are you willing to take the KCM concept?

Obviously we are looking forward to TLS+KCM. But it does open up a bunch
of other possibilities.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 17:00             ` Tom Herbert
@ 2015-11-24 17:16               ` Florian Westphal
  2015-11-24 17:43                 ` Tom Herbert
  0 siblings, 1 reply; 43+ messages in thread
From: Florian Westphal @ 2015-11-24 17:16 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Florian Westphal, David Miller, Hannes Frederic Sowa,
	Linux Kernel Network Developers, Kernel Team, davejwatson,
	Alexei Starovoitov

Tom Herbert <tom@herbertland.com> wrote:
> No one is being forced to use any of this.

Right.  But it will need to be maintained.
Let's ignore ktls for the time being and focus on KCM.

I'm currently trying to figure out how memory handling in KCM
is supposed to work.

Say we have the following record framing:

struct record {
	u32 len;
	char data[];
};

And I have an eBPF filter that returns record->len within KCM.
Now this program says 'length 128mbyte' (or whatever).
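
(A minimal sketch of such a parser, assuming len counts only the
payload and sits big-endian at offset 0; load_word() is the helper
used by the kernel's eBPF samples:)

	int record_parser(struct __sk_buff *skb)
	{
		/* Report the full record length: header plus payload. */
		return sizeof(u32) + load_word(skb, 0);
	}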

If this were userspace, things would be simple: userspace can either
decide to hang up or start to read this in chunks as data arrives.

AFAICS, with KCM, the kernel now has to keep 128mb of allocated
memory around, and rmem limits are ignored.

Is that correct?  What if the next record claims 4g in size?
I don't really see how we can make any guarantees wrt.
kernel stability...

Am I missing something?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 17:16               ` Florian Westphal
@ 2015-11-24 17:43                 ` Tom Herbert
  2015-11-24 20:55                   ` Florian Westphal
  0 siblings, 1 reply; 43+ messages in thread
From: Tom Herbert @ 2015-11-24 17:43 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, Hannes Frederic Sowa,
	Linux Kernel Network Developers, Kernel Team, davejwatson,
	Alexei Starovoitov

On Tue, Nov 24, 2015 at 9:16 AM, Florian Westphal <fw@strlen.de> wrote:
> Tom Herbert <tom@herbertland.com> wrote:
>> No one is being forced to use any of this.
>
> Right.  But it will need to be maintained.
> Lets ignore ktls for the time being and focus on KCM.
>
> I'm currently trying to figure out how memory handling in KCM
> is supposed to work.
>
> Say we have the following record framing:
>
> struct record {
>         u32 len;
>         char data[];
> };
>
> And I have an eBPF filter that returns record->len within KCM.
> Now this program says 'length 128mbyte' (or whatever).
>
> If this were userspace, things would be simple: userspace can either
> decide to hang up or start to read this in chunks as data arrives.
>
> AFAICS, with KCM, the kernel now has to keep 128mb of allocated
> memory around, and rmem limits are ignored.
> 
> Is that correct?  What if the next record claims 4g in size?
> I don't really see how we can make any guarantees wrt.
> kernel stability...
>
> Am I missing something?

Message size limits can be enforced in BPF or we could add a limit
enforced by KCM. For instance, the message size limit in http/2 is
16M. If it's needed, it wouldn't be much trouble to add a streaming
interface for large messages.
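
A sketch of such a check in the parser itself, under the hypothetical
framing above (the exact reject convention is up to KCM; a non-positive
length is used here as a stand-in for "refuse"):

	#define MAX_MSG_LEN (16 * 1024 * 1024)	/* e.g. the http/2 limit */

	int record_parser(struct __sk_buff *skb)
	{
		u32 len = sizeof(u32) + load_word(skb, 0);

		if (len > MAX_MSG_LEN)
			return 0;	/* refuse oversized messages */
		return len;
	}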

>
> Thanks,
> Florian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 15:49         ` Eric Dumazet
@ 2015-11-24 18:09           ` Rick Jones
  0 siblings, 0 replies; 43+ messages in thread
From: Rick Jones @ 2015-11-24 18:09 UTC (permalink / raw)
  To: Eric Dumazet, Florian Westphal
  Cc: David Miller, tom, hannes, netdev, kernel-team, davewatson,
	alexei.starovoitov

On 11/24/2015 07:49 AM, Eric Dumazet wrote:
> But in the end, latencies were bigger, because the application had to
> copy from kernel to user (read()) the full message in one go. Whereas if
> you wake up the application for every incoming GRO message, we prefill cpu
> caches, and the last read() only has to copy the remaining part and
> benefits from hot caches (up-to-date RFS state, TCP socket structure, but
> also data in the application).

You can see something similar (at least in terms of latency) when 
messing about with MTU sizes.  For some message sizes - 8KB being a 
popular one - you will see higher latency on the likes of netperf TCP_RR 
with JumboFrames than you would with the standard 1500 byte MTU. 
Something I saw on GbE links years back anyway.  I chalked it up to 
getting better parallelism between the NIC and the host.

Of course the service demands were lower with JumboFrames...

rick jones

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 16:25           ` Florian Westphal
  2015-11-24 17:00             ` Tom Herbert
@ 2015-11-24 18:23             ` Hannes Frederic Sowa
  2015-11-24 18:59               ` Alexei Starovoitov
  2015-11-25 16:26             ` Sowmini Varadhan
  2 siblings, 1 reply; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-24 18:23 UTC (permalink / raw)
  To: Florian Westphal, David Miller
  Cc: tom, netdev, kernel-team, davejwatson, alexei.starovoitov

Hello,

On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> It's a well-written document, but I don't see how moving the burden of
> locking a single logical tcp connection (to prevent threads from
> reading a partial record) from userspace to kernel is an improvement.
> 
> If you really have 100 threads and must use a single tcp connection
> to multiplex some arbitrarily complex record-format in atomic fashion,
> then your requirements suck.

Right, if we are in a datacenter I would probably write a script and use
all those IPv6 addresses to set up mappings a la:

for each $cpu; do
  $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
done

where ip-address would also ensure via ntuple filters that the packets get
routed to the correct cpu. Some management layer could then ensure
correct usage of all CPUs. This way less cross-cpu traffic is incurred
by the TCP synchronization.

I have only dealt with medium- to low-load RPC systems, but that would be
my first approach. Would something like that make sense? Would the NxM
federation overhead between the RPC systems kill this approach? This way
you could also make use of the TCP PSH flag like Eric described.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 18:23             ` Hannes Frederic Sowa
@ 2015-11-24 18:59               ` Alexei Starovoitov
  2015-11-24 19:16                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 43+ messages in thread
From: Alexei Starovoitov @ 2015-11-24 18:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Florian Westphal, David Miller, tom, netdev, kernel-team, davejwatson

On Tue, Nov 24, 2015 at 07:23:30PM +0100, Hannes Frederic Sowa wrote:
> Hello,
> 
> On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> > It's a well-written document, but I don't see how moving the burden of
> > locking a single logical tcp connection (to prevent threads from
> > reading a partial record) from userspace to kernel is an improvement.
> > 
> > If you really have 100 threads and must use a single tcp connection
> > to multiplex some arbitrarily complex record-format in atomic fashion,
> > then your requirements suck.
> 
> Right, if we are in a datacenter I would probably write a script and use
> all those IPv6 addresses to set up mappings a la:
> 
> for each $cpu; do
>   $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
> done

interesting idea, but then remote host will be influencing local cpu selection?
how remote can figure out the number of local cpus?

Consider scenario where you have a ton of tcp sockets feeding into
bigger or smaller set of kcm sockets processed by threads or fibers.
Pinning sockets to cpu is not going to work.

Also note that optimizing byte copies between kernel and user space is important,
but we lose a lot more in user space due to scheduling and re-scheduling
when demux-ing user space thread is feeding other worker threads.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 18:59               ` Alexei Starovoitov
@ 2015-11-24 19:16                 ` Hannes Frederic Sowa
  2015-11-24 19:26                   ` Hannes Frederic Sowa
  2015-11-24 20:23                   ` Alexei Starovoitov
  0 siblings, 2 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-24 19:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Florian Westphal, David Miller, tom, netdev, kernel-team, davejwatson

Hello,

On Tue, Nov 24, 2015, at 19:59, Alexei Starovoitov wrote:
> On Tue, Nov 24, 2015 at 07:23:30PM +0100, Hannes Frederic Sowa wrote:
> > Hello,
> > 
> > On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> > > It's a well-written document, but I don't see how moving the burden of
> > > locking a single logical tcp connection (to prevent threads from
> > > reading a partial record) from userspace to kernel is an improvement.
> > > 
> > > If you really have 100 threads and must use a single tcp connection
> > > to multiplex some arbitrarily complex record-format in atomic fashion,
> > > then your requirements suck.
> > 
> > Right, if we are in a datacenter I would probably write a script and use
> > all those IPv6 addresses to set up mappings a la:
> > 
> > for each $cpu; do
> >   $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
> > done
> 
> interesting idea, but then remote host will be influencing local cpu
> selection?
> how remote can figure out the number of local cpus?

Via rpc! :)

The configuration shouldn't change all the time and some get_info rpc
call could provide info for the topology of the machine, or...

> Consider scenario where you have a ton of tcp sockets feeding into
> bigger or smaller set of kcm sockets processed by threads or fibers.
> Pinning sockets to cpu is not going to work.
> 
> Also note that optimizing byte copies between kernel and user space is
> important,
> but we lose a lot more in user space due to scheduling and re-scheduling
> when demux-ing user space thread is feeding other worker threads.

...also ipvs/netfilter could be used to only inspect the header and
reroute the packet to some better fitting CPU. Complete hierarchies
could be built with NUMA and addresses, packets could be rerouted into
namespaces, etc.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 19:16                 ` Hannes Frederic Sowa
@ 2015-11-24 19:26                   ` Hannes Frederic Sowa
  2015-11-24 20:23                   ` Alexei Starovoitov
  1 sibling, 0 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-24 19:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Florian Westphal, David Miller, tom, netdev, kernel-team, davejwatson


On Tue, Nov 24, 2015, at 20:16, Hannes Frederic Sowa wrote:
> ...also ipvs/netfilter could be used to only inspect the header and
> reroute the packet to some better fitting CPU. Complete hierarchies
> could be build with NUMA and addresses, packets could be rerouted into
> namespaces, etc.

Maybe also redirects could be useful. But they might be a bit intrusive
and hard to debug.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 19:16                 ` Hannes Frederic Sowa
  2015-11-24 19:26                   ` Hannes Frederic Sowa
@ 2015-11-24 20:23                   ` Alexei Starovoitov
       [not found]                     ` <1448402288.1489559.449199721.64EBB346@webmail.messagingengine.com>
  1 sibling, 1 reply; 43+ messages in thread
From: Alexei Starovoitov @ 2015-11-24 20:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Florian Westphal, David Miller, tom, netdev, kernel-team, davejwatson

On Tue, Nov 24, 2015 at 08:16:25PM +0100, Hannes Frederic Sowa wrote:
> Hello,
> 
> On Tue, Nov 24, 2015, at 19:59, Alexei Starovoitov wrote:
> > On Tue, Nov 24, 2015 at 07:23:30PM +0100, Hannes Frederic Sowa wrote:
> > > Hello,
> > > 
> > > On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> > > > It's a well-written document, but I don't see how moving the burden of
> > > > locking a single logical tcp connection (to prevent threads from
> > > > reading a partial record) from userspace to kernel is an improvement.
> > > > 
> > > > If you really have 100 threads and must use a single tcp connection
> > > > to multiplex some arbitrarily complex record-format in atomic fashion,
> > > > then your requirements suck.
> > > 
> > > Right, if we are in a datacenter I would probably write a script and use
> > > all those IPv6 addresses to set up mappings a la:
> > > 
> > > for each $cpu; do
> > >   $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
> > > done
> > 
> > interesting idea, but then remote host will be influencing local cpu
> > selection?
> > how remote can figure out the number of local cpus?
> 
> Via rpc! :)
> 
> The configuration shouldn't change all the time and some get_info rpc
> call could provide info for the topology of the machine, or...

Configuration changes all the time. Machines crash, traffic redirected
because of load, etc, etc

> > Consider scenario where you have a ton of tcp sockets feeding into
> > bigger or smaller set of kcm sockets processed by threads or fibers.
> > Pinning sockets to cpu is not going to work.
> > 
> > Also note that optimizing byte copies between kernel and user space is
> > important,
> > but we lose a lot more in user space due to scheduling and re-scheduling
> > when demux-ing user space thread is feeding other worker threads.
> 
> ...also ipvs/netfilter could be used to only inspect the header and
> reroute the packet to some better fitting CPU. Complete hierarchies
> could be built with NUMA and addresses, packets could be rerouted into
> namespaces, etc.

or tc+bpf redirect...
but the reason it won't work is the same as af_packet+bpf fanout doesn't apply:
It's not packet based demuxing.
Kernel needs to deal with TCP stream first and different messages within single
TCP stream go to different workers.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 17:43                 ` Tom Herbert
@ 2015-11-24 20:55                   ` Florian Westphal
  2015-11-24 21:49                     ` Tom Herbert
  0 siblings, 1 reply; 43+ messages in thread
From: Florian Westphal @ 2015-11-24 20:55 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Florian Westphal, David Miller, Hannes Frederic Sowa,
	Linux Kernel Network Developers, Kernel Team, davejwatson,
	Alexei Starovoitov

Tom Herbert <tom@herbertland.com> wrote:
> Message size limits can be enforced in BPF or we could add a limit
> enforced by KCM. For instance, the message size limit in http/2 is
> 16M. If it's needed, it wouldn't be much trouble to add a streaming
> interface for large messages.

That still won't change the fact that KCM allows eating a large
amount of kernel memory (you could just open a lot of sockets...).

For tcp we cannot exceed the total rmem limits, even if I can open
4k sockets.

Why anyone would invest such a huge amount of work in making this
kernel-based framing for single-stream tcp record (de)mux rather than
improving the userspace protocol to use UDP or SCTP or at least
one tcp connection per worker is beyond me.

For TX side, why is writev not good enough?
Is KCM tx just so that userspace doesn't need to handle partial writes?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 20:55                   ` Florian Westphal
@ 2015-11-24 21:49                     ` Tom Herbert
  2015-11-24 22:22                       ` Florian Westphal
  0 siblings, 1 reply; 43+ messages in thread
From: Tom Herbert @ 2015-11-24 21:49 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, Hannes Frederic Sowa,
	Linux Kernel Network Developers, Kernel Team, davejwatson,
	Alexei Starovoitov

On Tue, Nov 24, 2015 at 12:55 PM, Florian Westphal <fw@strlen.de> wrote:
> Tom Herbert <tom@herbertland.com> wrote:
>> Message size limits can be enforced in BPF or we could add a limit
>> enforced by KCM. For instance, the message size limit in http/2 is
>> 16M. If it's needed, it wouldn't be much trouble to add a streaming
>> interface for large messages.
>
> That still won't change the fact that KCM allows eating a large
> amount of kernel memory (you could just open a lot of sockets...).
>
> For tcp we cannot exceed the total rmem limits, even if I can open
> 4k sockets.
>
> Why anyone would invest such a huge amount of work in making this
> kernel-based framing for single-stream tcp record (de)mux rather than
> improving the userspace protocol to use UDP or SCTP or at least
> one tcp connection per worker is beyond me.
>
From the /0 patch:

Q: Why not use an existing message-oriented protocol such as RUDP,
   DCCP, SCTP, RDS, and others?

A: Because that would entail using a completely new transport protocol.
   Deploying a new protocol at scale is either a huge undertaking or
   fundamentally infeasible. This is true both in the Internet and in
   the data center, due in large part to protocol ossification.
   Besides, we want KCM to work with existing, well-deployed application
   protocols that we couldn't change even if we wanted to (e.g. http/2).

   KCM simply defines a new interface method; it does not redefine any
   aspect of the transport protocol nor the application protocol, nor set
   any new requirements on these. Neither does KCM attempt to implement
   any application protocol logic other than message delineation in the
   stream. These are fundamental requirements of KCM.


> For TX side, why is writev not good enough?

writev on a TCP stream does not guarantee atomicity of the operation.

> Is KCM tx just so that userspace doesn't need to handle partial writes?

It makes writes atomic without user space needing to implement locking
when a socket is shared amongst threads.
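
To illustrate the failure mode on a bare TCP socket (hypothetical
userspace fragment; write_remaining() is a made-up helper):

	/* Thread A, no application-level lock: */
	ssize_t n = writev(tcp_fd, iov, iovcnt);	/* may be short */
	if (n < record_len) {
		/* Thread B can writev() its own record here, splicing
		 * its bytes into the middle of A's record on the wire.
		 */
		write_remaining(tcp_fd, iov, iovcnt, n);
	}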

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 21:49                     ` Tom Herbert
@ 2015-11-24 22:22                       ` Florian Westphal
  2015-11-24 22:25                         ` David Miller
  0 siblings, 1 reply; 43+ messages in thread
From: Florian Westphal @ 2015-11-24 22:22 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Florian Westphal, David Miller, Hannes Frederic Sowa,
	Linux Kernel Network Developers, Kernel Team, davejwatson,
	Alexei Starovoitov

Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Nov 24, 2015 at 12:55 PM, Florian Westphal <fw@strlen.de> wrote:
> > Why anyone would invest such a huge amount of work in making this
> > kernel-based framing for single-stream tcp record (de)mux rather than
> > improving the userspace protocol to use UDP or SCTP or at least
> > one tcp connection per worker is beyond me.
> >
> From the /0 patch:
> 
> Q: Why not use an existing message-oriented protocol such as RUDP,
>    DCCP, SCTP, RDS, and others?
> 
> A: Because that would entail using a completely new transport protocol.

That's why I wrote 'or at least one tcp connection per worker'.

> > For TX side, why is writev not good enough?
> 
> writev on a TCP stream does not guarantee atomicity of the operation.

Are you talking about short writes?

> It makes writes atomic without user space needing to implement locking
> when a socket is shared amongst threads.

Yes, I get that point, but I maintain that KCM is a strange workaround
for bad userspace design.

1 tcp connection per thread -> no userspace sockfd lock needed

Sender side can use writev, sendmsg, sendmmsg, etc to avoid sending
sub-record sized frames.

Is user space really so bad that instead of fixing it, it's simpler to
work around it with even more kernel bloat?

Since for KCM userspace has to be adjusted anyway I find that hard
to believe.

I don't know if the 'dynamic RCVLOWAT' that you want is needed
(you say 'yes'; Eric's reply seems to indicate it's not, at least assuming
a sane/friendly peer that doesn't intentionally xmit byte-by-byte).

But assuming there would really be a benefit, maybe an RCVLOWAT2 could
be added?  Of course we could only make it a hint and would have to
make a blocking read return with less data than desired when the tcp rmem
limit gets hit.  But at least we'd avoid the 'unbounded allocation of a
large amount of kernel memory' problem that we have with the current proposal.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 22:22                       ` Florian Westphal
@ 2015-11-24 22:25                         ` David Miller
  2015-11-24 22:45                           ` Florian Westphal
  2015-11-24 23:13                           ` Hannes Frederic Sowa
  0 siblings, 2 replies; 43+ messages in thread
From: David Miller @ 2015-11-24 22:25 UTC (permalink / raw)
  To: fw; +Cc: tom, hannes, netdev, kernel-team, davejwatson, alexei.starovoitov

From: Florian Westphal <fw@strlen.de>
Date: Tue, 24 Nov 2015 23:22:42 +0100

> Yes, I get that point, but I maintain that KCM is a strange workaround
> for bad userspace design.

I fundamentally disagree with you.

And even if I didn't, I would be remiss to completely dismiss the
difficulty in changing existing protocols and existing large scale
implementations of them.  If we can facilitate them somehow then
I see nothing wrong with that.

Neither you nor Hannes have made a strong enough argument for me
to consider Tom's work not suitable for upstream.

Have you even looked at the example userspace use case he referenced
and considered the constraints under which it operates?  I seriously
doubt you did.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 22:25                         ` David Miller
@ 2015-11-24 22:45                           ` Florian Westphal
  2015-11-24 23:13                           ` Hannes Frederic Sowa
  1 sibling, 0 replies; 43+ messages in thread
From: Florian Westphal @ 2015-11-24 22:45 UTC (permalink / raw)
  To: David Miller
  Cc: fw, tom, hannes, netdev, kernel-team, davejwatson, alexei.starovoitov

David Miller <davem@redhat.com> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Tue, 24 Nov 2015 23:22:42 +0100
> 
> > Yes, I get that point, but I maintain that KCM is a strange workaround
> > for bad userspace design.
> 
> I fundamentally disagree with you.

Fair enough.  Still, I do not see how what KCM intends to do
can be achieved while at the same time imposing some upper bound on
the amount of kernel memory we can allocate globally and per socket.

Once such a limit were enforced, the question becomes how the kernel
could handle such an error other than via a close of the underlying tcp
connection.

Not adding any limit is not a good idea in my opinion.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 22:25                         ` David Miller
  2015-11-24 22:45                           ` Florian Westphal
@ 2015-11-24 23:13                           ` Hannes Frederic Sowa
  1 sibling, 0 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-24 23:13 UTC (permalink / raw)
  To: David Miller, fw
  Cc: tom, netdev, kernel-team, davejwatson, alexei.starovoitov

Hi David,

On Tue, Nov 24, 2015, at 23:25, David Miller wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Tue, 24 Nov 2015 23:22:42 +0100
> 
> > Yes, I get that point, but I maintain that KCM is a strange workaround
> > for bad userspace design.
> 
> I fundamentally disagree with you.
> 
> And even if I didn't, I would be remiss to completely dismiss the
> difficulty in changing existing protocols and existing large scale
> implementations of them.  If we can facilitate them somehow then
> I see nothing wrong with that.
> 
> Neither you nor Hannes have made a strong enough argument for me
> to consider Tom's work not suitable for upstream.
> 
> Have you even looked at the example userspace use case he referenced
> and considered the constraints under which it operates?  I seriously
> doubt you did.

If you are referring to thrift and tls framing, yes indeed, I did. I
have experience with google protocol buffers and once took care of an
in-house RPC implementation. What I learned is that this approach is
prone to starvation or to building up huge messages in kernel space.
That is why XML streaming in the form of StAX from the Java world is
used more and more, and even Apache Jackson provides a streaming API
for JSON, which I once used because JSON messages materialized as
whole hash tables got too big and were prone to starve the receiver.
Even user space needs to be careful about what message sizes it
accepts, otherwise DoS attacks are possible and JVMs with small heaps
get OutOfMemoryExceptions. The same is true in other high-level
languages; non-GCed languages (or those without a copying garbage
collector) just reallocate and cause long-term fragmentation. Even
keeping multiple 16MB chunks for HTTP/2 in the kernel heap so that
user space can read them in one go seems very bad to me.
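
To spell out the kind of guard I mean, at the framing layer it is
something like this (a hypothetical sketch assuming a 4-byte
big-endian length prefix; MAX_MSG is whatever the service can afford
to buffer per message):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define MAX_MSG (1u << 20)	/* 1 MB: an application-chosen bound */

/* Return the payload length announced by a 4-byte big-endian length
 * prefix, or -1 if the peer announces more than we are willing to
 * buffer.  Without this check, a peer can make us allocate without
 * bound. */
static long parse_frame_len(const uint8_t hdr[4])
{
	uint32_t len;

	memcpy(&len, hdr, sizeof(len));
	len = ntohl(len);
	return len > MAX_MSG ? -1 : (long)len;
}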

None of those approaches delimits datagrams by "read() barriers".
I think the alternatives should be tried first. This framework seems
applicable to only a small fraction of RPC systems.

Thanks for following up, :)
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
       [not found]                       ` <20151124222109.GA86838@ast-mbp.thefacebook.com>
@ 2015-11-25 10:38                         ` Hannes Frederic Sowa
  0 siblings, 0 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2015-11-25 10:38 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, Tom Herbert, Florian Westphal, davem

Hi,

On Tue, Nov 24, 2015, at 23:21, Alexei Starovoitov wrote:
> On Tue, Nov 24, 2015 at 10:58:08PM +0100, Hannes Frederic Sowa wrote:
> > > > > 
> > > > > interesting idea, but then remote host will be influencing local cpu
> > > > > selection?
> > > > > how remote can figure out the number of local cpus?
> > > > 
> > > > Via rpc! :)
> > > > 
> > > > The configuration shouldn't change all the time and some get_info rpc
> > > > call could provide info for the topology of the machine, or...
> > > 
> > > Configuration changes all the time. Machines crash, traffic redirected
> > > because of load, etc, etc
> > 
> > Yeah, same problem you have to handle with the kcm approach.
> 
> not at all. No new remote configuration things are needed.

At some point you have to manage your ip addresses anyway, and if you
move a host subnet or an ip address around it should not matter much.

> > Just use GRO engine and TCP PSH flag, maybe by making it more
> > intentional via user space APIs on one connection. Wake up thread with
> > one complete RPC message. If receiver eats more data it is no problem,
> > if less, it will block as in your approach, too. Your worker threads
> > must handle that anyway and can buffer and concatenate it to the next
> > message to be processed.
> 
> of course not. kcm reading threads don't have to deal with it.

If there is no fallback for this you have to drop frames and kill the
TCP connection for all your RPC endpoints. Once you lose
synchronization with the headers there is no way to resync.
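
For illustration, the buffering I said the worker threads must handle
anyway is roughly this much code (a hypothetical sketch assuming a
4-byte length prefix; handle_message() stands in for the application's
callback), including the only option left once the length field stops
making sense:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>

#define BUF_SZ (64 * 1024)

extern void handle_message(const uint8_t *msg, uint32_t len);

static void worker(int fd)
{
	uint8_t buf[BUF_SZ];
	size_t have = 0;

	for (;;) {
		ssize_t n = read(fd, buf + have, sizeof(buf) - have);

		if (n <= 0)
			return;		/* error or EOF: conn is gone */
		have += n;

		while (have >= 4) {
			uint32_t len;

			memcpy(&len, buf, sizeof(len));
			len = ntohl(len);
			if (4 + (size_t)len > sizeof(buf))
				return;	/* oversized or out of sync:
					 * kill the connection */
			if (have < 4 + len)
				break;	/* partial msg: read more */
			handle_message(buf + 4, len);
			have -= 4 + len;
			memmove(buf, buf + 4 + len, have);
		}
	}
}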

> PS
> just noticed that you've excluded all. By accident?
> Feel free to reply to all.

Oh, thanks, yes, that was my mistake.

I am replying to all for full transparency.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
  2015-11-24 16:25           ` Florian Westphal
  2015-11-24 17:00             ` Tom Herbert
  2015-11-24 18:23             ` Hannes Frederic Sowa
@ 2015-11-25 16:26             ` Sowmini Varadhan
  2 siblings, 0 replies; 43+ messages in thread
From: Sowmini Varadhan @ 2015-11-25 16:26 UTC (permalink / raw)
  To: Florian Westphal
  Cc: David Miller, tom, hannes, netdev, kernel-team, davejwatson,
	alexei.starovoitov

On (11/24/15 17:25), Florian Westphal wrote:
> Its a well-written document, but I don't see how moving the burden of
> locking a single logical tcp connection (to prevent threads from
> reading a partial record) from userspace to kernel is an improvement.
> 
> If you really have 100 threads and must use a single tcp connection
> to multiplex some arbitrarily complex record-format in atomic fashion,
> then your requirements suck.

In the interest of providing some context from the rds-tcp use case
here (without drifting into hyperbole): RDS-TCP, like KCM, provides a
dgram-over-stream socket with SEQPACKET semantics and, per POSIX
SEQPACKET semantics, an upper-bounded record size. The major
difference from kcm is that it does not use BPF, but instead has its
own protocol header on each datagram.
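
Roughly, the shape of it is the following (a simplified sketch, not
the actual rds header layout):

#include <stdint.h>

/* Simplified sketch only -- the real rds header carries more than
 * this.  Every datagram is prefixed by a fixed header, so the
 * receiver delineates records from the header alone, with no parser
 * run over the byte stream. */
struct dgram_hdr {
	uint32_t len;	/* payload length of this datagram */
	uint32_t seq;	/* sequence number for ordering */
	uint16_t sport;	/* demux: source RDS port */
	uint16_t dport;	/* demux: destination RDS port */
} __attribute__((packed));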

There seems to be some misconception in this thread that this model is
about allowing the application to be "lazy" and do a 1:1 mapping
between streams; that's not the case for RDS.

In the case of cluster apps, we have DB apps that want a single dgram
socket to talk to multiple peers (i.e., a star network, with the node
at the center of the star holding dgram sockets to everyone else; the
scale is well beyond a mere 100 threads).

If that central node wants reliable, ordered, congestion-managed
delivery, it would have to use UDP plus a bunch of its own code for
sequence numbers, retransmission, etc. They are doing that today, but
don't want to reinvent TCP's congestion avoidance (and in fact, in the
absence of congestion, one complaint is that UDP latency is 2x-3x
better than rds-tcp for the 512 byte request / 8K response pattern
that is typical of DB workloads; I'm still investigating).

From the TCP standpoint of rds-tcp, we have a many-to-one mapping:
multiple RDS sockets funneling into a single tcp connection, sharing
a single congestion state machine.

I don't know if this is a "poorly designed application"; I'm sure
it's not perfect, but we have a ton of Oracle clustering s/w that's
already doing this over IB, so extending it with rds-tcp made sense
for us at this point.

--Sowmini

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-11-25 16:26 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-20 21:21 [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Tom Herbert
2015-11-20 21:21 ` [PATCH net-next 1/6] rcu: Add list_next_or_null_rcu Tom Herbert
2015-11-20 21:21 ` [PATCH net-next 2/6] net: Make sock_alloc exportable Tom Herbert
2015-11-20 21:21 ` [PATCH net-next 3/6] net: Add MSG_BATCH flag Tom Herbert
2015-11-23 10:02   ` Hannes Frederic Sowa
2015-11-20 21:21 ` [PATCH net-next 4/6] kcm: Kernel Connection Multiplexor module Tom Herbert
2015-11-20 22:50   ` Sowmini Varadhan
2015-11-20 23:19     ` Tom Herbert
2015-11-20 23:27       ` Sowmini Varadhan
2015-11-20 23:10   ` Alexei Starovoitov
2015-11-20 23:20     ` Tom Herbert
2015-11-23  9:42   ` Daniel Borkmann
2015-11-20 21:21 ` [PATCH net-next 5/6] kcm: Add statistics and proc interfaces Tom Herbert
2015-11-20 21:22 ` [PATCH net-next 6/6] kcm: Add description in Documentation Tom Herbert
2015-11-23  9:53 ` [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM) Hannes Frederic Sowa
2015-11-23 12:43   ` Sowmini Varadhan
2015-11-23 17:33   ` Tom Herbert
2015-11-23 19:35     ` Hannes Frederic Sowa
2015-11-23 19:54     ` David Miller
2015-11-23 20:02       ` Tom Herbert
2015-11-24 11:25       ` Hannes Frederic Sowa
2015-11-24 15:49         ` David Miller
2015-11-24 15:27       ` Florian Westphal
2015-11-24 15:49         ` Eric Dumazet
2015-11-24 18:09           ` Rick Jones
2015-11-24 15:55         ` David Miller
2015-11-24 16:25           ` Florian Westphal
2015-11-24 17:00             ` Tom Herbert
2015-11-24 17:16               ` Florian Westphal
2015-11-24 17:43                 ` Tom Herbert
2015-11-24 20:55                   ` Florian Westphal
2015-11-24 21:49                     ` Tom Herbert
2015-11-24 22:22                       ` Florian Westphal
2015-11-24 22:25                         ` David Miller
2015-11-24 22:45                           ` Florian Westphal
2015-11-24 23:13                           ` Hannes Frederic Sowa
2015-11-24 18:23             ` Hannes Frederic Sowa
2015-11-24 18:59               ` Alexei Starovoitov
2015-11-24 19:16                 ` Hannes Frederic Sowa
2015-11-24 19:26                   ` Hannes Frederic Sowa
2015-11-24 20:23                   ` Alexei Starovoitov
     [not found]                     ` <1448402288.1489559.449199721.64EBB346@webmail.messagingengine.com>
     [not found]                       ` <20151124222109.GA86838@ast-mbp.thefacebook.com>
2015-11-25 10:38                         ` Hannes Frederic Sowa
2015-11-25 16:26             ` Sowmini Varadhan
