* [RFC 0/5] Introduce VM Sockets virtio transport
@ 2013-06-27  7:59 Asias He
  2013-06-27  8:00 ` [RFC 1/5] VSOCK: Introduce vsock_find_unbound_socket and vsock_bind_dgram_generic Asias He
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Asias He @ 2013-06-27  7:59 UTC (permalink / raw)
  To: netdev, kvm, virtualization
  Cc: Andy King, Michael S. Tsirkin, Reilly Grant, Pekka Enberg,
	Sasha Levin, David S. Miller, Dmitry Torokhov

Hello guys,

In commit d021c344051af91 (VSOCK: Introduce VM Sockets), VMware added VM
Sockets support. VM Sockets allows communication between virtual
machines and the hypervisor. VM Sockets can use different
hypervisor-neutral transports to transfer data; currently, only the
VMware VMCI transport is supported.

This series introduces virtio transport for VM Sockets.

Any comments are appreciated! Thanks!

Code:
=========================
1) kernel bits
   git://github.com/asias/linux.git vsock

2) userspace bits:
   git://github.com/asias/linux-kvm.git vsock

Howto:
=========================
Make sure you have these kernel options:

  CONFIG_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS=y
  CONFIG_VIRTIO_VSOCKETS_COMMON=y
  CONFIG_VHOST_VSOCK=m

$ git clone git://github.com/asias/linux-kvm.git
$ cd linux-kvm/tools/kvm
$ git checkout -b vsock origin/vsock
$ make
$ modprobe vhost_vsock
$ ./lkvm run -d os.img -k bzImage --vsock guest_cid
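
For example, with an arbitrary guest CID (the value 3 below is only an
illustration; any unused context id should work):

$ ./lkvm run -d os.img -k bzImage --vsock 3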

Test:
=========================
I hacked busybox's http server and wget to run over vsock. I started the
http server in both host and guest, then downloaded a 512MB file in guest
and host simultaneously 6000 times. The http stress test ran successfully.

Also, I wrote a small libvsock.so to play the LD_PRELOAD trick and
managed to make sshd and ssh work over virtio-vsock without modifying
their source code.
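
That shim is not part of this series; the snippet below is only a minimal
sketch of the LD_PRELOAD idea, redirecting TCP stream sockets to AF_VSOCK
(a real shim also has to interpose connect()/bind() to rewrite the
addresses, which is omitted here):

  /* vsock_shim.c -- hypothetical, illustration only */
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <sys/socket.h>

  #ifndef AF_VSOCK
  #define AF_VSOCK 40       /* kernel's address family number for VM Sockets */
  #endif

  int socket(int domain, int type, int protocol)
  {
          static int (*real_socket)(int, int, int);

          if (!real_socket)
                  real_socket = (int (*)(int, int, int))
                                dlsym(RTLD_NEXT, "socket");

          /* Redirect TCP stream sockets to vsock stream sockets. */
          if (domain == AF_INET && type == SOCK_STREAM)
                  return real_socket(AF_VSOCK, SOCK_STREAM, 0);

          return real_socket(domain, type, protocol);
  }

Built as a shared object (e.g. gcc -shared -fPIC -o libvsock.so
vsock_shim.c -ldl) and loaded via LD_PRELOAD, it lets an unmodified
program open vsock sockets instead of TCP ones.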

Draft VM Sockets Virtio Device spec:
=========================
Appendix K: VM Sockets Device

The virtio VM sockets device is a virtio transport device for VM Sockets. VM
Sockets allows communication between virtual machines and the hypervisor.

Configuration:

Subsystem Device ID 13

Virtqueues:
    0:controlq; 1:receiveq0; 2:transmitq0 ... 2N+1:receiveqN; 2N+2:transmitqN

Feature bits:
    Currently, no feature bits are defined.

Device configuration layout:

Two configuration fields are currently defined.

   struct virtio_vsock_config {
           __u32 guest_cid;
           __u32 max_virtqueue_pairs;
   } __packed;

The guest_cid field specifies the guest context id, which acts like a guest
IP address. The max_virtqueue_pairs field specifies the maximum number of
receive and transmit virtqueue pairs (receiveq0 ... receiveqN and
transmitq0 ... transmitqN respectively; N = max_virtqueue_pairs - 1) that
can be configured. The driver is free to use only one virtqueue pair, or it
can use more to achieve better performance.
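
For illustration, a driver could read both fields with the usual virtio
config accessors, as the probe routine in patch 3/5 does for guest_cid
(reading max_virtqueue_pairs the same way is an assumption of this sketch):

   struct virtio_vsock_config cfg;

   vdev->config->get(vdev,
                     offsetof(struct virtio_vsock_config, guest_cid),
                     &cfg.guest_cid, sizeof(cfg.guest_cid));
   vdev->config->get(vdev,
                     offsetof(struct virtio_vsock_config, max_virtqueue_pairs),
                     &cfg.max_virtqueue_pairs, sizeof(cfg.max_virtqueue_pairs));

   /* N = cfg.max_virtqueue_pairs - 1: receiveqN is virtqueue 2N+1,
    * transmitqN is virtqueue 2N+2. */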

Device Initialization:
The initialization routine should discover the device's virtqueues.

Device Operation:
Packets are transmitted by placing them in the transmitq0..transmitqN, and
buffers for incoming packets are placed in the receiveq0..receiveqN. In each
case, the packet itself is preceded by a header:

   struct virtio_vsock_hdr {
           __u32   src_cid;
           __u32   src_port;
           __u32   dst_cid;
           __u32   dst_port;
           __u32   len;
           __u8    type;
           __u8    op;
           __u8    shut;
           __u64   fwd_cnt;
           __u64   buf_alloc;
   } __packed;

src_cid and dst_cid: specify the source and destination context id.
src_port and dst_port: specify the source and destination port.
len: specifies the size of the data payload; it can be zero if no data
payload is transferred.
type: specifies the type of the packet; it can be SOCK_STREAM or SOCK_DGRAM.
op: specifies the operation of the packet; it is defined as follows.

   enum {
           VIRTIO_VSOCK_OP_INVALID = 0,
           VIRTIO_VSOCK_OP_REQUEST = 1,
           VIRTIO_VSOCK_OP_NEGOTIATE = 2,
           VIRTIO_VSOCK_OP_OFFER = 3,
           VIRTIO_VSOCK_OP_ATTACH = 4,
           VIRTIO_VSOCK_OP_RW = 5,
           VIRTIO_VSOCK_OP_CREDIT = 6,
           VIRTIO_VSOCK_OP_RST = 7,
           VIRTIO_VSOCK_OP_SHUTDOWN = 8,
   };

shut: specifies the shutdown mode when the socket is being shut down. 1 is
for receive shutdown, 2 is for transmit shutdown, 3 is for both receive and
transmit shutdown.
fwd_cnt: specifies the number of bytes the receiver has forwarded to userspace.
buf_alloc: specifies the size of the receiver's receive buffer in bytes.
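
As an illustration of how these fields fit together, a data packet on an
established stream connection would carry a header along these lines (all
values below are hypothetical):

   struct virtio_vsock_hdr hdr = {
           .src_cid   = 3,           /* this guest's guest_cid */
           .src_port  = 1024,
           .dst_cid   = 2,           /* the peer, e.g. the host */
           .dst_port  = 22,
           .len       = 512,         /* payload bytes following the header */
           .type      = SOCK_STREAM,
           .op        = VIRTIO_VSOCK_OP_RW,
           .shut      = 0,
           .fwd_cnt   = 204800,      /* bytes forwarded to userspace so far */
           .buf_alloc = 262144,      /* advertised receive buffer size */
   };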

Virtio VM socket connection creation (a minimal client-side sketch follows
the list):
1) Client sends VIRTIO_VSOCK_OP_REQUEST to server
2) Server responds with VIRTIO_VSOCK_OP_NEGOTIATE to client
3) Client sends VIRTIO_VSOCK_OP_OFFER to server
4) Server responds with VIRTIO_VSOCK_OP_ATTACH to client
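
Step 1) maps onto virtio_transport_connect() in patch 2/5; the sketch
below only shows the packet info it builds (error handling omitted, steps
2)-4) are then driven from the transport's receive path):

   struct virtio_vsock_pkt_info info = {
           .op   = VIRTIO_VSOCK_OP_REQUEST,
           .type = SOCK_STREAM,
   };

   return trans->ops->send_pkt(vsk, &info);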

Virtio VM socket credit update:
Virtio VM socket uses credit-based flow control. The sender maintains
tx_cnt, which counts the total number of bytes it has sent out; peer_fwd_cnt,
which counts the total number of bytes the receiver has forwarded; and
peer_buf_alloc, which is the size of the receiver's receive buffer. The
sender can send no more than the credit the receiver gives it:
credit = peer_buf_alloc - (tx_cnt - peer_fwd_cnt). The receiver can send a
VIRTIO_VSOCK_OP_CREDIT packet to tell the sender its current fwd_cnt and
buf_alloc values explicitly. However, as an optimization, fwd_cnt and
buf_alloc are always included in the packet header virtio_vsock_hdr.
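
A worked example with hypothetical numbers: if the receiver advertised
buf_alloc = 262144 bytes (256 KiB), the sender has transmitted
tx_cnt = 307200 bytes in total, and the last header received carried
fwd_cnt = 204800, then:

   credit = peer_buf_alloc - (tx_cnt - peer_fwd_cnt)
          = 262144 - (307200 - 204800)
          = 159744 bytes

so the sender may queue at most 159744 more bytes before it has to wait
for the peer to report a larger fwd_cnt.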

The guest driver should keep the receive virtqueue as fully populated as
possible: if it runs out of buffers, performance will suffer.

The controlq is used to control the device. Currently, no control
operations are defined.

Asias He (5):
  VSOCK: Introduce vsock_find_unbound_socket and
    vsock_bind_dgram_generic
  VSOCK: Introduce virtio-vsock-common.ko
  VSOCK: Introduce virtio-vsock.ko
  VSOCK: Introduce vhost-vsock.ko
  VSOCK: Add Makefile and Kconfig

 drivers/vhost/Kconfig                   |   4 +
 drivers/vhost/Kconfig.vsock             |   7 +
 drivers/vhost/Makefile                  |   5 +
 drivers/vhost/vsock.c                   | 534 +++++++++++++++++
 drivers/vhost/vsock.h                   |   4 +
 include/linux/virtio_vsock.h            | 200 +++++++
 include/uapi/linux/virtio_ids.h         |   1 +
 include/uapi/linux/virtio_vsock.h       |  70 +++
 net/vmw_vsock/Kconfig                   |  18 +
 net/vmw_vsock/Makefile                  |   4 +
 net/vmw_vsock/af_vsock.c                |  70 +++
 net/vmw_vsock/af_vsock.h                |   2 +
 net/vmw_vsock/virtio_transport.c        | 424 ++++++++++++++
 net/vmw_vsock/virtio_transport_common.c | 992 ++++++++++++++++++++++++++++++++
 14 files changed, 2335 insertions(+)
 create mode 100644 drivers/vhost/Kconfig.vsock
 create mode 100644 drivers/vhost/vsock.c
 create mode 100644 drivers/vhost/vsock.h
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport.c
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC 1/5] VSOCK: Introduce vsock_find_unbound_socket and vsock_bind_dgram_generic
  2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
@ 2013-06-27  8:00 ` Asias He
  2013-06-27  8:00 ` [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko Asias He
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-27  8:00 UTC (permalink / raw)
  To: netdev, kvm, virtualization
  Cc: Andy King, Michael S. Tsirkin, Reilly Grant, Pekka Enberg,
	Sasha Levin, David S. Miller, Dmitry Torokhov

Signed-off-by: Asias He <asias@redhat.com>
---
 net/vmw_vsock/af_vsock.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/vmw_vsock/af_vsock.h |  2 ++
 2 files changed, 72 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 593071d..bc76ddb 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -225,6 +225,17 @@ static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
 	return NULL;
 }
 
+static struct sock *__vsock_find_unbound_socket(struct sockaddr_vm *addr)
+{
+	struct vsock_sock *vsk;
+
+	list_for_each_entry(vsk, vsock_unbound_sockets, bound_table)
+		if (addr->svm_port == vsk->local_addr.svm_port)
+			return sk_vsock(vsk);
+
+	return NULL;
+}
+
 static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
 						  struct sockaddr_vm *dst)
 {
@@ -300,6 +311,21 @@ struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr)
 }
 EXPORT_SYMBOL_GPL(vsock_find_bound_socket);
 
+struct sock *vsock_find_unbound_socket(struct sockaddr_vm *addr)
+{
+	struct sock *sk;
+
+	spin_lock_bh(&vsock_table_lock);
+	sk = __vsock_find_unbound_socket(addr);
+	if (sk)
+		sock_hold(sk);
+
+	spin_unlock_bh(&vsock_table_lock);
+
+	return sk;
+}
+EXPORT_SYMBOL_GPL(vsock_find_unbound_socket);
+
 struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
 					 struct sockaddr_vm *dst)
 {
@@ -534,6 +560,50 @@ static int __vsock_bind_stream(struct vsock_sock *vsk,
 	return 0;
 }
 
+int vsock_bind_dgram_generic(struct vsock_sock *vsk, struct sockaddr_vm *addr)
+{
+	static u32 port = LAST_RESERVED_PORT + 1;
+	struct sockaddr_vm new_addr;
+
+	vsock_addr_init(&new_addr, addr->svm_cid, addr->svm_port);
+
+	if (addr->svm_port == VMADDR_PORT_ANY) {
+		bool found = false;
+		unsigned int i;
+
+		for (i = 0; i < MAX_PORT_RETRIES; i++) {
+			if (port <= LAST_RESERVED_PORT)
+				port = LAST_RESERVED_PORT + 1;
+
+			new_addr.svm_port = port++;
+
+			if (!__vsock_find_unbound_socket(&new_addr)) {
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+			return -EADDRNOTAVAIL;
+	} else {
+		/* If port is in reserved range, ensure caller
+		 * has necessary privileges.
+		 */
+		if (addr->svm_port <= LAST_RESERVED_PORT &&
+		    !capable(CAP_NET_BIND_SERVICE)) {
+			return -EACCES;
+		}
+
+		if (__vsock_find_unbound_socket(&new_addr))
+			return -EADDRINUSE;
+	}
+
+	vsock_addr_init(&vsk->local_addr, new_addr.svm_cid, new_addr.svm_port);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vsock_bind_dgram_generic);
+
 static int __vsock_bind_dgram(struct vsock_sock *vsk,
 			      struct sockaddr_vm *addr)
 {
diff --git a/net/vmw_vsock/af_vsock.h b/net/vmw_vsock/af_vsock.h
index 7d64d36..88f559a 100644
--- a/net/vmw_vsock/af_vsock.h
+++ b/net/vmw_vsock/af_vsock.h
@@ -168,8 +168,10 @@ void vsock_insert_connected(struct vsock_sock *vsk);
 void vsock_remove_bound(struct vsock_sock *vsk);
 void vsock_remove_connected(struct vsock_sock *vsk);
 struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
+struct sock *vsock_find_unbound_socket(struct sockaddr_vm *addr);
 struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
 					 struct sockaddr_vm *dst);
 void vsock_for_each_connected_socket(void (*fn)(struct sock *sk));
+int vsock_bind_dgram_generic(struct vsock_sock *vsk, struct sockaddr_vm *addr);
 
 #endif /* __AF_VSOCK_H__ */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko
  2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
  2013-06-27  8:00 ` [RFC 1/5] VSOCK: Introduce vsock_find_unbound_socket and vsock_bind_dgram_generic Asias He
@ 2013-06-27  8:00 ` Asias He
  2013-06-27 10:34   ` Michael S. Tsirkin
                     ` (2 more replies)
  2013-06-27  8:00 ` [RFC 3/5] VSOCK: Introduce virtio-vsock.ko Asias He
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 21+ messages in thread
From: Asias He @ 2013-06-27  8:00 UTC (permalink / raw)
  To: netdev, kvm, virtualization
  Cc: Andy King, Michael S. Tsirkin, Reilly Grant, Pekka Enberg,
	Sasha Levin, David S. Miller, Dmitry Torokhov

This module contains the common code and header files for the virtio-vsock
(guest side) and vhost-vsock (host side) kernel modules that follow.

Signed-off-by: Asias He <asias@redhat.com>
---
 include/linux/virtio_vsock.h            | 200 +++++++
 include/uapi/linux/virtio_ids.h         |   1 +
 include/uapi/linux/virtio_vsock.h       |  70 +++
 net/vmw_vsock/virtio_transport_common.c | 992 ++++++++++++++++++++++++++++++++
 4 files changed, 1263 insertions(+)
 create mode 100644 include/linux/virtio_vsock.h
 create mode 100644 include/uapi/linux/virtio_vsock.h
 create mode 100644 net/vmw_vsock/virtio_transport_common.c

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
new file mode 100644
index 0000000..cd8ed95
--- /dev/null
+++ b/include/linux/virtio_vsock.h
@@ -0,0 +1,200 @@
+/*
+ * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers:
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) Red Hat, Inc., 2013
+ * Copyright (C) Asias He <asias@redhat.com>, 2013
+ */
+
+#ifndef _LINUX_VIRTIO_VSOCK_H
+#define _LINUX_VIRTIO_VSOCK_H
+
+#include <uapi/linux/virtio_vsock.h>
+#include <linux/socket.h>
+#include <net/sock.h>
+
+#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE	128
+#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE		(1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE	(1024 * 256)
+#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
+#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
+
+struct vsock_transport_recv_notify_data;
+struct vsock_transport_send_notify_data;
+struct sockaddr_vm;
+struct vsock_sock;
+
+enum {
+	VSOCK_VQ_CTRL	= 0,
+	VSOCK_VQ_RX	= 1, /* for host to guest data */
+	VSOCK_VQ_TX	= 2, /* for guest to host data */
+	VSOCK_VQ_MAX	= 3,
+};
+
+/* virtio transport socket state */
+struct virtio_transport {
+	struct virtio_transport_pkt_ops	*ops;
+	struct vsock_sock *vsk;
+
+	u64 buf_size;
+	u64 buf_size_min;
+	u64 buf_size_max;
+
+	struct mutex tx_lock;
+	struct mutex rx_lock;
+
+	struct list_head rx_queue;
+	u64 rx_bytes;
+
+	/* Protected by trans->tx_lock */
+	u64 tx_cnt;
+	u64 buf_alloc;
+	u64 peer_fwd_cnt;
+	u64 peer_buf_alloc;
+	/* Protected by trans->rx_lock */
+	u64 fwd_cnt;
+};
+
+struct virtio_vsock_pkt {
+	struct virtio_vsock_hdr	hdr;
+	struct virtio_transport	*trans;
+	struct work_struct work;
+	struct list_head list;
+	void *buf;
+	u32 len;
+	u32 off;
+};
+
+struct virtio_vsock_pkt_info {
+	struct sockaddr_vm *src;
+	struct sockaddr_vm *dst;
+	struct iovec *iov;
+	u32 len;
+	u8 type;
+	u8 op;
+	u8 shut;
+};
+
+struct virtio_transport_pkt_ops {
+	int (*send_pkt)(struct vsock_sock *vsk,
+			struct virtio_vsock_pkt_info *info);
+};
+
+void virtio_vsock_dumppkt(const char *func,
+			  const struct virtio_vsock_pkt *pkt);
+
+struct sock *
+virtio_transport_get_pending(struct sock *listener,
+			     struct virtio_vsock_pkt *pkt);
+struct virtio_vsock_pkt *
+virtio_transport_alloc_pkt(struct vsock_sock *vsk,
+			   struct virtio_vsock_pkt_info *info,
+			   size_t len,
+			   u32 src_cid,
+			   u32 src_port,
+			   u32 dst_cid,
+			   u32 dst_port);
+ssize_t
+virtio_transport_stream_dequeue(struct vsock_sock *vsk,
+				struct iovec *iov,
+				size_t len,
+				int type);
+int
+virtio_transport_dgram_dequeue(struct kiocb *kiocb,
+			       struct vsock_sock *vsk,
+			       struct msghdr *msg,
+			       size_t len, int flags);
+
+s64 virtio_transport_stream_has_data(struct vsock_sock *vsk);
+s64 virtio_transport_stream_has_space(struct vsock_sock *vsk);
+
+int virtio_transport_do_socket_init(struct vsock_sock *vsk,
+				 struct vsock_sock *psk);
+u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk);
+u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk);
+u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk);
+void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val);
+void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val);
+void virtio_transport_set_max_buffer_size(struct vsock_sock *vs, u64 val);
+int
+virtio_transport_notify_poll_in(struct vsock_sock *vsk,
+				size_t target,
+				bool *data_ready_now);
+int
+virtio_transport_notify_poll_out(struct vsock_sock *vsk,
+				 size_t target,
+				 bool *space_available_now);
+
+int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
+	size_t target, ssize_t copied, bool data_read,
+	struct vsock_transport_recv_notify_data *data);
+int virtio_transport_notify_send_init(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data);
+int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
+	ssize_t written, struct vsock_transport_send_notify_data *data);
+
+u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
+bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
+bool virtio_transport_stream_allow(u32 cid, u32 port);
+int virtio_transport_dgram_bind(struct vsock_sock *vsk,
+				struct sockaddr_vm *addr);
+bool virtio_transport_dgram_allow(u32 cid, u32 port);
+
+int virtio_transport_connect(struct vsock_sock *vsk);
+
+int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
+
+void virtio_transport_release(struct vsock_sock *vsk);
+
+ssize_t
+virtio_transport_stream_enqueue(struct vsock_sock *vsk,
+				struct iovec *iov,
+				size_t len);
+int
+virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+			       struct sockaddr_vm *remote_addr,
+			       struct iovec *iov,
+			       size_t len);
+
+void virtio_transport_destruct(struct vsock_sock *vsk);
+
+void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt);
+void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_pkt *pkt);
+void virtio_transport_dec_tx_pkt(struct virtio_vsock_pkt *pkt);
+u64 virtio_transport_get_credit(struct virtio_transport *trans);
+#endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 284fc3a..8a27609 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -39,5 +39,6 @@
 #define VIRTIO_ID_9P		9 /* 9p virtio console */
 #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
 #define VIRTIO_ID_CAIF	       12 /* Virtio caif */
+#define VIRTIO_ID_VSOCK        13 /* virtio vsock transport */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
new file mode 100644
index 0000000..0a58ac3
--- /dev/null
+++ b/include/uapi/linux/virtio_vsock.h
@@ -0,0 +1,70 @@
+/*
+ * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers:
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) Red Hat, Inc., 2013
+ * Copyright (C) Asias He <asias@redhat.com>, 2013
+ */
+
+#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
+#define _UAPI_LINUX_VIRTIO_VSOCK_H
+
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+
+struct virtio_vsock_config {
+	__u32 guest_cid;
+	__u32 max_virtqueue_pairs;
+} __packed;
+
+struct virtio_vsock_hdr {
+	__u32	src_cid;
+	__u32	src_port;
+	__u32	dst_cid;
+	__u32	dst_port;
+	__u32	len;
+	__u8	type;
+	__u8	op;
+	__u8	shut;
+	__u64	fwd_cnt;
+	__u64	buf_alloc;
+} __packed;
+
+enum {
+	VIRTIO_VSOCK_OP_INVALID = 0,
+	VIRTIO_VSOCK_OP_REQUEST = 1,
+	VIRTIO_VSOCK_OP_NEGOTIATE = 2,
+	VIRTIO_VSOCK_OP_OFFER = 3,
+	VIRTIO_VSOCK_OP_ATTACH = 4,
+	VIRTIO_VSOCK_OP_RW = 5,
+	VIRTIO_VSOCK_OP_CREDIT = 6,
+	VIRTIO_VSOCK_OP_RST = 7,
+	VIRTIO_VSOCK_OP_SHUTDOWN = 8,
+};
+
+#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
new file mode 100644
index 0000000..0482eb1
--- /dev/null
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -0,0 +1,992 @@
+/*
+ * common code for virtio vsock
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/module.h>
+#include <linux/ctype.h>
+#include <linux/list.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_vsock.h>
+
+#include <net/sock.h>
+#include "af_vsock.h"
+
+#define SS_LISTEN 255
+
+void virtio_vsock_dumppkt(const char *func,  const struct virtio_vsock_pkt *pkt)
+{
+	pr_debug("%s: pkt=%p, op=%d, len=%d, %d:%d---%d:%d, len=%d\n",
+		 func, pkt, pkt->hdr.op, pkt->hdr.len,
+		 pkt->hdr.src_cid, pkt->hdr.src_port,
+		 pkt->hdr.dst_cid, pkt->hdr.dst_port, pkt->len);
+}
+EXPORT_SYMBOL_GPL(virtio_vsock_dumppkt);
+
+struct virtio_vsock_pkt *
+virtio_transport_alloc_pkt(struct vsock_sock *vsk,
+			   struct virtio_vsock_pkt_info *info,
+			   size_t len,
+			   u32 src_cid,
+			   u32 src_port,
+			   u32 dst_cid,
+			   u32 dst_port)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt *pkt;
+	int err;
+
+	BUG_ON(!trans);
+
+	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+	if (!pkt)
+		return NULL;
+
+	pkt->hdr.type		= info->type;
+	pkt->hdr.op		= info->op;
+	pkt->hdr.src_cid	= src_cid;
+	pkt->hdr.src_port	= src_port;
+	pkt->hdr.dst_cid	= dst_cid;
+	pkt->hdr.dst_port	= dst_port;
+	pkt->hdr.len		= len;
+	pkt->hdr.shut		= info->shut;
+	pkt->len		= len;
+	pkt->trans		= trans;
+
+	if (info->iov && len > 0) {
+		pkt->buf = kmalloc(len, GFP_KERNEL);
+		if (!pkt->buf)
+			goto out_pkt;
+		err = memcpy_fromiovec(pkt->buf, info->iov, len);
+		if (err)
+			goto out;
+	}
+
+	return pkt;
+
+out:
+	kfree(pkt->buf);
+out_pkt:
+	kfree(pkt);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_alloc_pkt);
+
+struct sock *
+virtio_transport_get_pending(struct sock *listener,
+			     struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vlistener;
+	struct vsock_sock *vpending;
+	struct sockaddr_vm src;
+	struct sockaddr_vm dst;
+	struct sock *pending;
+
+	vsock_addr_init(&src, pkt->hdr.src_cid, pkt->hdr.src_port);
+	vsock_addr_init(&dst, pkt->hdr.dst_cid, pkt->hdr.dst_port);
+
+	vlistener = vsock_sk(listener);
+	list_for_each_entry(vpending, &vlistener->pending_links,
+			    pending_links) {
+		if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
+		    vsock_addr_equals_addr(&dst, &vpending->local_addr)) {
+			pending = sk_vsock(vpending);
+			sock_hold(pending);
+			return pending;
+		}
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_pending);
+
+static void virtio_transport_inc_rx_pkt(struct virtio_vsock_pkt *pkt)
+{
+	pkt->trans->rx_bytes += pkt->len;
+}
+
+static void virtio_transport_dec_rx_pkt(struct virtio_vsock_pkt *pkt)
+{
+	pkt->trans->rx_bytes -= pkt->len;
+	pkt->trans->fwd_cnt += pkt->len;
+}
+
+void virtio_transport_inc_tx_pkt(struct virtio_vsock_pkt *pkt)
+{
+	mutex_lock(&pkt->trans->tx_lock);
+	pkt->hdr.fwd_cnt = pkt->trans->fwd_cnt;
+	pkt->hdr.buf_alloc = pkt->trans->buf_alloc;
+	pkt->trans->tx_cnt += pkt->len;
+	mutex_unlock(&pkt->trans->tx_lock);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
+
+void virtio_transport_dec_tx_pkt(struct virtio_vsock_pkt *pkt)
+{
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dec_tx_pkt);
+
+u64 virtio_transport_get_credit(struct virtio_transport *trans)
+{
+	u64 credit;
+
+	mutex_lock(&trans->tx_lock);
+	credit = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
+	mutex_unlock(&trans->tx_lock);
+
+	pr_debug("credit=%lld, buf_alloc=%lld, peer_buf_alloc=%lld,"
+		 "tx_cnt=%lld, fwd_cnt=%lld, peer_fwd_cnt=%lld\n",
+		 credit, trans->buf_alloc, trans->peer_buf_alloc,
+		 trans->tx_cnt, trans->fwd_cnt, trans->peer_fwd_cnt);
+
+	return credit;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_credit);
+
+static int virtio_transport_send_credit(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_CREDIT,
+		.type = SOCK_STREAM,
+	};
+
+	pr_debug("%s: sk=%p send_credit\n", __func__, vsk);
+	return trans->ops->send_pkt(vsk, &info);
+}
+
+static ssize_t
+virtio_transport_do_dequeue(struct vsock_sock *vsk,
+			    struct iovec *iov,
+			    size_t len,
+			    int type)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt *pkt;
+	size_t bytes, total = 0;
+	int err = -EFAULT;
+
+	mutex_lock(&trans->rx_lock);
+	while (total < len && trans->rx_bytes > 0  &&
+			!list_empty(&trans->rx_queue)) {
+		pkt = list_first_entry(&trans->rx_queue,
+				       struct virtio_vsock_pkt, list);
+
+		if (pkt->hdr.type != type)
+			continue;
+
+		bytes = len - total;
+		if (bytes > pkt->len - pkt->off)
+			bytes = pkt->len - pkt->off;
+
+		err = memcpy_toiovec(iov, pkt->buf + pkt->off, bytes);
+		if (err)
+			goto out;
+		total += bytes;
+		pkt->off += bytes;
+		if (pkt->off == pkt->len) {
+			virtio_transport_dec_rx_pkt(pkt);
+			list_del(&pkt->list);
+			virtio_transport_free_pkt(pkt);
+		}
+	}
+	mutex_unlock(&trans->rx_lock);
+
+	/* Send a credit pkt to peer */
+	if (type == SOCK_STREAM)
+		virtio_transport_send_credit(vsk);
+
+	return total;
+
+out:
+	mutex_unlock(&trans->rx_lock);
+	if (total)
+		err = total;
+	return err;
+}
+
+ssize_t
+virtio_transport_stream_dequeue(struct vsock_sock *vsk,
+				struct iovec *iov,
+				size_t len, int flags)
+{
+	if (flags & MSG_PEEK)
+		return -EOPNOTSUPP;
+
+	return virtio_transport_do_dequeue(vsk, iov, len, SOCK_STREAM);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_dequeue);
+
+static void
+virtio_transport_recv_dgram(struct sock *sk,
+			    struct virtio_vsock_pkt *pkt)
+{
+	struct sk_buff *skb;
+	struct vsock_sock *vsk;
+	size_t size;
+
+	vsk = vsock_sk(sk);
+
+	pkt->len = pkt->hdr.len;
+	pkt->off = 0;
+
+	size = sizeof(*pkt) + pkt->len;
+	/* Attach the packet to the socket's receive queue as an sk_buff. */
+	skb = alloc_skb(size, GFP_ATOMIC);
+	if (!skb)
+		goto out;
+
+	/* sk_receive_skb() will do a sock_put(), so hold here. */
+	sock_hold(sk);
+	skb_put(skb, size);
+	memcpy(skb->data, pkt, sizeof(*pkt));
+	memcpy(skb->data + sizeof(*pkt), pkt->buf, pkt->len);
+
+	sk_receive_skb(sk, skb, 0);
+out:
+	virtio_transport_free_pkt(pkt);
+}
+
+int
+virtio_transport_dgram_dequeue(struct kiocb *kiocb,
+			       struct vsock_sock *vsk,
+			       struct msghdr *msg,
+			       size_t len, int flags)
+{
+	struct virtio_vsock_pkt *pkt;
+	struct sk_buff *skb;
+	size_t payload_len;
+	int noblock;
+	int err;
+
+	noblock = flags & MSG_DONTWAIT;
+
+	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
+		return -EOPNOTSUPP;
+
+	msg->msg_namelen = 0;
+
+	/* Retrieve the head sk_buff from the socket's receive queue. */
+	err = 0;
+	skb = skb_recv_datagram(&vsk->sk, flags, noblock, &err);
+	if (err)
+		return err;
+	if (!skb)
+		return -EAGAIN;
+
+	pkt = (struct virtio_vsock_pkt *)skb->data;
+	if (!pkt)
+		goto out;
+
+	/* FIXME: check payload_len */
+	payload_len = pkt->len;
+
+	/* Place the datagram payload in the user's iovec. */
+	err = skb_copy_datagram_iovec(skb, sizeof(*pkt),
+				      msg->msg_iov, payload_len);
+	if (err)
+		goto out;
+
+	if (msg->msg_name) {
+		struct sockaddr_vm *vm_addr;
+
+		/* Provide the address of the sender. */
+		vm_addr = (struct sockaddr_vm *)msg->msg_name;
+		vsock_addr_init(vm_addr, pkt->hdr.src_cid, pkt->hdr.src_port);
+		msg->msg_namelen = sizeof(*vm_addr);
+	}
+	err = payload_len;
+
+out:
+	skb_free_datagram(&vsk->sk, skb);
+	return err;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
+
+s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+	size_t bytes = 0;
+
+	mutex_lock(&trans->rx_lock);
+	bytes = trans->rx_bytes;
+	mutex_unlock(&trans->rx_lock);
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_has_data);
+
+static s64 __virtio_transport_stream_has_space(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+	size_t bytes = 0;
+
+	bytes = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
+	if (bytes < 0)
+		bytes = 0;
+
+	return bytes;
+}
+
+s64 virtio_transport_stream_has_space(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+	size_t bytes = 0;
+
+	mutex_lock(&trans->tx_lock);
+	bytes = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
+	if (bytes < 0)
+		bytes = 0;
+	mutex_unlock(&trans->tx_lock);
+	pr_debug("%s: bytes=%ld\n", __func__, bytes);
+
+	return bytes;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_has_space);
+
+int virtio_transport_do_socket_init(struct vsock_sock *vsk,
+				    struct vsock_sock *psk)
+{
+	struct virtio_transport *trans;
+
+	trans = kzalloc(sizeof(*trans), GFP_KERNEL);
+	if (!trans)
+		return -ENOMEM;
+
+	vsk->trans = trans;
+	trans->vsk = vsk;
+	if (psk) {
+		struct virtio_transport *ptrans = psk->trans;
+		trans->buf_size	= ptrans->buf_size;
+		trans->buf_size_min = ptrans->buf_size_min;
+		trans->buf_size_max = ptrans->buf_size_max;
+	} else {
+		trans->buf_size = VIRTIO_VSOCK_DEFAULT_BUF_SIZE;
+		trans->buf_size_min = VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE;
+		trans->buf_size_max = VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE;
+	}
+
+	trans->buf_alloc = trans->buf_size;
+	pr_debug("%s: trans->buf_alloc=%lld\n", __func__, trans->buf_alloc);
+
+	mutex_init(&trans->rx_lock);
+	mutex_init(&trans->tx_lock);
+	INIT_LIST_HEAD(&trans->rx_queue);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_do_socket_init);
+
+u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	return trans->buf_size;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_buffer_size);
+
+u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	return trans->buf_size_min;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_min_buffer_size);
+
+u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	return trans->buf_size_max;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_get_max_buffer_size);
+
+void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	if (val < trans->buf_size_min)
+		trans->buf_size_min = val;
+	if (val > trans->buf_size_max)
+		trans->buf_size_max = val;
+	trans->buf_size = val;
+	trans->buf_alloc = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_buffer_size);
+
+void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	if (val > trans->buf_size)
+		trans->buf_size = val;
+	trans->buf_size_min = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_min_buffer_size);
+
+void virtio_transport_set_max_buffer_size(struct vsock_sock *vsk, u64 val)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	if (val < trans->buf_size)
+		trans->buf_size = val;
+	trans->buf_size_max = val;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_max_buffer_size);
+
+int
+virtio_transport_notify_poll_in(struct vsock_sock *vsk,
+				size_t target,
+				bool *data_ready_now)
+{
+	if (vsock_stream_has_data(vsk))
+		*data_ready_now = true;
+	else
+		*data_ready_now = false;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_in);
+
+int
+virtio_transport_notify_poll_out(struct vsock_sock *vsk,
+				 size_t target,
+				 bool *space_avail_now)
+{
+	s64 free_space;
+
+	free_space = vsock_stream_has_space(vsk);
+	if (free_space > 0)
+		*space_avail_now = true;
+	else if (free_space == 0)
+		*space_avail_now = false;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_out);
+
+int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_init);
+
+int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_block);
+
+int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
+	size_t target, struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_dequeue);
+
+int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
+	size_t target, ssize_t copied, bool data_read,
+	struct vsock_transport_recv_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_post_dequeue);
+
+int virtio_transport_notify_send_init(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_init);
+
+int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_block);
+
+int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
+	struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_enqueue);
+
+int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
+	ssize_t written, struct vsock_transport_send_notify_data *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_notify_send_post_enqueue);
+
+u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	return trans->buf_size;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_rcvhiwat);
+
+bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
+{
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
+
+bool virtio_transport_stream_allow(u32 cid, u32 port)
+{
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
+
+int virtio_transport_dgram_bind(struct vsock_sock *vsk,
+				struct sockaddr_vm *addr)
+{
+	return vsock_bind_dgram_generic(vsk, addr);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
+
+bool virtio_transport_dgram_allow(u32 cid, u32 port)
+{
+	return true;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
+
+int virtio_transport_connect(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_REQUEST,
+		.type = SOCK_STREAM,
+	};
+
+	pr_debug("%s: vsk=%p send_request\n", __func__, vsk);
+	return trans->ops->send_pkt(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_connect);
+
+int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_SHUTDOWN,
+		.type = SOCK_STREAM,
+		.shut = mode,
+	};
+
+	pr_debug("%s: vsk=%p: send_shutdown\n", __func__, vsk);
+	return trans->ops->send_pkt(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_shutdown);
+
+void virtio_transport_release(struct vsock_sock *vsk)
+{
+	pr_debug("%s: vsk=%p\n", __func__, vsk);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_release);
+
+int
+virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
+			       struct sockaddr_vm *remote_addr,
+			       struct iovec *iov,
+			       size_t len)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RW,
+		.type = SOCK_DGRAM,
+		.iov = iov,
+		.len = len,
+	};
+
+	vsk->remote_addr = *remote_addr;
+	return trans->ops->send_pkt(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
+
+ssize_t
+virtio_transport_stream_enqueue(struct vsock_sock *vsk,
+				struct iovec *iov,
+				size_t len)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RW,
+		.type = SOCK_STREAM,
+		.iov = iov,
+		.len = len,
+	};
+
+	return trans->ops->send_pkt(vsk, &info);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_stream_enqueue);
+
+void virtio_transport_destruct(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+
+	pr_debug("%s: vsk=%p\n", __func__, vsk);
+	kfree(trans);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_destruct);
+
+static int virtio_transport_send_attach(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_ATTACH,
+		.type = SOCK_STREAM,
+	};
+
+	pr_debug("%s: vsk=%p send_attach\n", __func__, vsk);
+	return trans->ops->send_pkt(vsk, &info);
+}
+
+static int virtio_transport_send_offer(struct vsock_sock *vsk)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_OFFER,
+		.type = SOCK_STREAM,
+	};
+
+	pr_debug("%s: sk=%p send_offer\n", __func__, vsk);
+	return trans->ops->send_pkt(vsk, &info);
+}
+
+static int virtio_transport_send_reset(struct vsock_sock *vsk,
+				       struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_RST,
+		.type = SOCK_STREAM,
+	};
+
+	pr_debug("%s\n", __func__);
+
+	/* Send RST only if the original pkt is not a RST pkt */
+	if (pkt->hdr.op == VIRTIO_VSOCK_OP_RST)
+		return 0;
+
+	return trans->ops->send_pkt(vsk, &info);
+}
+
+static int
+virtio_transport_recv_connecting(struct sock *sk,
+				 struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	int err;
+	int skerr;
+
+	pr_debug("%s: vsk=%p\n", __func__, vsk);
+	switch (pkt->hdr.op) {
+	case VIRTIO_VSOCK_OP_ATTACH:
+		pr_debug("%s: got attach\n", __func__);
+		sk->sk_state = SS_CONNECTED;
+		sk->sk_socket->state = SS_CONNECTED;
+		vsock_insert_connected(vsk);
+		sk->sk_state_change(sk);
+		break;
+	case VIRTIO_VSOCK_OP_NEGOTIATE:
+		pr_debug("%s: got negotiate and send_offer\n", __func__);
+		err = virtio_transport_send_offer(vsk);
+		if (err < 0) {
+			skerr = -err;
+			goto destroy;
+		}
+		break;
+	case VIRTIO_VSOCK_OP_INVALID:
+		pr_debug("%s: got invalid\n", __func__);
+		break;
+	case VIRTIO_VSOCK_OP_RST:
+		pr_debug("%s: got rst\n", __func__);
+		skerr = ECONNRESET;
+		err = 0;
+		goto destroy;
+	default:
+		pr_debug("%s: got def\n", __func__);
+		skerr = EPROTO;
+		err = -EINVAL;
+		goto destroy;
+	}
+	return 0;
+
+destroy:
+	virtio_transport_send_reset(vsk, pkt);
+	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_err = skerr;
+	sk->sk_error_report(sk);
+	return err;
+}
+
+static int
+virtio_transport_recv_connected(struct sock *sk,
+				struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct virtio_transport *trans = vsk->trans;
+	int err = 0;
+
+	switch (pkt->hdr.op) {
+	case VIRTIO_VSOCK_OP_RW:
+		pkt->len = pkt->hdr.len;
+		pkt->off = 0;
+		pkt->trans = trans;
+
+		mutex_lock(&trans->rx_lock);
+		virtio_transport_inc_rx_pkt(pkt);
+		list_add_tail(&pkt->list, &trans->rx_queue);
+		mutex_unlock(&trans->rx_lock);
+
+		sk->sk_data_ready(sk, pkt->len);
+		return err;
+	case VIRTIO_VSOCK_OP_CREDIT:
+		sk->sk_write_space(sk);
+		break;
+	case VIRTIO_VSOCK_OP_SHUTDOWN:
+		pr_debug("%s: got shutdown\n", __func__);
+		if (pkt->hdr.shut) {
+			vsk->peer_shutdown |= pkt->hdr.shut;
+			sk->sk_state_change(sk);
+		}
+		break;
+	case VIRTIO_VSOCK_OP_RST:
+		pr_debug("%s: got rst\n", __func__);
+		sock_set_flag(sk, SOCK_DONE);
+		vsk->peer_shutdown = SHUTDOWN_MASK;
+		if (vsock_stream_has_data(vsk) <= 0)
+			sk->sk_state = SS_DISCONNECTING;
+		sk->sk_state_change(sk);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	virtio_transport_free_pkt(pkt);
+	return err;
+}
+
+static int
+virtio_transport_send_negotiate(struct vsock_sock *vsk,
+				struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_transport *trans = vsk->trans;
+	struct virtio_vsock_pkt_info info = {
+		.op = VIRTIO_VSOCK_OP_NEGOTIATE,
+		.type = SOCK_STREAM,
+	};
+
+	pr_debug("%s: send_negotiate\n", __func__);
+
+	return trans->ops->send_pkt(vsk, &info);
+}
+
+/* Handle server socket */
+static int
+virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct vsock_sock *vpending;
+	struct sock *pending;
+	int err;
+
+	pending = virtio_transport_get_pending(sk, pkt);
+	if (pending) {
+		pr_debug("virtio_transport_recv_listen: get pending\n");
+		vpending = vsock_sk(pending);
+		lock_sock(pending);
+		switch (pending->sk_state) {
+		case SS_CONNECTING:
+			if (pkt->hdr.op != VIRTIO_VSOCK_OP_OFFER) {
+				pr_debug("%s: != OP_OFFER op=%d\n",
+					 __func__, pkt->hdr.op);
+				virtio_transport_send_reset(vpending, pkt);
+				pending->sk_err = EPROTO;
+				pending->sk_state = SS_UNCONNECTED;
+				sock_put(pending);
+			} else {
+				pending->sk_state = SS_CONNECTED;
+				vsock_insert_connected(vpending);
+
+				vsock_remove_pending(sk, pending);
+				vsock_enqueue_accept(sk, pending);
+
+				virtio_transport_send_attach(vpending);
+				sk->sk_state_change(sk);
+			}
+			err = 0;
+			break;
+		default:
+			pr_debug("%s: sk->sk_ack_backlog=%d\n", __func__,
+				 sk->sk_ack_backlog);
+			virtio_transport_send_reset(vpending, pkt);
+			err = -EINVAL;
+			break;
+		}
+		if (err < 0)
+			vsock_remove_pending(sk, pending);
+		release_sock(pending);
+
+		/* Release refcnt obtained in virtio_transport_get_pending */
+		sock_put(pending);
+
+		return err;
+	}
+
+	if (pkt->hdr.op != VIRTIO_VSOCK_OP_REQUEST) {
+		virtio_transport_send_reset(vsk, pkt);
+		pr_debug("%s:op != OP_REQUEST op = %d\n",
+			 __func__, pkt->hdr.op);
+		return -EINVAL;
+	}
+
+	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
+		pr_debug("%s: sk->sk_ack_backlog=%d\n", __func__,
+			 sk->sk_ack_backlog);
+		virtio_transport_send_reset(vsk, pkt);
+		return -ECONNREFUSED;
+	}
+
+	/* So no pending socket are responsible for this pkt, create one */
+	pending = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
+				 sk->sk_type);
+	if (!pending) {
+		virtio_transport_send_reset(vsk, pkt);
+		return -ENOMEM;
+	}
+	pr_debug("virtio_transport_recv_listen: create pending\n");
+
+	vpending = vsock_sk(pending);
+	vsock_addr_init(&vpending->local_addr, pkt->hdr.dst_cid,
+			pkt->hdr.dst_port);
+	vsock_addr_init(&vpending->remote_addr, pkt->hdr.src_cid,
+			pkt->hdr.src_port);
+
+	vsock_add_pending(sk, pending);
+
+	err = virtio_transport_send_negotiate(vpending, pkt);
+	if (err < 0) {
+		virtio_transport_send_reset(vsk, pkt);
+		sock_put(pending);
+		return err;
+	}
+
+	sk->sk_ack_backlog++;
+
+	pending->sk_state = SS_CONNECTING;
+
+	/* Clean up in case no further message is received for this socket */
+	vpending->listener = sk;
+	sock_hold(sk);
+	sock_hold(pending);
+	INIT_DELAYED_WORK(&vpending->dwork, vsock_pending_work);
+	schedule_delayed_work(&vpending->dwork, HZ);
+
+	return 0;
+}
+
+void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
+{
+	struct virtio_transport *trans;
+	struct sockaddr_vm src, dst;
+	struct vsock_sock *vsk;
+	struct sock *sk;
+
+	vsock_addr_init(&src, pkt->hdr.src_cid, pkt->hdr.src_port);
+	vsock_addr_init(&dst, pkt->hdr.dst_cid, pkt->hdr.dst_port);
+
+	virtio_vsock_dumppkt(__func__, pkt);
+
+	if (pkt->hdr.type == SOCK_DGRAM) {
+		sk = vsock_find_unbound_socket(&dst);
+		if (!sk)
+			goto free_pkt;
+		return virtio_transport_recv_dgram(sk, pkt);
+	}
+
+	/* The socket must be in connected or bound table
+	 * otherwise send reset back
+	 */
+	sk = vsock_find_connected_socket(&src, &dst);
+	if (!sk) {
+		sk = vsock_find_bound_socket(&dst);
+		if (!sk) {
+			pr_debug("%s: can not find bound_socket\n", __func__);
+			virtio_vsock_dumppkt(__func__, pkt);
+			/* Ignore this pkt instead of sending reset back */
+			goto free_pkt;
+		}
+	}
+
+	vsk = vsock_sk(sk);
+	trans = vsk->trans;
+	BUG_ON(!trans);
+
+	mutex_lock(&trans->tx_lock);
+	trans->peer_buf_alloc = pkt->hdr.buf_alloc;
+	trans->peer_fwd_cnt = pkt->hdr.fwd_cnt;
+	if (__virtio_transport_stream_has_space(vsk))
+		sk->sk_write_space(sk);
+	mutex_unlock(&trans->tx_lock);
+
+	lock_sock(sk);
+	switch (sk->sk_state) {
+	case SS_LISTEN:
+		virtio_transport_recv_listen(sk, pkt);
+		virtio_transport_free_pkt(pkt);
+		break;
+	case SS_CONNECTING:
+		virtio_transport_recv_connecting(sk, pkt);
+		virtio_transport_free_pkt(pkt);
+		break;
+	case SS_CONNECTED:
+		virtio_transport_recv_connected(sk, pkt);
+		break;
+	default:
+		break;
+	}
+	release_sock(sk);
+
+	/* Release refcnt obtained when we fetched this socket out of the
+	 * bound or connected list.
+	 */
+	sock_put(sk);
+	return;
+
+free_pkt:
+	virtio_transport_free_pkt(pkt);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
+
+void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
+{
+	kfree(pkt->buf);
+	kfree(pkt);
+}
+EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
+
+static int __init virtio_vsock_common_init(void)
+{
+	return 0;
+}
+
+static void __exit virtio_vsock_common_exit(void)
+{
+}
+
+module_init(virtio_vsock_common_init);
+module_exit(virtio_vsock_common_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("common code for virtio vsock");
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC 3/5] VSOCK: Introduce virtio-vsock.ko
  2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
  2013-06-27  8:00 ` [RFC 1/5] VSOCK: Introduce vsock_find_unbound_socket and vsock_bind_dgram_generic Asias He
  2013-06-27  8:00 ` [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko Asias He
@ 2013-06-27  8:00 ` Asias He
  2013-06-27  8:00 ` [RFC 4/5] VSOCK: Introduce vhost-vsock.ko Asias He
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-27  8:00 UTC (permalink / raw)
  To: netdev, kvm, virtualization
  Cc: Andy King, Michael S. Tsirkin, Reilly Grant, Pekka Enberg,
	Sasha Levin, David S. Miller, Dmitry Torokhov

VM sockets virtio transport implementation. This module runs in the guest
kernel.

Signed-off-by: Asias He <asias@redhat.com>
---
 net/vmw_vsock/virtio_transport.c | 424 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 424 insertions(+)
 create mode 100644 net/vmw_vsock/virtio_transport.c

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
new file mode 100644
index 0000000..f4323aa
--- /dev/null
+++ b/net/vmw_vsock/virtio_transport.c
@@ -0,0 +1,424 @@
+/*
+ * virtio transport for vsock
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *
+ * Some of the code is taken from Gerd Hoffmann <kraxel@redhat.com>'s
+ * early virtio-vsock proof-of-concept bits.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_vsock.h>
+#include <net/sock.h>
+#include <linux/mutex.h>
+#include "af_vsock.h"
+
+static struct workqueue_struct *virtio_vsock_workqueue;
+static struct virtio_vsock *the_virtio_vsock;
+static void virtio_vsock_rx_fill(struct virtio_vsock *vsock);
+
+struct virtio_vsock {
+	/* Virtio device */
+	struct virtio_device *vdev;
+	/* Virtio virtqueue */
+	struct virtqueue *vqs[VSOCK_VQ_MAX];
+	/* Wait queue for send pkt */
+	wait_queue_head_t queue_wait;
+	/* Work item to send pkt */
+	struct work_struct tx_work;
+	/* Work item to recv pkt */
+	struct work_struct rx_work;
+	/* Mutex to protect send pkt*/
+	struct mutex tx_lock;
+	/* Mutex to protect recv pkt*/
+	struct mutex rx_lock;
+	/* Number of recv buffers */
+	int rx_buf_nr;
+	/* Number of max recv buffers */
+	int rx_buf_max_nr;
+	/* Guest context id, just like guest ip address */
+	u32 guest_cid;
+};
+
+static struct virtio_vsock *virtio_vsock_get(void)
+{
+	return the_virtio_vsock;
+}
+
+static u32 virtio_transport_get_local_cid(void)
+{
+	struct virtio_vsock *vsock = virtio_vsock_get();
+
+	return vsock->guest_cid;
+}
+
+static int
+virtio_transport_send_pkt(struct vsock_sock *vsk,
+			  struct virtio_vsock_pkt_info *info)
+{
+	u32 src_cid, src_port, dst_cid, dst_port;
+	int ret, in_sg = 0, out_sg = 0;
+	struct virtio_transport *trans;
+	struct virtio_vsock_pkt *pkt;
+	struct virtio_vsock *vsock;
+	struct scatterlist sg[2];
+	struct virtqueue *vq;
+	DEFINE_WAIT(wait);
+	u64 credit;
+
+	vsock = virtio_vsock_get();
+	if (!vsock)
+		return -ENODEV;
+
+	src_cid	= virtio_transport_get_local_cid();
+	src_port = vsk->local_addr.svm_port;
+	dst_cid	= vsk->remote_addr.svm_cid;
+	dst_port = vsk->remote_addr.svm_port;
+
+	trans = vsk->trans;
+	vq = vsock->vqs[VSOCK_VQ_TX];
+
+	if (info->type == SOCK_STREAM) {
+		credit = virtio_transport_get_credit(trans);
+		if (info->len > credit)
+			info->len = credit;
+	}
+	if (info->len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
+		info->len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+	/* Do not send zero length OP_RW pkt*/
+	if (info->len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+		return info->len;
+
+	pkt = virtio_transport_alloc_pkt(vsk, info, info->len,
+					 src_cid, src_port,
+					 dst_cid, dst_port);
+	if (!pkt)
+		return -ENOMEM;
+
+	pr_debug("%s:info->len= %d\n", __func__, info->len);
+
+	/* Will be released in virtio_transport_send_pkt_work */
+	sock_hold(&trans->vsk->sk);
+	virtio_transport_inc_tx_pkt(pkt);
+
+	/* Put pkt in the virtqueue */
+	sg_init_table(sg, ARRAY_SIZE(sg));
+	sg_set_buf(&sg[out_sg++], &pkt->hdr, sizeof(pkt->hdr));
+	if (info->iov && info->len > 0)
+		sg_set_buf(&sg[out_sg++], pkt->buf, pkt->len);
+
+	mutex_lock(&vsock->tx_lock);
+	while ((ret = virtqueue_add_buf(vq, sg, out_sg, in_sg, pkt,
+					GFP_KERNEL)) < 0) {
+		prepare_to_wait_exclusive(&vsock->queue_wait, &wait,
+					  TASK_UNINTERRUPTIBLE);
+
+		mutex_unlock(&vsock->tx_lock);
+		schedule();
+		mutex_lock(&vsock->tx_lock);
+
+		finish_wait(&vsock->queue_wait, &wait);
+	}
+	virtqueue_kick(vq);
+	mutex_unlock(&vsock->tx_lock);
+
+	return info->len;
+}
+
+static struct virtio_transport_pkt_ops virtio_ops = {
+	.send_pkt = virtio_transport_send_pkt,
+};
+
+static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
+{
+	int buf_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+	struct virtio_vsock_pkt *pkt;
+	struct scatterlist sg[2];
+	struct virtqueue *vq;
+	int ret;
+
+	vq = vsock->vqs[VSOCK_VQ_RX];
+
+	do {
+		pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+		if (!pkt) {
+			pr_debug("%s: fail to allocate pkt\n", __func__);
+			goto out;
+		}
+
+		/* TODO: use mergeable rx buffer */
+		pkt->buf = kmalloc(buf_len, GFP_KERNEL);
+		if (!pkt->buf) {
+			pr_debug("%s: fail to allocate pkt->buf\n", __func__);
+			goto err;
+		}
+
+		sg_init_table(sg, ARRAY_SIZE(sg));
+		sg_set_buf(&sg[0], &pkt->hdr, sizeof(pkt->hdr));
+		sg_set_buf(&sg[1], pkt->buf, buf_len);
+		ret = virtqueue_add_buf(vq, sg, 1, 1, pkt, GFP_KERNEL);
+		if (ret)
+			goto err;
+		vsock->rx_buf_nr++;
+	} while (vq->num_free);
+	if (vsock->rx_buf_nr > vsock->rx_buf_max_nr)
+		vsock->rx_buf_max_nr = vsock->rx_buf_nr;
+out:
+	virtqueue_kick(vq);
+	return;
+err:
+	virtqueue_kick(vq);
+	virtio_transport_free_pkt(pkt);
+	return;
+}
+
+static void virtio_transport_send_pkt_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, tx_work);
+	struct virtio_vsock_pkt *pkt;
+	struct virtqueue *vq;
+	unsigned int len;
+	struct sock *sk;
+
+	vq = vsock->vqs[VSOCK_VQ_TX];
+	mutex_lock(&vsock->tx_lock);
+	do {
+		virtqueue_disable_cb(vq);
+		while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
+			sk = &pkt->trans->vsk->sk;
+			virtio_transport_dec_tx_pkt(pkt);
+			/* Release refcnt taken in virtio_transport_send_pkt */
+			sock_put(sk);
+			virtio_transport_free_pkt(pkt);
+		}
+	} while (!virtqueue_enable_cb(vq));
+	mutex_unlock(&vsock->tx_lock);
+
+	wake_up(&vsock->queue_wait);
+}
+
+static void virtio_transport_recv_pkt_work(struct work_struct *work)
+{
+	struct virtio_vsock *vsock =
+		container_of(work, struct virtio_vsock, rx_work);
+	struct virtio_vsock_pkt *pkt;
+	struct virtqueue *vq;
+	unsigned int len;
+
+	vq = vsock->vqs[VSOCK_VQ_RX];
+	mutex_lock(&vsock->rx_lock);
+	do {
+		virtqueue_disable_cb(vq);
+		while ((pkt = virtqueue_get_buf(vq, &len)) != NULL) {
+			pkt->len = len;
+			virtio_transport_recv_pkt(pkt);
+			vsock->rx_buf_nr--;
+		}
+	} while (!virtqueue_enable_cb(vq));
+
+	if (vsock->rx_buf_nr < vsock->rx_buf_max_nr / 2)
+		virtio_vsock_rx_fill(vsock);
+	mutex_unlock(&vsock->rx_lock);
+}
+
+static void virtio_vsock_ctrl_done(struct virtqueue *vq)
+{
+}
+
+static void virtio_vsock_tx_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->tx_work);
+}
+
+static void virtio_vsock_rx_done(struct virtqueue *vq)
+{
+	struct virtio_vsock *vsock = vq->vdev->priv;
+
+	if (!vsock)
+		return;
+	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+}
+
+static int
+virtio_transport_socket_init(struct vsock_sock *vsk, struct vsock_sock *psk)
+{
+	struct virtio_transport *trans;
+	int ret;
+
+	ret = virtio_transport_do_socket_init(vsk, psk);
+	if (ret)
+		return ret;
+
+	trans = vsk->trans;
+	trans->ops = &virtio_ops;
+	return ret;
+}
+
+static struct vsock_transport virtio_transport = {
+	.get_local_cid            = virtio_transport_get_local_cid,
+
+	.init                     = virtio_transport_socket_init,
+	.destruct                 = virtio_transport_destruct,
+	.release                  = virtio_transport_release,
+	.connect                  = virtio_transport_connect,
+	.shutdown                 = virtio_transport_shutdown,
+
+	.dgram_bind               = virtio_transport_dgram_bind,
+	.dgram_dequeue            = virtio_transport_dgram_dequeue,
+	.dgram_enqueue            = virtio_transport_dgram_enqueue,
+	.dgram_allow              = virtio_transport_dgram_allow,
+
+	.stream_dequeue           = virtio_transport_stream_dequeue,
+	.stream_enqueue           = virtio_transport_stream_enqueue,
+	.stream_has_data          = virtio_transport_stream_has_data,
+	.stream_has_space         = virtio_transport_stream_has_space,
+	.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
+	.stream_is_active         = virtio_transport_stream_is_active,
+	.stream_allow             = virtio_transport_stream_allow,
+
+	.notify_poll_in           = virtio_transport_notify_poll_in,
+	.notify_poll_out          = virtio_transport_notify_poll_out,
+	.notify_recv_init         = virtio_transport_notify_recv_init,
+	.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
+	.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
+	.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
+	.notify_send_init         = virtio_transport_notify_send_init,
+	.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
+	.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
+	.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
+
+	.set_buffer_size          = virtio_transport_set_buffer_size,
+	.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
+	.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
+	.get_buffer_size          = virtio_transport_get_buffer_size,
+	.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
+	.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
+};
+
+static int virtio_vsock_probe(struct virtio_device *vdev)
+{
+	vq_callback_t *callbacks[] = {
+		virtio_vsock_ctrl_done,
+		virtio_vsock_rx_done,
+		virtio_vsock_tx_done,
+	};
+	const char *names[] = {
+		"ctrl",
+		"rx",
+		"tx",
+	};
+	struct virtio_vsock *vsock;
+	int ret = -ENOMEM;
+	u32 guest_cid;
+
+	/* Only one virtio-vsock device per guest is supported */
+	/* FIXME: is a lock needed for the_virtio_vsock? */
+	if (the_virtio_vsock)
+		return -EBUSY;
+
+	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL);
+	if (!vsock)
+		return -ENOMEM;
+
+	vsock->vdev = vdev;
+
+	ret = vsock->vdev->config->find_vqs(vsock->vdev, VSOCK_VQ_MAX,
+					    vsock->vqs, callbacks, names);
+	if (ret < 0)
+		goto out;
+
+	vdev->config->get(vdev, offsetof(struct virtio_vsock_config, guest_cid),
+			  &guest_cid, sizeof(guest_cid));
+	vsock->guest_cid = guest_cid;
+	pr_debug("%s:guest_cid=%d\n", __func__, guest_cid);
+
+	ret = vsock_core_init(&virtio_transport);
+	if (ret < 0)
+		goto out_vqs;
+
+	vsock->rx_buf_nr = 0;
+	vsock->rx_buf_max_nr = 0;
+
+	vdev->priv = the_virtio_vsock = vsock;
+	init_waitqueue_head(&vsock->queue_wait);
+	mutex_init(&vsock->tx_lock);
+	mutex_init(&vsock->rx_lock);
+	INIT_WORK(&vsock->rx_work, virtio_transport_recv_pkt_work);
+	INIT_WORK(&vsock->tx_work, virtio_transport_send_pkt_work);
+
+	mutex_lock(&vsock->rx_lock);
+	virtio_vsock_rx_fill(vsock);
+	mutex_unlock(&vsock->rx_lock);
+	return 0;
+
+out_vqs:
+	vsock->vdev->config->del_vqs(vsock->vdev);
+out:
+	kfree(vsock);
+	return ret;
+}
+
+static void virtio_vsock_remove(struct virtio_device *vdev)
+{
+	struct virtio_vsock *vsock = vdev->priv;
+
+	the_virtio_vsock = NULL;
+	vsock_core_exit();
+	vdev->config->del_vqs(vdev);
+	kfree(vsock);
+}
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_VSOCK, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static unsigned int features[] = {
+};
+
+static struct virtio_driver virtio_vsock_driver = {
+	.feature_table = features,
+	.feature_table_size = ARRAY_SIZE(features),
+	.driver.name = KBUILD_MODNAME,
+	.driver.owner = THIS_MODULE,
+	.id_table = id_table,
+	.probe = virtio_vsock_probe,
+	.remove = virtio_vsock_remove,
+};
+
+static int __init virtio_vsock_init(void)
+{
+	int ret;
+
+	virtio_vsock_workqueue = alloc_workqueue("virtio_vsock", 0, 0);
+	if (!virtio_vsock_workqueue)
+		return -ENOMEM;
+	ret = register_virtio_driver(&virtio_vsock_driver);
+	if (ret)
+		destroy_workqueue(virtio_vsock_workqueue);
+	return ret;
+}
+
+static void __exit virtio_vsock_exit(void)
+{
+	unregister_virtio_driver(&virtio_vsock_driver);
+	destroy_workqueue(virtio_vsock_workqueue);
+}
+
+module_init(virtio_vsock_init);
+module_exit(virtio_vsock_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("virtio transport for vsock");
+MODULE_DEVICE_TABLE(virtio, id_table);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC 4/5] VSOCK: Introduce vhost-vsock.ko
  2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
                   ` (2 preceding siblings ...)
  2013-06-27  8:00 ` [RFC 3/5] VSOCK: Introduce virtio-vsock.ko Asias He
@ 2013-06-27  8:00 ` Asias He
  2013-06-27 10:42   ` Michael S. Tsirkin
  2013-06-27  8:00 ` [RFC 5/5] VSOCK: Add Makefile and Kconfig Asias He
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Asias He @ 2013-06-27  8:00 UTC (permalink / raw)
  To: netdev, kvm, virtualization
  Cc: Andy King, Michael S. Tsirkin, Reilly Grant, Pekka Enberg,
	Sasha Levin, David S. Miller, Dmitry Torokhov

VM Sockets vhost transport implementation. This module runs in the host
kernel.
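
For reference, userspace is expected to drive the new /dev/vhost-vsock
misc device roughly like this (illustrative sketch only; the real lkvm
changes may differ):

	/* VHOST_* ioctls come from <linux/vhost.h>,
	 * VHOST_VSOCK_SET_GUEST_CID from this patch. */
	int vhost_fd = open("/dev/vhost-vsock", O_RDWR);
	__u32 guest_cid = 3;

	ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
	ioctl(vhost_fd, VHOST_VSOCK_SET_GUEST_CID, &guest_cid);
	/* followed by the usual VHOST_SET_MEM_TABLE and
	 * VHOST_SET_VRING_* setup for the ctrl/rx/tx virtqueues */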

Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/vhost/vsock.c | 534 ++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vsock.h |   4 +
 2 files changed, 538 insertions(+)
 create mode 100644 drivers/vhost/vsock.c
 create mode 100644 drivers/vhost/vsock.h

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
new file mode 100644
index 0000000..cb54090
--- /dev/null
+++ b/drivers/vhost/vsock.c
@@ -0,0 +1,534 @@
+/*
+ * vhost transport for vsock
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <net/sock.h>
+#include <linux/virtio_vsock.h>
+#include <linux/vhost.h>
+
+#include "../../../net/vmw_vsock/af_vsock.h"
+#include "vhost.h"
+#include "vsock.h"
+
+#define VHOST_VSOCK_DEFAULT_HOST_CID	2
+
+static int vhost_transport_socket_init(struct vsock_sock *vsk,
+				       struct vsock_sock *psk);
+
+enum {
+	VHOST_VSOCK_FEATURES = VHOST_FEATURES,
+};
+
+/* Used to track all the vhost_vsock instances on the system. */
+static LIST_HEAD(vhost_vsock_list);
+static DEFINE_MUTEX(vhost_vsock_mutex);
+
+struct vhost_vsock_virtqueue {
+	struct vhost_virtqueue vq;
+};
+
+struct vhost_vsock {
+	/* Vhost device */
+	struct vhost_dev dev;
+	/* Vhost vsock virtqueue */
+	struct vhost_vsock_virtqueue vqs[VSOCK_VQ_MAX];
+	/* Link to global vhost_vsock_list */
+	struct list_head list;
+	/* Head for pkt from host to guest */
+	struct list_head send_pkt_list;
+	/* Work item to send pkt */
+	struct vhost_work send_pkt_work;
+	/* Guest context id this vhost_vsock instance handles */
+	u32 guest_cid;
+};
+
+static u32 vhost_transport_get_local_cid(void)
+{
+	u32 cid = VHOST_VSOCK_DEFAULT_HOST_CID;
+	return cid;
+}
+
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+{
+	struct vhost_vsock *vsock;
+
+	mutex_lock(&vhost_vsock_mutex);
+	list_for_each_entry(vsock, &vhost_vsock_list, list) {
+		if (vsock->guest_cid == guest_cid) {
+			mutex_unlock(&vhost_vsock_mutex);
+			return vsock;
+		}
+	}
+	mutex_unlock(&vhost_vsock_mutex);
+
+	return NULL;
+}
+
+static void
+vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
+			    struct vhost_virtqueue *vq)
+{
+	struct virtio_vsock_pkt *pkt;
+	unsigned out, in;
+	struct sock *sk;
+	int head, ret;
+
+	mutex_lock(&vq->mutex);
+	vhost_disable_notify(&vsock->dev, vq);
+	for (;;) {
+		if (list_empty(&vsock->send_pkt_list)) {
+			vhost_enable_notify(&vsock->dev, vq);
+			break;
+		}
+
+		head = vhost_get_vq_desc(&vsock->dev, vq, vq->iov,
+					ARRAY_SIZE(vq->iov), &out, &in,
+					NULL, NULL);
+		pr_debug("%s: head = %d\n", __func__, head);
+		if (head < 0)
+			break;
+
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+				vhost_disable_notify(&vsock->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		pkt = list_first_entry(&vsock->send_pkt_list,
+				       struct virtio_vsock_pkt, list);
+		list_del_init(&pkt->list);
+
+		/* FIXME: no assumption of frame layout */
+		ret = __copy_to_user(vq->iov[0].iov_base, &pkt->hdr,
+				     sizeof(pkt->hdr));
+		if (ret) {
+			virtio_transport_free_pkt(pkt);
+			vq_err(vq, "Faulted on copying pkt hdr\n");
+			break;
+		}
+		if (pkt->buf && pkt->len > 0) {
+			ret = __copy_to_user(vq->iov[1].iov_base, pkt->buf,
+					    pkt->len);
+			if (ret) {
+				virtio_transport_free_pkt(pkt);
+				vq_err(vq, "Faulted on copying pkt buf\n");
+				break;
+			}
+		}
+
+		vhost_add_used(vq, head, pkt->len);
+
+		virtio_transport_dec_tx_pkt(pkt);
+
+		sk = sk_vsock(pkt->trans->vsk);
+		/* Release refcnt taken in vhost_transport_send_pkt */
+		sock_put(sk);
+
+		virtio_transport_free_pkt(pkt);
+	}
+	vhost_signal(&vsock->dev, vq);
+	mutex_unlock(&vq->mutex);
+}
+
+static void vhost_transport_send_pkt_work(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq;
+	struct vhost_vsock *vsock;
+
+	vsock = container_of(work, struct vhost_vsock, send_pkt_work);
+	vq = &vsock->vqs[VSOCK_VQ_RX].vq;
+
+	vhost_transport_do_send_pkt(vsock, vq);
+}
+
+static int
+vhost_transport_send_pkt(struct vsock_sock *vsk,
+			 struct virtio_vsock_pkt_info *info)
+{
+	u32 src_cid, src_port, dst_cid, dst_port;
+	struct virtio_transport *trans;
+	struct virtio_vsock_pkt *pkt;
+	struct vhost_virtqueue *vq;
+	struct vhost_vsock *vsock;
+	u64 credit;
+
+	src_cid = vhost_transport_get_local_cid();
+	src_port = vsk->local_addr.svm_port;
+	dst_cid = vsk->remote_addr.svm_cid;
+	dst_port = vsk->remote_addr.svm_port;
+
+	/* Find the vhost_vsock according to guest context id  */
+	vsock = vhost_vsock_get(dst_cid);
+	if (!vsock)
+		return -ENODEV;
+
+	trans = vsk->trans;
+	vq = &vsock->vqs[VSOCK_VQ_RX].vq;
+
+	if (info->type == SOCK_STREAM) {
+		credit = virtio_transport_get_credit(trans);
+		if (info->len > credit)
+			info->len = credit;
+	}
+	if (info->len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
+		info->len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
+	/* Do not send zero length OP_RW pkt */
+	if (info->len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
+		return info->len;
+
+	pkt = virtio_transport_alloc_pkt(vsk, info, info->len,
+					 src_cid, src_port,
+					 dst_cid, dst_port);
+	if (!pkt)
+		return -ENOMEM;
+
+	pr_debug("%s:info->len= %d\n", __func__, info->len);
+	/* Released in vhost_transport_do_send_pkt */
+	sock_hold(&trans->vsk->sk);
+	virtio_transport_inc_tx_pkt(pkt);
+
+	/* queue it up in vhost work */
+	mutex_lock(&vq->mutex);
+	list_add_tail(&pkt->list, &vsock->send_pkt_list);
+	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+	mutex_unlock(&vq->mutex);
+
+	return info->len;
+}
+
+static struct virtio_transport_pkt_ops vhost_ops = {
+	.send_pkt = vhost_transport_send_pkt,
+};
+
+static struct virtio_vsock_pkt *
+vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq)
+{
+	struct virtio_vsock_pkt *pkt;
+	int ret;
+	int len;
+
+	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
+	if (!pkt)
+		return NULL;
+
+	len = sizeof(pkt->hdr);
+	if (unlikely(vq->iov[0].iov_len != len)) {
+		vq_err(vq, "Expected %d bytes for pkt->hdr, got %zu bytes\n",
+		       len, vq->iov[0].iov_len);
+		kfree(pkt);
+		return NULL;
+	}
+	ret = __copy_from_user(&pkt->hdr, vq->iov[0].iov_base, len);
+	if (ret) {
+		vq_err(vq, "Faulted on virtio_vsock_hdr\n");
+		kfree(pkt);
+		return NULL;
+	}
+
+	pkt->len = pkt->hdr.len;
+	pkt->off = 0;
+
+	/* No payload */
+	if (!pkt->len)
+		return pkt;
+
+	/* The pkt is too big */
+	if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
+		kfree(pkt);
+		return NULL;
+	}
+
+	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
+	if (!pkt->buf) {
+		kfree(pkt);
+		return NULL;
+	}
+
+	ret = __copy_from_user(pkt->buf, vq->iov[1].iov_base, pkt->len);
+	if (ret) {
+		vq_err(vq, "Faulted on copying pkt buf\n");
+		virtio_transport_free_pkt(pkt);
+		return NULL;
+	}
+
+	return pkt;
+}
+
+static void vhost_vsock_handle_ctl_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
+						 dev);
+
+	pr_debug("%s vq=%p, vsock=%p\n", __func__, vq, vsock);
+}
+
+static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
+						 dev);
+	struct virtio_vsock_pkt *pkt;
+	unsigned int out, in;
+	int head;
+	u32 len;
+
+	mutex_lock(&vq->mutex);
+	vhost_disable_notify(&vsock->dev, vq);
+	for (;;) {
+		head = vhost_get_vq_desc(&vsock->dev, vq, vq->iov,
+					ARRAY_SIZE(vq->iov), &out, &in,
+					NULL, NULL);
+		if (head < 0)
+			break;
+
+		if (head == vq->num) {
+			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
+				vhost_disable_notify(&vsock->dev, vq);
+				continue;
+			}
+			break;
+		}
+
+		pkt = vhost_vsock_alloc_pkt(vq);
+		if (!pkt) {
+			vq_err(vq, "Faulted on pkt\n");
+			continue;
+		}
+
+		len = pkt->len;
+		virtio_transport_recv_pkt(pkt);
+		vhost_add_used(vq, head, len);
+	}
+	vhost_signal(&vsock->dev, vq);
+	mutex_unlock(&vq->mutex);
+}
+
+static void vhost_vsock_handle_rx_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						poll.work);
+	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
+						 dev);
+
+	vhost_transport_do_send_pkt(vsock, vq);
+}
+
+static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
+{
+	struct vhost_virtqueue **vqs;
+	struct vhost_vsock *vsock;
+	int ret;
+
+	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL);
+	if (!vsock)
+		return -ENOMEM;
+
+	pr_debug("%s:vsock=%p\n", __func__, vsock);
+
+	vqs = kmalloc(VSOCK_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	vqs[VSOCK_VQ_CTRL] = &vsock->vqs[VSOCK_VQ_CTRL].vq;
+	vqs[VSOCK_VQ_TX] = &vsock->vqs[VSOCK_VQ_TX].vq;
+	vqs[VSOCK_VQ_RX] = &vsock->vqs[VSOCK_VQ_RX].vq;
+	vsock->vqs[VSOCK_VQ_CTRL].vq.handle_kick = vhost_vsock_handle_ctl_kick;
+	vsock->vqs[VSOCK_VQ_TX].vq.handle_kick = vhost_vsock_handle_tx_kick;
+	vsock->vqs[VSOCK_VQ_RX].vq.handle_kick = vhost_vsock_handle_rx_kick;
+
+	ret = vhost_dev_init(&vsock->dev, vqs, VSOCK_VQ_MAX);
+	if (ret < 0)
+		goto out_vqs;
+
+	file->private_data = vsock;
+	INIT_LIST_HEAD(&vsock->send_pkt_list);
+	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
+
+	mutex_lock(&vhost_vsock_mutex);
+	list_add_tail(&vsock->list, &vhost_vsock_list);
+	mutex_unlock(&vhost_vsock_mutex);
+	return ret;
+
+out_vqs:
+	kfree(vqs);
+out:
+	kfree(vsock);
+	return ret;
+}
+
+static void vhost_vsock_flush(struct vhost_vsock *vsock)
+{
+	int i;
+
+	for (i = 0; i < VSOCK_VQ_MAX; i++)
+		vhost_poll_flush(&vsock->vqs[i].vq.poll);
+	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
+}
+
+static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
+{
+	struct vhost_vsock *vsock = file->private_data;
+
+	mutex_lock(&vhost_vsock_mutex);
+	list_del(&vsock->list);
+	mutex_unlock(&vhost_vsock_mutex);
+
+	vhost_dev_stop(&vsock->dev);
+	vhost_vsock_flush(vsock);
+	vhost_dev_cleanup(&vsock->dev, false);
+	kfree(vsock->dev.vqs);
+	kfree(vsock);
+	return 0;
+}
+
+static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u32 guest_cid)
+{
+	mutex_lock(&vhost_vsock_mutex);
+	vsock->guest_cid = guest_cid;
+	pr_debug("%s:guest_cid=%d\n", __func__, guest_cid);
+	mutex_unlock(&vhost_vsock_mutex);
+
+	return 0;
+}
+
+static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
+				  unsigned long arg)
+{
+	struct vhost_vsock *vsock = f->private_data;
+	void __user *argp = (void __user *)arg;
+	u64 __user *featurep = argp;
+	u32 __user *cidp = argp;
+	u32 guest_cid;
+	u64 features;
+	int r;
+
+	switch (ioctl) {
+	case VHOST_VSOCK_SET_GUEST_CID:
+		if (get_user(guest_cid, cidp))
+			return -EFAULT;
+		return vhost_vsock_set_cid(vsock, guest_cid);
+	case VHOST_GET_FEATURES:
+		features = VHOST_VSOCK_FEATURES;
+		if (copy_to_user(featurep, &features, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_FEATURES:
+		if (copy_from_user(&features, featurep, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	default:
+		mutex_lock(&vsock->dev.mutex);
+		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
+		if (r == -ENOIOCTLCMD)
+			r = vhost_vring_ioctl(&vsock->dev, ioctl, argp);
+		else
+			vhost_vsock_flush(vsock);
+		mutex_unlock(&vsock->dev.mutex);
+		return r;
+	}
+}
+
+static const struct file_operations vhost_vsock_fops = {
+	.owner          = THIS_MODULE,
+	.open           = vhost_vsock_dev_open,
+	.release        = vhost_vsock_dev_release,
+	.llseek		= noop_llseek,
+	.unlocked_ioctl = vhost_vsock_dev_ioctl,
+};
+
+static struct miscdevice vhost_vsock_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "vhost-vsock",
+	.fops = &vhost_vsock_fops,
+};
+
+static int
+vhost_transport_socket_init(struct vsock_sock *vsk, struct vsock_sock *psk)
+{
+	struct virtio_transport *trans;
+	int ret;
+
+	ret = virtio_transport_do_socket_init(vsk, psk);
+	if (ret)
+		return ret;
+
+	trans = vsk->trans;
+	trans->ops = &vhost_ops;
+
+	return ret;
+}
+
+static struct vsock_transport vhost_transport = {
+	.get_local_cid            = vhost_transport_get_local_cid,
+
+	.init                     = vhost_transport_socket_init,
+	.destruct                 = virtio_transport_destruct,
+	.release                  = virtio_transport_release,
+	.connect                  = virtio_transport_connect,
+	.shutdown                 = virtio_transport_shutdown,
+
+	.dgram_enqueue            = virtio_transport_dgram_enqueue,
+	.dgram_dequeue            = virtio_transport_dgram_dequeue,
+	.dgram_bind               = virtio_transport_dgram_bind,
+	.dgram_allow              = virtio_transport_dgram_allow,
+
+	.stream_enqueue           = virtio_transport_stream_enqueue,
+	.stream_dequeue           = virtio_transport_stream_dequeue,
+	.stream_has_data          = virtio_transport_stream_has_data,
+	.stream_has_space         = virtio_transport_stream_has_space,
+	.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
+	.stream_is_active         = virtio_transport_stream_is_active,
+	.stream_allow             = virtio_transport_stream_allow,
+
+	.notify_poll_in           = virtio_transport_notify_poll_in,
+	.notify_poll_out          = virtio_transport_notify_poll_out,
+	.notify_recv_init         = virtio_transport_notify_recv_init,
+	.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
+	.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
+	.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
+	.notify_send_init         = virtio_transport_notify_send_init,
+	.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
+	.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
+	.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
+
+	.set_buffer_size          = virtio_transport_set_buffer_size,
+	.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
+	.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
+	.get_buffer_size          = virtio_transport_get_buffer_size,
+	.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
+	.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
+};
+
+static int __init vhost_vsock_init(void)
+{
+	int ret;
+
+	ret = vsock_core_init(&vhost_transport);
+	if (ret < 0)
+		return ret;
+	return misc_register(&vhost_vsock_misc);
+}
+
+static void __exit vhost_vsock_exit(void)
+{
+	misc_deregister(&vhost_vsock_misc);
+	vsock_core_exit();
+}
+
+module_init(vhost_vsock_init);
+module_exit(vhost_vsock_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("vhost transport for vsock");
diff --git a/drivers/vhost/vsock.h b/drivers/vhost/vsock.h
new file mode 100644
index 0000000..0ddb107
--- /dev/null
+++ b/drivers/vhost/vsock.h
@@ -0,0 +1,4 @@
+#ifndef VHOST_VSOCK_H
+#define VHOST_VSOCK_H
+#define VHOST_VSOCK_SET_GUEST_CID _IOW(VHOST_VIRTIO, 0x60, __u32)
+#endif
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC 5/5] VSOCK: Add Makefile and Kconfig
  2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
                   ` (3 preceding siblings ...)
  2013-06-27  8:00 ` [RFC 4/5] VSOCK: Introduce vhost-vsock.ko Asias He
@ 2013-06-27  8:00 ` Asias He
  2013-06-27 10:23 ` [RFC 0/5] Introduce VM Sockets virtio transport Michael S. Tsirkin
  2013-06-27 19:03 ` Sasha Levin
  6 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-27  8:00 UTC (permalink / raw)
  To: netdev, kvm, virtualization
  Cc: Andy King, Michael S. Tsirkin, Reilly Grant, Pekka Enberg,
	Sasha Levin, David S. Miller, Dmitry Torokhov

Enable virtio-vsock and vhost-vsock.

Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/vhost/Kconfig       |  4 ++++
 drivers/vhost/Kconfig.vsock |  7 +++++++
 drivers/vhost/Makefile      |  5 +++++
 net/vmw_vsock/Kconfig       | 18 ++++++++++++++++++
 net/vmw_vsock/Makefile      |  4 ++++
 5 files changed, 38 insertions(+)
 create mode 100644 drivers/vhost/Kconfig.vsock

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 017a1e8..169fb19 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -32,3 +32,7 @@ config VHOST
 	---help---
 	  This option is selected by any driver which needs to access
 	  the core of vhost.
+
+if STAGING
+source "drivers/vhost/Kconfig.vsock"
+endif
diff --git a/drivers/vhost/Kconfig.vsock b/drivers/vhost/Kconfig.vsock
new file mode 100644
index 0000000..3491865
--- /dev/null
+++ b/drivers/vhost/Kconfig.vsock
@@ -0,0 +1,7 @@
+config VHOST_VSOCK
+	tristate "vhost virtio-vsock driver"
+	depends on VSOCKETS && EVENTFD
+	select VIRTIO_VSOCKETS_COMMON
+	default n
+	---help---
+	Say M here to enable vhost-vsock, the host-side support for virtio-vsock guests
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index e0441c3..ddf87cb 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -4,5 +4,10 @@ vhost_net-y := net.o
 obj-$(CONFIG_VHOST_SCSI) += vhost_scsi.o
 vhost_scsi-y := scsi.o
 
+obj-$(CONFIG_VHOST_VSOCK) += vhost_vsock.o
+vhost_vsock-y := vsock.o
+#CFLAGS_vsock.o := -DDEBUG
+
 obj-$(CONFIG_VHOST_RING) += vringh.o
+
 obj-$(CONFIG_VHOST)	+= vhost.o
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index b5fa7e4..c2b6d6f 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -26,3 +26,21 @@ config VMWARE_VMCI_VSOCKETS
 
 	  To compile this driver as a module, choose M here: the module
 	  will be called vmw_vsock_vmci_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS
+	tristate "virtio transport for Virtual Sockets"
+	depends on VSOCKETS && VIRTIO
+	select VIRTIO_VSOCKETS_COMMON
+	help
+	  This module implements a virtio transport for Virtual Sockets.
+
+	  Enable this transport if your Virtual Machine runs on Qemu/KVM.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called virtio_vsock_transport. If unsure, say N.
+
+config VIRTIO_VSOCKETS_COMMON
+       tristate
+       ---help---
+         This option is selected by any driver which needs to access
+         the virtio-vsock common code.
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 2ce52d7..bc37e59 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,5 +1,9 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS) += virtio_transport.o
+obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += virtio_transport_common.o
+#CFLAGS_virtio_transport.o := -DDEBUG
+#CFLAGS_virtio_transport_common.o := -DDEBUG
 
 vsock-y += af_vsock.o vsock_addr.o
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC 0/5] Introduce VM Sockets virtio transport
  2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
                   ` (4 preceding siblings ...)
  2013-06-27  8:00 ` [RFC 5/5] VSOCK: Add Makefile and Kconfig Asias He
@ 2013-06-27 10:23 ` Michael S. Tsirkin
  2013-06-28  2:25   ` Andy King
  2013-06-28  6:12   ` Asias He
  2013-06-27 19:03 ` Sasha Levin
  6 siblings, 2 replies; 21+ messages in thread
From: Michael S. Tsirkin @ 2013-06-27 10:23 UTC (permalink / raw)
  To: Asias He
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Sasha Levin, David S. Miller

On Thu, Jun 27, 2013 at 03:59:59PM +0800, Asias He wrote:
> Hello guys,
> 
> In commit d021c344051af91 (VSOCK: Introduce VM Sockets), VMware added VM
> Sockets support. VM Sockets allows communication between virtual
> machines and the hypervisor. VM Sockets is able to use different
> hypervisor neutral transport to transfer data. Currently, only VMware
> VMCI transport is supported. 
> 
> This series introduces virtio transport for VM Sockets.
> 
> Any comments are appreciated! Thanks!
> 
> Code:
> =========================
> 1) kernel bits
>    git://github.com/asias/linux.git vsock
> 
> 2) userspace bits:
>    git://github.com/asias/linux-kvm.git vsock
> 
> Howto:
> =========================
> Make sure you have these kernel options:
> 
>   CONFIG_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS=y
>   CONFIG_VIRTIO_VSOCKETS_COMMON=y
>   CONFIG_VHOST_VSOCK=m
> 
> $ git clone git://github.com/asias/linux-kvm.git
> $ cd linux-kvm/tools/kvm
> $ git checkout -b vsock origin/vsock
> $ make
> $ modprobe vhost_vsock
> $ ./lkvm run -d os.img -k bzImage --vsock guest_cid
> 
> Test:
> =========================
> I hacked busybox's http server and wget to run over vsock. Start http
> server in host and guest, download a 512MB file in guest and host
> simultaneously 6000 times. Managed to run the http stress test.
> 
> Also, I wrote a small libvsock.so to play the LD_PRELOAD trick and
> managed to make sshd and ssh work over virtio-vsock without modifying
> the source code.
> 
> Draft VM Sockets Virtio Device spec:
> =========================
> Appendix K: VM Sockets Device
> 
> The virtio VM sockets device is a virtio transport device for VM Sockets. VM
> Sockets allows communication between virtual machines and the hypervisor.
> 
> Configuration:
> 
> Subsystem Device ID 13
> 
> Virtqueues:
>     0:controlq; 1:receiveq0; 2:transmitq0 ... 2N+1:receiveqN; 2N+2:transmitqN
> 
> Feature bits:
>     Currently, no feature bits are defined.
> 
> Device configuration layout:
> 
> Two configuration fields are currently defined.
> 
>    struct virtio_vsock_config {

which fields are RW,RO,WO?

>            __u32 guest_cid;

Given that cid is like an IP address, 32 bit seems too
limiting. I would go for a 64 bit one or maybe even 128 bit,
so that e.g. GUIDs can be used there.


>            __u32 max_virtqueue_pairs;

I'd make this little endian.

>    } __packed;


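Purely for illustration, a layout along the lines of the two comments
above (wider CID, explicit endianness) might be:

   struct virtio_vsock_config {
           __le64 guest_cid;              /* presumably read-only for the driver */
           __le32 max_virtqueue_pairs;    /* presumably read-only for the driver */
   };
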
> 
> The guest_cid field specifies the guest context id which is like the guest IP
> address. The max_virtqueue_pairs field specifies the maximum number of receive
> and transmit virtqueue pairs (receiveq0 ...  receiveqN and transmitq0 ...
> transmitqN respectively; N = max_virtqueue_pairs - 1 ) that can be configured.
> The driver is free to use only one virtqueue pair, or it can use more to
> achieve better performance.

Don't we need a field for the driver to specify the # of VQs?

I note packets have no sequence numbers.
This means that a given stream should only use
a single VQ in each direction, correct?
Maybe make this explicit.
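
If multiple queue pairs stay, one way to make the "one stream, one
queue" rule concrete is to pin every connection to a queue pair by
hashing its tuple, e.g. (rough sketch, not in the patch):

   #include <linux/jhash.h>

   static u16 vsock_pick_queue_pair(u32 dst_cid, u32 src_port,
                                    u32 dst_port, u16 num_pairs)
   {
           /* The same tuple always maps to the same queue pair, so
            * packets of one stream are never reordered across queues. */
           return jhash_3words(dst_cid, src_port, dst_port, 0) % num_pairs;
   }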

> 
> Device Initialization:
> The initialization routine should discover the device's virtqueues.
> 
> Device Operation:
> Packets are transmitted by placing them in the transmitq0..transmitqN, and
> buffers for incoming packets are placed in the receiveq0..receiveqN. In each
> case, the packet itself is preceded by a header:
> 
>    struct virtio_vsock_hdr {

Let's make the header explicitly little endian and avoid the
heartburn we have with many other transports.

>            __u32   src_cid;
>            __u32   src_port;
>            __u32   dst_cid;
>            __u32   dst_port;

Ports are 32 bit? I guess most applications can't work with >16 bit.

Also, why put cid's in all packets? They are only needed
when you create a connection, no? Afterwards port numbers
can be used.

>            __u32   len;
>            __u8    type;
>            __u8    op;
>            __u8    shut;

Please add padding to align all fields naturally.

>            __u64   fwd_cnt;
>            __u64   buf_alloc;

Is a 64 bit counter really needed? 64 bit math
has portability limitations and performance overhead on many
architectures.

>    } __packed;

Packing produces terrible code in many compilers.
Please avoid packed structures on data path, instead,
pad structures explicitly to align all fields naturally.
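
Purely as an illustration of the points above (explicit little endian,
natural alignment, no __packed), the header could be laid out as
follows, with field widths kept as in the draft:

   struct virtio_vsock_hdr {
           __le32  src_cid;
           __le32  src_port;
           __le32  dst_cid;
           __le32  dst_port;
           __le32  len;
           __u8    type;
           __u8    op;
           __u8    shut;
           __u8    pad;        /* explicit padding keeps the 64-bit
                                * counters below 8-byte aligned */
           __le64  fwd_cnt;
           __le64  buf_alloc;
   };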

> 
> src_cid and dst_cid: specify the source and destination context id.
> src_port and dst_port: specify the source and destination port.
> len: specifies the size of the data payload, it could be zero if no data
> payload is transferred.
> type: specifies the type of the packet, it can be SOCK_STREAM or SOCK_DGRAM.
> op: specifies the operation of the packet, it is defined as follows.
> 
>    enum {
>            VIRTIO_VSOCK_OP_INVALID = 0,
>            VIRTIO_VSOCK_OP_REQUEST = 1,
>            VIRTIO_VSOCK_OP_NEGOTIATE = 2,
>            VIRTIO_VSOCK_OP_OFFER = 3,
>            VIRTIO_VSOCK_OP_ATTACH = 4,
>            VIRTIO_VSOCK_OP_RW = 5,
>            VIRTIO_VSOCK_OP_CREDIT = 6,
>            VIRTIO_VSOCK_OP_RST = 7,
>            VIRTIO_VSOCK_OP_SHUTDOWN = 8,
>    };
> 
> shut: specifies the shutdown mode when the socket is being shutdown. 1 is for
> receive shutdown, 2 is for transmit shutdown, 3 is for both receive and transmit
> shutdown.

It's only used with VIRTIO_VSOCK_OP_SHUTDOWN - how about a generic
flags field that is interpreted depending on op?

> fwd_cnt: specifies the number of bytes the receiver has forwarded to userspace.
> buf_alloc: specifies the size of the receiver's receive buffer in bytes.
> 
> Virtio VM socket connection creation:
> 1) Client sends VIRTIO_VSOCK_OP_REQUEST to server
> 2) Server responds with VIRTIO_VSOCK_OP_NEGOTIATE to client
> 3) Client sends VIRTIO_VSOCK_OP_OFFER to server
> 4) Server responds with VIRTIO_VSOCK_OP_ATTACH to client

What's the reason for a 4 stage setup? Most transports
make do with 3.
Also, at what stage can each side get/transmit packets?
What happens in case of errors at each stage?
Don't we want to distinguish between errors?
(E.g. wrong cid, no app listening on port, etc)?

This also appears to be vulnerable to a variant of
a syn flood attack (guest attacking host).
I think we need a cookie hash in there to prevent this.
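
For instance (a rough sketch of the cookie idea only, nothing like this
is in the series), the listener could answer VIRTIO_VSOCK_OP_REQUEST
statelessly with a cookie derived from the tuple and a secret, and only
allocate the pending socket once the peer echoes a valid cookie back:

   #include <linux/jhash.h>
   #include <linux/random.h>

   static u32 vsock_cookie_secret;  /* filled with get_random_bytes() at init */

   static u32 vsock_conn_cookie(u32 src_cid, u32 src_port,
                                u32 dst_cid, u32 dst_port)
   {
           return jhash_3words(src_cid ^ dst_cid, src_port, dst_port,
                               vsock_cookie_secret);
   }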


Can you describe connection teardown please?
I see there are RST and SHUTDOWN messages.
What rules do they obey?


> 
> Virtio VM socket credit update:
> Virtio VM socket uses credit-based flow control. The sender maintains tx_cnt
> which counts the total number of bytes it has sent out, peer_fwd_cnt which
> counts the total number of bytes the receiver has forwarded, and peer_buf_alloc
> which is the size of the receiver's receive buffer. The sender can send no more
> than the credit the receiver gives to the sender: credit = peer_buf_alloc -
> (tx_cnt - peer_fwd_cnt). The receiver can send VIRTIO_VSOCK_OP_CREDIT packet to
> tell sender its current fwd_cnt and buf_alloc value explicitly. However, as an
> optimization, the fwd_cnt and buf_alloc is always included in the packet header
> virtio_vsock_hdr.
> 
> The guest driver should make the receive virtqueue as fully populated as
> possible: if it runs out, the performance will suffer.
> 
> The controlq is used to control device. Currently, no control operation is
> defined.
> 
> Asias He (5):
>   VSOCK: Introduce vsock_find_unbound_socket and
>     vsock_bind_dgram_generic
>   VSOCK: Introduce virtio-vsock-common.ko
>   VSOCK: Introduce virtio-vsock.ko
>   VSOCK: Introduce vhost-vsock.ko
>   VSOCK: Add Makefile and Kconfig
> 
>  drivers/vhost/Kconfig                   |   4 +
>  drivers/vhost/Kconfig.vsock             |   7 +
>  drivers/vhost/Makefile                  |   5 +
>  drivers/vhost/vsock.c                   | 534 +++++++++++++++++
>  drivers/vhost/vsock.h                   |   4 +
>  include/linux/virtio_vsock.h            | 200 +++++++
>  include/uapi/linux/virtio_ids.h         |   1 +
>  include/uapi/linux/virtio_vsock.h       |  70 +++
>  net/vmw_vsock/Kconfig                   |  18 +
>  net/vmw_vsock/Makefile                  |   4 +
>  net/vmw_vsock/af_vsock.c                |  70 +++
>  net/vmw_vsock/af_vsock.h                |   2 +
>  net/vmw_vsock/virtio_transport.c        | 424 ++++++++++++++
>  net/vmw_vsock/virtio_transport_common.c | 992 ++++++++++++++++++++++++++++++++
>  14 files changed, 2335 insertions(+)
>  create mode 100644 drivers/vhost/Kconfig.vsock
>  create mode 100644 drivers/vhost/vsock.c
>  create mode 100644 drivers/vhost/vsock.h
>  create mode 100644 include/linux/virtio_vsock.h
>  create mode 100644 include/uapi/linux/virtio_vsock.h
>  create mode 100644 net/vmw_vsock/virtio_transport.c
>  create mode 100644 net/vmw_vsock/virtio_transport_common.c
> 
> -- 
> 1.8.1.4

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko
  2013-06-27  8:00 ` [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko Asias He
@ 2013-06-27 10:34   ` Michael S. Tsirkin
  2013-06-28  6:28     ` Asias He
  2013-06-29  4:32   ` David Miller
  2013-06-29  4:32   ` David Miller
  2 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2013-06-27 10:34 UTC (permalink / raw)
  To: Asias He
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Sasha Levin, David S. Miller

On Thu, Jun 27, 2013 at 04:00:01PM +0800, Asias He wrote:
> This module contains the common code and header files for the following
> virtio-vsock and virtio-vhost kernel modules.
> 
> Signed-off-by: Asias He <asias@redhat.com>
> ---
>  include/linux/virtio_vsock.h            | 200 +++++++
>  include/uapi/linux/virtio_ids.h         |   1 +
>  include/uapi/linux/virtio_vsock.h       |  70 +++
>  net/vmw_vsock/virtio_transport_common.c | 992 ++++++++++++++++++++++++++++++++
>  4 files changed, 1263 insertions(+)
>  create mode 100644 include/linux/virtio_vsock.h
>  create mode 100644 include/uapi/linux/virtio_vsock.h
>  create mode 100644 net/vmw_vsock/virtio_transport_common.c
> 
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> new file mode 100644
> index 0000000..cd8ed95
> --- /dev/null
> +++ b/include/linux/virtio_vsock.h
> @@ -0,0 +1,200 @@
> +/*
> + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> + * anyone can use the definitions to implement compatible drivers/servers:
> + *
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * Copyright (C) Red Hat, Inc., 2013
> + * Copyright (C) Asias He <asias@redhat.com>, 2013
> + */
> +
> +#ifndef _LINUX_VIRTIO_VSOCK_H
> +#define _LINUX_VIRTIO_VSOCK_H
> +
> +#include <uapi/linux/virtio_vsock.h>
> +#include <linux/socket.h>
> +#include <net/sock.h>
> +
> +#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE	128
> +#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE		(1024 * 256)
> +#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE	(1024 * 256)
> +#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
> +#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
> +
> +struct vsock_transport_recv_notify_data;
> +struct vsock_transport_send_notify_data;
> +struct sockaddr_vm;
> +struct vsock_sock;
> +
> +enum {
> +	VSOCK_VQ_CTRL	= 0,
> +	VSOCK_VQ_RX	= 1, /* for host to guest data */
> +	VSOCK_VQ_TX	= 2, /* for guest to host data */
> +	VSOCK_VQ_MAX	= 3,
> +};
> +
> +/* virtio transport socket state */
> +struct virtio_transport {
> +	struct virtio_transport_pkt_ops	*ops;
> +	struct vsock_sock *vsk;
> +
> +	u64 buf_size;
> +	u64 buf_size_min;
> +	u64 buf_size_max;
> +
> +	struct mutex tx_lock;
> +	struct mutex rx_lock;
> +
> +	struct list_head rx_queue;
> +	u64 rx_bytes;
> +
> +	/* Protected by trans->tx_lock */
> +	u64 tx_cnt;
> +	u64 buf_alloc;
> +	u64 peer_fwd_cnt;
> +	u64 peer_buf_alloc;
> +	/* Protected by trans->rx_lock */
> +	u64 fwd_cnt;
> +};
> +
> +struct virtio_vsock_pkt {
> +	struct virtio_vsock_hdr	hdr;
> +	struct virtio_transport	*trans;
> +	struct work_struct work;
> +	struct list_head list;
> +	void *buf;
> +	u32 len;
> +	u32 off;
> +};
> +
> +struct virtio_vsock_pkt_info {
> +	struct sockaddr_vm *src;
> +	struct sockaddr_vm *dst;
> +	struct iovec *iov;
> +	u32 len;
> +	u8 type;
> +	u8 op;
> +	u8 shut;
> +};
> +
> +struct virtio_transport_pkt_ops {
> +	int (*send_pkt)(struct vsock_sock *vsk,
> +			struct virtio_vsock_pkt_info *info);
> +};
> +
> +void virtio_vsock_dumppkt(const char *func,
> +			  const struct virtio_vsock_pkt *pkt);
> +
> +struct sock *
> +virtio_transport_get_pending(struct sock *listener,
> +			     struct virtio_vsock_pkt *pkt);
> +struct virtio_vsock_pkt *
> +virtio_transport_alloc_pkt(struct vsock_sock *vsk,
> +			   struct virtio_vsock_pkt_info *info,
> +			   size_t len,
> +			   u32 src_cid,
> +			   u32 src_port,
> +			   u32 dst_cid,
> +			   u32 dst_port);
> +ssize_t
> +virtio_transport_stream_dequeue(struct vsock_sock *vsk,
> +				struct iovec *iov,
> +				size_t len,
> +				int type);
> +int
> +virtio_transport_dgram_dequeue(struct kiocb *kiocb,
> +			       struct vsock_sock *vsk,
> +			       struct msghdr *msg,
> +			       size_t len, int flags);
> +
> +s64 virtio_transport_stream_has_data(struct vsock_sock *vsk);
> +s64 virtio_transport_stream_has_space(struct vsock_sock *vsk);
> +
> +int virtio_transport_do_socket_init(struct vsock_sock *vsk,
> +				 struct vsock_sock *psk);
> +u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk);
> +u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk);
> +u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk);
> +void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val);
> +void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val);
> +void virtio_transport_set_max_buffer_size(struct vsock_sock *vs, u64 val);
> +int
> +virtio_transport_notify_poll_in(struct vsock_sock *vsk,
> +				size_t target,
> +				bool *data_ready_now);
> +int
> +virtio_transport_notify_poll_out(struct vsock_sock *vsk,
> +				 size_t target,
> +				 bool *space_available_now);
> +
> +int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
> +	size_t target, struct vsock_transport_recv_notify_data *data);
> +int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
> +	size_t target, struct vsock_transport_recv_notify_data *data);
> +int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
> +	size_t target, struct vsock_transport_recv_notify_data *data);
> +int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
> +	size_t target, ssize_t copied, bool data_read,
> +	struct vsock_transport_recv_notify_data *data);
> +int virtio_transport_notify_send_init(struct vsock_sock *vsk,
> +	struct vsock_transport_send_notify_data *data);
> +int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
> +	struct vsock_transport_send_notify_data *data);
> +int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
> +	struct vsock_transport_send_notify_data *data);
> +int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
> +	ssize_t written, struct vsock_transport_send_notify_data *data);
> +
> +u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
> +bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
> +bool virtio_transport_stream_allow(u32 cid, u32 port);
> +int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> +				struct sockaddr_vm *addr);
> +bool virtio_transport_dgram_allow(u32 cid, u32 port);
> +
> +int virtio_transport_connect(struct vsock_sock *vsk);
> +
> +int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
> +
> +void virtio_transport_release(struct vsock_sock *vsk);
> +
> +ssize_t
> +virtio_transport_stream_enqueue(struct vsock_sock *vsk,
> +				struct iovec *iov,
> +				size_t len);
> +int
> +virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> +			       struct sockaddr_vm *remote_addr,
> +			       struct iovec *iov,
> +			       size_t len);
> +
> +void virtio_transport_destruct(struct vsock_sock *vsk);
> +
> +void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt);
> +void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
> +void virtio_transport_inc_tx_pkt(struct virtio_vsock_pkt *pkt);
> +void virtio_transport_dec_tx_pkt(struct virtio_vsock_pkt *pkt);
> +u64 virtio_transport_get_credit(struct virtio_transport *trans);
> +#endif /* _LINUX_VIRTIO_VSOCK_H */
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 284fc3a..8a27609 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -39,5 +39,6 @@
>  #define VIRTIO_ID_9P		9 /* 9p virtio console */
>  #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
>  #define VIRTIO_ID_CAIF	       12 /* Virtio caif */
> +#define VIRTIO_ID_VSOCK        13 /* virtio vsock transport */
>  
>  #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> new file mode 100644
> index 0000000..0a58ac3
> --- /dev/null
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -0,0 +1,70 @@
> +/*
> + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> + * anyone can use the definitions to implement compatible drivers/servers:
> + *
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * Copyright (C) Red Hat, Inc., 2013
> + * Copyright (C) Asias He <asias@redhat.com>, 2013
> + */
> +
> +#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
> +#define _UAPI_LINUX_VIRTIO_VSOCK_H
> +
> +#include <linux/types.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +
> +struct virtio_vsock_config {
> +	__u32 guest_cid;
> +	__u32 max_virtqueue_pairs;
> +} __packed;
> +
> +struct virtio_vsock_hdr {
> +	__u32	src_cid;
> +	__u32	src_port;
> +	__u32	dst_cid;
> +	__u32	dst_port;
> +	__u32	len;
> +	__u8	type;
> +	__u8	op;
> +	__u8	shut;
> +	__u64	fwd_cnt;
> +	__u64	buf_alloc;
> +} __packed;
> +
> +enum {
> +	VIRTIO_VSOCK_OP_INVALID = 0,
> +	VIRTIO_VSOCK_OP_REQUEST = 1,
> +	VIRTIO_VSOCK_OP_NEGOTIATE = 2,
> +	VIRTIO_VSOCK_OP_OFFER = 3,
> +	VIRTIO_VSOCK_OP_ATTACH = 4,
> +	VIRTIO_VSOCK_OP_RW = 5,
> +	VIRTIO_VSOCK_OP_CREDIT = 6,
> +	VIRTIO_VSOCK_OP_RST = 7,
> +	VIRTIO_VSOCK_OP_SHUTDOWN = 8,
> +};
> +
> +#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> new file mode 100644
> index 0000000..0482eb1
> --- /dev/null
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -0,0 +1,992 @@
> +/*
> + * common code for virtio vsock
> + *
> + * Copyright (C) 2013 Red Hat, Inc.
> + * Author: Asias He <asias@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +#include <linux/module.h>
> +#include <linux/ctype.h>
> +#include <linux/list.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +#include <linux/virtio_vsock.h>
> +
> +#include <net/sock.h>
> +#include "af_vsock.h"
> +
> +#define SS_LISTEN 255
> +
> +void virtio_vsock_dumppkt(const char *func,  const struct virtio_vsock_pkt *pkt)
> +{
> +	pr_debug("%s: pkt=%p, op=%d, len=%d, %d:%d---%d:%d, len=%d\n",
> +		 func, pkt, pkt->hdr.op, pkt->hdr.len,
> +		 pkt->hdr.src_cid, pkt->hdr.src_port,
> +		 pkt->hdr.dst_cid, pkt->hdr.dst_port, pkt->len);
> +}
> +EXPORT_SYMBOL_GPL(virtio_vsock_dumppkt);
> +
> +struct virtio_vsock_pkt *
> +virtio_transport_alloc_pkt(struct vsock_sock *vsk,
> +			   struct virtio_vsock_pkt_info *info,
> +			   size_t len,
> +			   u32 src_cid,
> +			   u32 src_port,
> +			   u32 dst_cid,
> +			   u32 dst_port)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt *pkt;
> +	int err;
> +
> +	BUG_ON(!trans);
> +
> +	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> +	if (!pkt)
> +		return NULL;
> +
> +	pkt->hdr.type		= info->type;
> +	pkt->hdr.op		= info->op;
> +	pkt->hdr.src_cid	= src_cid;
> +	pkt->hdr.src_port	= src_port;
> +	pkt->hdr.dst_cid	= dst_cid;
> +	pkt->hdr.dst_port	= dst_port;
> +	pkt->hdr.len		= len;
> +	pkt->hdr.shut		= info->shut;
> +	pkt->len		= len;
> +	pkt->trans		= trans;
> +
> +	if (info->iov && len > 0) {
> +		pkt->buf = kmalloc(len, GFP_KERNEL);
> +		if (!pkt->buf)
> +			goto out_pkt;
> +		err = memcpy_fromiovec(pkt->buf, info->iov, len);
> +		if (err)
> +			goto out;
> +	}
> +
> +	return pkt;
> +
> +out:
> +	kfree(pkt->buf);
> +out_pkt:
> +	kfree(pkt);
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_alloc_pkt);
> +
> +struct sock *
> +virtio_transport_get_pending(struct sock *listener,
> +			     struct virtio_vsock_pkt *pkt)
> +{
> +	struct vsock_sock *vlistener;
> +	struct vsock_sock *vpending;
> +	struct sockaddr_vm src;
> +	struct sockaddr_vm dst;
> +	struct sock *pending;
> +
> +	vsock_addr_init(&src, pkt->hdr.src_cid, pkt->hdr.src_port);
> +	vsock_addr_init(&dst, pkt->hdr.dst_cid, pkt->hdr.dst_port);
> +
> +	vlistener = vsock_sk(listener);
> +	list_for_each_entry(vpending, &vlistener->pending_links,
> +			    pending_links) {
> +		if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
> +		    vsock_addr_equals_addr(&dst, &vpending->local_addr)) {
> +			pending = sk_vsock(vpending);
> +			sock_hold(pending);
> +			return pending;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_get_pending);
> +
> +static void virtio_transport_inc_rx_pkt(struct virtio_vsock_pkt *pkt)
> +{
> +	pkt->trans->rx_bytes += pkt->len;
> +}
> +
> +static void virtio_transport_dec_rx_pkt(struct virtio_vsock_pkt *pkt)
> +{
> +	pkt->trans->rx_bytes -= pkt->len;
> +	pkt->trans->fwd_cnt += pkt->len;
> +}
> +
> +void virtio_transport_inc_tx_pkt(struct virtio_vsock_pkt *pkt)
> +{
> +	mutex_lock(&pkt->trans->tx_lock);
> +	pkt->hdr.fwd_cnt = pkt->trans->fwd_cnt;
> +	pkt->hdr.buf_alloc = pkt->trans->buf_alloc;
> +	pkt->trans->tx_cnt += pkt->len;
> +	mutex_unlock(&pkt->trans->tx_lock);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
> +
> +void virtio_transport_dec_tx_pkt(struct virtio_vsock_pkt *pkt)
> +{
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dec_tx_pkt);
> +
> +u64 virtio_transport_get_credit(struct virtio_transport *trans)
> +{
> +	u64 credit;
> +
> +	mutex_lock(&trans->tx_lock);
> +	credit = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
> +	mutex_unlock(&trans->tx_lock);

So two callers can call virtio_transport_get_credit and
both get credit; later the credit goes negative.

You must hold the lock until you increment tx_cnt, I think.
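
Something like a get-and-consume helper would close that race (rough
sketch only, not the patch's code):

   u64 virtio_transport_get_credit(struct virtio_transport *trans, u64 wanted)
   {
           u64 credit, ret;

           mutex_lock(&trans->tx_lock);
           credit = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
           ret = wanted > credit ? credit : wanted;
           /* Account for the reserved bytes while still holding the lock,
            * so a concurrent sender cannot claim the same credit. */
           trans->tx_cnt += ret;
           mutex_unlock(&trans->tx_lock);

           return ret;
   }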

> +
> +	pr_debug("credit=%lld, buf_alloc=%lld, peer_buf_alloc=%lld,"
> +		 "tx_cnt=%lld, fwd_cnt=%lld, peer_fwd_cnt=%lld\n",
> +		 credit, trans->buf_alloc, trans->peer_buf_alloc,
> +		 trans->tx_cnt, trans->fwd_cnt, trans->peer_fwd_cnt);
> +
> +	return credit;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_get_credit);
> +
> +static int virtio_transport_send_credit(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_CREDIT,
> +		.type = SOCK_STREAM,
> +	};
> +
> +	pr_debug("%s: sk=%p send_credit\n", __func__, vsk);
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +
> +static ssize_t
> +virtio_transport_do_dequeue(struct vsock_sock *vsk,
> +			    struct iovec *iov,
> +			    size_t len,
> +			    int type)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt *pkt;
> +	size_t bytes, total = 0;
> +	int err = -EFAULT;
> +
> +	mutex_lock(&trans->rx_lock);
> +	while (total < len && trans->rx_bytes > 0  &&
> +			!list_empty(&trans->rx_queue)) {
> +		pkt = list_first_entry(&trans->rx_queue,
> +				       struct virtio_vsock_pkt, list);
> +
> +		if (pkt->hdr.type != type)
> +			continue;
> +
> +		bytes = len - total;
> +		if (bytes > pkt->len - pkt->off)
> +			bytes = pkt->len - pkt->off;
> +
> +		err = memcpy_toiovec(iov, pkt->buf + pkt->off, bytes);
> +		if (err)
> +			goto out;
> +		total += bytes;
> +		pkt->off += bytes;
> +		if (pkt->off == pkt->len) {
> +			virtio_transport_dec_rx_pkt(pkt);
> +			list_del(&pkt->list);
> +			virtio_transport_free_pkt(pkt);
> +		}
> +	}
> +	mutex_unlock(&trans->rx_lock);
> +
> +	/* Send a credit pkt to peer */
> +	if (type == SOCK_STREAM)
> +		virtio_transport_send_credit(vsk);
> +
> +	return total;
> +
> +out:
> +	mutex_unlock(&trans->rx_lock);
> +	if (total)
> +		err = total;
> +	return err;
> +}
> +
> +ssize_t
> +virtio_transport_stream_dequeue(struct vsock_sock *vsk,
> +				struct iovec *iov,
> +				size_t len, int flags)
> +{
> +	if (flags & MSG_PEEK)
> +		return -EOPNOTSUPP;
> +
> +	return virtio_transport_do_dequeue(vsk, iov, len, SOCK_STREAM);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_stream_dequeue);
> +
> +static void
> +virtio_transport_recv_dgram(struct sock *sk,
> +			    struct virtio_vsock_pkt *pkt)
> +{
> +	struct sk_buff *skb;
> +	struct vsock_sock *vsk;
> +	size_t size;
> +
> +	vsk = vsock_sk(sk);
> +
> +	pkt->len = pkt->hdr.len;
> +	pkt->off = 0;
> +
> +	size = sizeof(*pkt) + pkt->len;
> +	/* Attach the packet to the socket's receive queue as an sk_buff. */
> +	skb = alloc_skb(size, GFP_ATOMIC);
> +	if (!skb)
> +		goto out;
> +
> +	/* sk_receive_skb() will do a sock_put(), so hold here. */
> +	sock_hold(sk);
> +	skb_put(skb, size);
> +	memcpy(skb->data, pkt, sizeof(*pkt));
> +	memcpy(skb->data + sizeof(*pkt), pkt->buf, pkt->len);
> +
> +	sk_receive_skb(sk, skb, 0);
> +out:
> +	virtio_transport_free_pkt(pkt);
> +}
> +
> +int
> +virtio_transport_dgram_dequeue(struct kiocb *kiocb,
> +			       struct vsock_sock *vsk,
> +			       struct msghdr *msg,
> +			       size_t len, int flags)
> +{
> +	struct virtio_vsock_pkt *pkt;
> +	struct sk_buff *skb;
> +	size_t payload_len;
> +	int noblock;
> +	int err;
> +
> +	noblock = flags & MSG_DONTWAIT;
> +
> +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> +		return -EOPNOTSUPP;
> +
> +	msg->msg_namelen = 0;
> +
> +	/* Retrieve the head sk_buff from the socket's receive queue. */
> +	err = 0;
> +	skb = skb_recv_datagram(&vsk->sk, flags, noblock, &err);
> +	if (err)
> +		return err;
> +	if (!skb)
> +		return -EAGAIN;
> +
> +	pkt = (struct virtio_vsock_pkt *)skb->data;
> +	if (!pkt)
> +		goto out;
> +
> +	/* FIXME: check payload_len */
> +	payload_len = pkt->len;
> +
> +	/* Place the datagram payload in the user's iovec. */
> +	err = skb_copy_datagram_iovec(skb, sizeof(*pkt),
> +				      msg->msg_iov, payload_len);
> +	if (err)
> +		goto out;
> +
> +	if (msg->msg_name) {
> +		struct sockaddr_vm *vm_addr;
> +
> +		/* Provide the address of the sender. */
> +		vm_addr = (struct sockaddr_vm *)msg->msg_name;
> +		vsock_addr_init(vm_addr, pkt->hdr.src_cid, pkt->hdr.src_port);
> +		msg->msg_namelen = sizeof(*vm_addr);
> +	}
> +	err = payload_len;
> +
> +out:
> +	skb_free_datagram(&vsk->sk, skb);
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> +
> +s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	size_t bytes = 0;
> +
> +	mutex_lock(&trans->rx_lock);
> +	bytes = trans->rx_bytes;
> +	mutex_unlock(&trans->rx_lock);
> +
> +	return bytes;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_stream_has_data);
> +
> +static s64 __virtio_transport_stream_has_space(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	s64 bytes;
> +
> +	bytes = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
> +	if (bytes < 0)
> +		bytes = 0;
> +
> +	return bytes;
> +}
> +
> +s64 virtio_transport_stream_has_space(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	s64 bytes;
> +
> +	mutex_lock(&trans->tx_lock);
> +	bytes = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
> +	if (bytes < 0)
> +		bytes = 0;
> +	mutex_unlock(&trans->tx_lock);
> +	pr_debug("%s: bytes=%lld\n", __func__, (long long)bytes);
> +
> +	return bytes;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_stream_has_space);
> +
> +int virtio_transport_do_socket_init(struct vsock_sock *vsk,
> +				    struct vsock_sock *psk)
> +{
> +	struct virtio_transport *trans;
> +
> +	trans = kzalloc(sizeof(*trans), GFP_KERNEL);
> +	if (!trans)
> +		return -ENOMEM;
> +
> +	vsk->trans = trans;
> +	trans->vsk = vsk;
> +	if (psk) {
> +		struct virtio_transport *ptrans = psk->trans;
> +		trans->buf_size	= ptrans->buf_size;
> +		trans->buf_size_min = ptrans->buf_size_min;
> +		trans->buf_size_max = ptrans->buf_size_max;
> +	} else {
> +		trans->buf_size = VIRTIO_VSOCK_DEFAULT_BUF_SIZE;
> +		trans->buf_size_min = VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE;
> +		trans->buf_size_max = VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE;
> +	}
> +
> +	trans->buf_alloc = trans->buf_size;
> +	pr_debug("%s: trans->buf_alloc=%lld\n", __func__, trans->buf_alloc);
> +
> +	mutex_init(&trans->rx_lock);
> +	mutex_init(&trans->tx_lock);
> +	INIT_LIST_HEAD(&trans->rx_queue);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_do_socket_init);
> +
> +u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	return trans->buf_size;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_get_buffer_size);
> +
> +u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	return trans->buf_size_min;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_get_min_buffer_size);
> +
> +u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	return trans->buf_size_max;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_get_max_buffer_size);
> +
> +void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	if (val < trans->buf_size_min)
> +		trans->buf_size_min = val;
> +	if (val > trans->buf_size_max)
> +		trans->buf_size_max = val;
> +	trans->buf_size = val;
> +	trans->buf_alloc = val;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_set_buffer_size);
> +
> +void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	if (val > trans->buf_size)
> +		trans->buf_size = val;
> +	trans->buf_size_min = val;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_set_min_buffer_size);
> +
> +void virtio_transport_set_max_buffer_size(struct vsock_sock *vsk, u64 val)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	if (val < trans->buf_size)
> +		trans->buf_size = val;
> +	trans->buf_size_max = val;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_set_max_buffer_size);
> +
> +int
> +virtio_transport_notify_poll_in(struct vsock_sock *vsk,
> +				size_t target,
> +				bool *data_ready_now)
> +{
> +	if (vsock_stream_has_data(vsk))
> +		*data_ready_now = true;
> +	else
> +		*data_ready_now = false;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_in);
> +
> +int
> +virtio_transport_notify_poll_out(struct vsock_sock *vsk,
> +				 size_t target,
> +				 bool *space_avail_now)
> +{
> +	s64 free_space;
> +
> +	free_space = vsock_stream_has_space(vsk);
> +	if (free_space > 0)
> +		*space_avail_now = true;
> +	else if (free_space == 0)
> +		*space_avail_now = false;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_out);
> +
> +int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
> +	size_t target, struct vsock_transport_recv_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_init);
> +
> +int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
> +	size_t target, struct vsock_transport_recv_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_block);
> +
> +int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
> +	size_t target, struct vsock_transport_recv_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_dequeue);
> +
> +int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
> +	size_t target, ssize_t copied, bool data_read,
> +	struct vsock_transport_recv_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_post_dequeue);
> +
> +int virtio_transport_notify_send_init(struct vsock_sock *vsk,
> +	struct vsock_transport_send_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_init);
> +
> +int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
> +	struct vsock_transport_send_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_block);
> +
> +int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
> +	struct vsock_transport_send_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_enqueue);
> +
> +int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
> +	ssize_t written, struct vsock_transport_send_notify_data *data)
> +{
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_post_enqueue);
> +
> +u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	return trans->buf_size;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_stream_rcvhiwat);
> +
> +bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
> +{
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
> +
> +bool virtio_transport_stream_allow(u32 cid, u32 port)
> +{
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> +
> +int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> +				struct sockaddr_vm *addr)
> +{
> +	return vsock_bind_dgram_generic(vsk, addr);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> +
> +bool virtio_transport_dgram_allow(u32 cid, u32 port)
> +{
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> +
> +int virtio_transport_connect(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_REQUEST,
> +		.type = SOCK_STREAM,
> +	};
> +
> +	pr_debug("%s: vsk=%p send_request\n", __func__, vsk);
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_connect);
> +
> +int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_SHUTDOWN,
> +		.type = SOCK_STREAM,
> +		.shut = mode,
> +	};
> +
> +	pr_debug("%s: vsk=%p: send_shutdown\n", __func__, vsk);
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_shutdown);
> +
> +void virtio_transport_release(struct vsock_sock *vsk)
> +{
> +	pr_debug("%s: vsk=%p\n", __func__, vsk);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_release);
> +
> +int
> +virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> +			       struct sockaddr_vm *remote_addr,
> +			       struct iovec *iov,
> +			       size_t len)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_RW,
> +		.type = SOCK_DGRAM,
> +		.iov = iov,
> +		.len = len,
> +	};
> +
> +	vsk->remote_addr = *remote_addr;
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> +
> +ssize_t
> +virtio_transport_stream_enqueue(struct vsock_sock *vsk,
> +				struct iovec *iov,
> +				size_t len)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_RW,
> +		.type = SOCK_STREAM,
> +		.iov = iov,
> +		.len = len,
> +	};
> +
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_stream_enqueue);
> +
> +void virtio_transport_destruct(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +
> +	pr_debug("%s: vsk=%p\n", __func__, vsk);
> +	kfree(trans);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_destruct);
> +
> +static int virtio_transport_send_attach(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_ATTACH,
> +		.type = SOCK_STREAM,
> +	};
> +
> +	pr_debug("%s: vsk=%p send_attach\n", __func__, vsk);
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +
> +static int virtio_transport_send_offer(struct vsock_sock *vsk)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_OFFER,
> +		.type = SOCK_STREAM,
> +	};
> +
> +	pr_debug("%s: sk=%p send_offer\n", __func__, vsk);
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +
> +static int virtio_transport_send_reset(struct vsock_sock *vsk,
> +				       struct virtio_vsock_pkt *pkt)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_RST,
> +		.type = SOCK_STREAM,
> +	};
> +
> +	pr_debug("%s\n", __func__);
> +
> +	/* Send RST only if the original pkt is not a RST pkt */
> +	if (pkt->hdr.op == VIRTIO_VSOCK_OP_RST)
> +		return 0;
> +
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +
> +static int
> +virtio_transport_recv_connecting(struct sock *sk,
> +				 struct virtio_vsock_pkt *pkt)
> +{
> +	struct vsock_sock *vsk = vsock_sk(sk);
> +	int err;
> +	int skerr;
> +
> +	pr_debug("%s: vsk=%p\n", __func__, vsk);
> +	switch (pkt->hdr.op) {
> +	case VIRTIO_VSOCK_OP_ATTACH:
> +		pr_debug("%s: got attach\n", __func__);
> +		sk->sk_state = SS_CONNECTED;
> +		sk->sk_socket->state = SS_CONNECTED;
> +		vsock_insert_connected(vsk);
> +		sk->sk_state_change(sk);
> +		break;
> +	case VIRTIO_VSOCK_OP_NEGOTIATE:
> +		pr_debug("%s: got negotiate and send_offer\n", __func__);
> +		err = virtio_transport_send_offer(vsk);
> +		if (err < 0) {
> +			skerr = -err;
> +			goto destroy;
> +		}
> +		break;
> +	case VIRTIO_VSOCK_OP_INVALID:
> +		pr_debug("%s: got invalid\n", __func__);
> +		break;
> +	case VIRTIO_VSOCK_OP_RST:
> +		pr_debug("%s: got rst\n", __func__);
> +		skerr = ECONNRESET;
> +		err = 0;
> +		goto destroy;
> +	default:
> +		pr_debug("%s: got def\n", __func__);
> +		skerr = EPROTO;
> +		err = -EINVAL;
> +		goto destroy;
> +	}
> +	return 0;
> +
> +destroy:
> +	virtio_transport_send_reset(vsk, pkt);
> +	sk->sk_state = SS_UNCONNECTED;
> +	sk->sk_err = skerr;
> +	sk->sk_error_report(sk);
> +	return err;
> +}
> +
> +static int
> +virtio_transport_recv_connected(struct sock *sk,
> +				struct virtio_vsock_pkt *pkt)
> +{
> +	struct vsock_sock *vsk = vsock_sk(sk);
> +	struct virtio_transport *trans = vsk->trans;
> +	int err = 0;
> +
> +	switch (pkt->hdr.op) {
> +	case VIRTIO_VSOCK_OP_RW:
> +		pkt->len = pkt->hdr.len;
> +		pkt->off = 0;
> +		pkt->trans = trans;
> +
> +		mutex_lock(&trans->rx_lock);
> +		virtio_transport_inc_rx_pkt(pkt);
> +		list_add_tail(&pkt->list, &trans->rx_queue);
> +		mutex_unlock(&trans->rx_lock);
> +
> +		sk->sk_data_ready(sk, pkt->len);
> +		return err;
> +	case VIRTIO_VSOCK_OP_CREDIT:
> +		sk->sk_write_space(sk);
> +		break;
> +	case VIRTIO_VSOCK_OP_SHUTDOWN:
> +		pr_debug("%s: got shutdown\n", __func__);
> +		if (pkt->hdr.shut) {
> +			vsk->peer_shutdown |= pkt->hdr.shut;
> +			sk->sk_state_change(sk);
> +		}
> +		break;
> +	case VIRTIO_VSOCK_OP_RST:
> +		pr_debug("%s: got rst\n", __func__);
> +		sock_set_flag(sk, SOCK_DONE);
> +		vsk->peer_shutdown = SHUTDOWN_MASK;
> +		if (vsock_stream_has_data(vsk) <= 0)
> +			sk->sk_state = SS_DISCONNECTING;
> +		sk->sk_state_change(sk);
> +		break;
> +	default:
> +		err = -EINVAL;
> +		break;
> +	}
> +
> +	virtio_transport_free_pkt(pkt);
> +	return err;
> +}
> +
> +static int
> +virtio_transport_send_negotiate(struct vsock_sock *vsk,
> +				struct virtio_vsock_pkt *pkt)
> +{
> +	struct virtio_transport *trans = vsk->trans;
> +	struct virtio_vsock_pkt_info info = {
> +		.op = VIRTIO_VSOCK_OP_NEGOTIATE,
> +		.type = SOCK_STREAM,
> +	};
> +
> +	pr_debug("%s: send_negotiate\n", __func__);
> +
> +	return trans->ops->send_pkt(vsk, &info);
> +}
> +
> +/* Handle server socket */
> +static int
> +virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt)
> +{
> +	struct vsock_sock *vsk = vsock_sk(sk);
> +	struct vsock_sock *vpending;
> +	struct sock *pending;
> +	int err;
> +
> +	pending = virtio_transport_get_pending(sk, pkt);
> +	if (pending) {
> +		pr_debug("virtio_transport_recv_listen: get pending\n");
> +		vpending = vsock_sk(pending);
> +		lock_sock(pending);
> +		switch (pending->sk_state) {
> +		case SS_CONNECTING:
> +			if (pkt->hdr.op != VIRTIO_VSOCK_OP_OFFER) {
> +				pr_debug("%s: != OP_OFFER op=%d\n",
> +					 __func__, pkt->hdr.op);
> +				virtio_transport_send_reset(vpending, pkt);
> +				pending->sk_err = EPROTO;
> +				pending->sk_state = SS_UNCONNECTED;
> +				sock_put(pending);
> +			} else {
> +				pending->sk_state = SS_CONNECTED;
> +				vsock_insert_connected(vpending);
> +
> +				vsock_remove_pending(sk, pending);
> +				vsock_enqueue_accept(sk, pending);
> +
> +				virtio_transport_send_attach(vpending);
> +				sk->sk_state_change(sk);
> +			}
> +			err = 0;
> +			break;
> +		default:
> +			pr_debug("%s: sk->sk_ack_backlog=%d\n", __func__,
> +				 sk->sk_ack_backlog);
> +			virtio_transport_send_reset(vpending, pkt);
> +			err = -EINVAL;
> +			break;
> +		}
> +		if (err < 0)
> +			vsock_remove_pending(sk, pending);
> +		release_sock(pending);
> +
> +		/* Release refcnt obtained in virtio_transport_get_pending */
> +		sock_put(pending);
> +
> +		return err;
> +	}
> +
> +	if (pkt->hdr.op != VIRTIO_VSOCK_OP_REQUEST) {
> +		virtio_transport_send_reset(vsk, pkt);
> +		pr_debug("%s:op != OP_REQUEST op = %d\n",
> +			 __func__, pkt->hdr.op);
> +		return -EINVAL;
> +	}
> +
> +	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
> +		pr_debug("%s: sk->sk_ack_backlog=%d\n", __func__,
> +			 sk->sk_ack_backlog);
> +		virtio_transport_send_reset(vsk, pkt);
> +		return -ECONNREFUSED;
> +	}
> +
> +	/* No pending socket is responsible for this pkt, create one */
> +	pending = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
> +				 sk->sk_type);
> +	if (!pending) {
> +		virtio_transport_send_reset(vsk, pkt);
> +		return -ENOMEM;
> +	}
> +	pr_debug("virtio_transport_recv_listen: create pending\n");
> +
> +	vpending = vsock_sk(pending);
> +	vsock_addr_init(&vpending->local_addr, pkt->hdr.dst_cid,
> +			pkt->hdr.dst_port);
> +	vsock_addr_init(&vpending->remote_addr, pkt->hdr.src_cid,
> +			pkt->hdr.src_port);
> +
> +	vsock_add_pending(sk, pending);
> +
> +	err = virtio_transport_send_negotiate(vpending, pkt);
> +	if (err < 0) {
> +		virtio_transport_send_reset(vsk, pkt);
> +		sock_put(pending);
> +		return err;
> +	}
> +
> +	sk->sk_ack_backlog++;
> +
> +	pending->sk_state = SS_CONNECTING;
> +
> +	/* Clean up in case no further message is received for this socket */
> +	vpending->listener = sk;
> +	sock_hold(sk);
> +	sock_hold(pending);
> +	INIT_DELAYED_WORK(&vpending->dwork, vsock_pending_work);
> +	schedule_delayed_work(&vpending->dwork, HZ);
> +
> +	return 0;
> +}
> +
> +void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
> +{
> +	struct virtio_transport *trans;
> +	struct sockaddr_vm src, dst;
> +	struct vsock_sock *vsk;
> +	struct sock *sk;
> +
> +	vsock_addr_init(&src, pkt->hdr.src_cid, pkt->hdr.src_port);
> +	vsock_addr_init(&dst, pkt->hdr.dst_cid, pkt->hdr.dst_port);
> +
> +	virtio_vsock_dumppkt(__func__, pkt);
> +
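> +	/* Demultiplex on the header: DGRAM packets go to an unbound socket,
> +	 * stream packets to a connected or bound socket.
> +	 */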
> +	if (pkt->hdr.type == SOCK_DGRAM) {
> +		sk = vsock_find_unbound_socket(&dst);
> +		if (!sk)
> +			goto free_pkt;
> +		return virtio_transport_recv_dgram(sk, pkt);
> +	}
> +
> +	/* The socket must be in connected or bound table
> +	 * otherwise send reset back
> +	 */
> +	sk = vsock_find_connected_socket(&src, &dst);
> +	if (!sk) {
> +		sk = vsock_find_bound_socket(&dst);
> +		if (!sk) {
> +			pr_debug("%s: can not find bound_socket\n", __func__);
> +			virtio_vsock_dumppkt(__func__, pkt);
> +			/* Ignore this pkt instead of sending reset back */
> +			goto free_pkt;
> +		}
> +	}
> +
> +	vsk = vsock_sk(sk);
> +	trans = vsk->trans;
> +	BUG_ON(!trans);
> +
> +	mutex_lock(&trans->tx_lock);
> +	trans->peer_buf_alloc = pkt->hdr.buf_alloc;
> +	trans->peer_fwd_cnt = pkt->hdr.fwd_cnt;
> +	if (__virtio_transport_stream_has_space(vsk))
> +		sk->sk_write_space(sk);
> +	mutex_unlock(&trans->tx_lock);
> +
> +	lock_sock(sk);
> +	switch (sk->sk_state) {
> +	case SS_LISTEN:
> +		virtio_transport_recv_listen(sk, pkt);
> +		virtio_transport_free_pkt(pkt);
> +		break;
> +	case SS_CONNECTING:
> +		virtio_transport_recv_connecting(sk, pkt);
> +		virtio_transport_free_pkt(pkt);
> +		break;
> +	case SS_CONNECTED:
> +		virtio_transport_recv_connected(sk, pkt);
> +		break;
> +	default:
> +		break;
> +	}
> +	release_sock(sk);
> +
> +	/* Release refcnt obtained when we fetched this socket out of the
> +	 * bound or connected list.
> +	 */
> +	sock_put(sk);
> +	return;
> +
> +free_pkt:
> +	virtio_transport_free_pkt(pkt);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
> +
> +void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> +{
> +	kfree(pkt->buf);
> +	kfree(pkt);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
> +
> +static int __init virtio_vsock_common_init(void)
> +{
> +	return 0;
> +}
> +
> +static void __exit virtio_vsock_common_exit(void)
> +{
> +}
> +
> +module_init(virtio_vsock_common_init);
> +module_exit(virtio_vsock_common_exit);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Asias He");
> +MODULE_DESCRIPTION("common code for virtio vsock");
> -- 
> 1.8.1.4

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 4/5] VSOCK: Introduce vhost-vsock.ko
  2013-06-27  8:00 ` [RFC 4/5] VSOCK: Introduce vhost-vsock.ko Asias He
@ 2013-06-27 10:42   ` Michael S. Tsirkin
  2013-06-28  2:38     ` Andy King
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Michael S. Tsirkin @ 2013-06-27 10:42 UTC (permalink / raw)
  To: Asias He
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Sasha Levin, David S. Miller

On Thu, Jun 27, 2013 at 04:00:03PM +0800, Asias He wrote:
> VM sockets vhost transport implementation. This module runs in host
> kernel.
> 
> Signed-off-by: Asias He <asias@redhat.com>

Has any thought been given to how this affects migration?
I don't see any API for an application to
move to a different host and reconnect to a running
vsock in the guest.

I think we could merge without this, as there are more
pressing issues, but it's probably a requirement
if you want this to replace e.g. serial in many
scenarios.

> ---
>  drivers/vhost/vsock.c | 534 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vsock.h |   4 +
>  2 files changed, 538 insertions(+)
>  create mode 100644 drivers/vhost/vsock.c
>  create mode 100644 drivers/vhost/vsock.h
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> new file mode 100644
> index 0000000..cb54090
> --- /dev/null
> +++ b/drivers/vhost/vsock.c
> @@ -0,0 +1,534 @@
> +/*
> + * vhost transport for vsock
> + *
> + * Copyright (C) 2013 Red Hat, Inc.
> + * Author: Asias He <asias@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + */
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <net/sock.h>
> +#include <linux/virtio_vsock.h>
> +#include <linux/vhost.h>
> +
> +#include "../../../net/vmw_vsock/af_vsock.h"

Send a patch to move this to include/linux?

> +#include "vhost.h"
> +#include "vsock.h"
> +
> +#define VHOST_VSOCK_DEFAULT_HOST_CID	2;

Sure you want that ';' there? It can result in strange code, e.g.

	int a = VHOST_VSOCK_DEFAULT_HOST_CID + 1;

expands to 'int a = 2; + 1;' and silently sets a to 2.
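
Dropping the ';' from the define avoids that whole class of surprise
(untested sketch):

	#define VHOST_VSOCK_DEFAULT_HOST_CID	2

	/* ... */
	int a = VHOST_VSOCK_DEFAULT_HOST_CID + 1;	/* now 3, as expected */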

> +
> +static int vhost_transport_socket_init(struct vsock_sock *vsk,
> +				       struct vsock_sock *psk);
> +
> +enum {
> +	VHOST_VSOCK_FEATURES = VHOST_FEATURES,
> +};
> +
> +/* Used to track all the vhost_vsock instacne on the system. */

typo

> +static LIST_HEAD(vhost_vsock_list);
> +static DEFINE_MUTEX(vhost_vsock_mutex);
> +
> +struct vhost_vsock_virtqueue {
> +	struct vhost_virtqueue vq;
> +};
> +
> +struct vhost_vsock {
> +	/* Vhost device */
> +	struct vhost_dev dev;
> +	/* Vhost vsock virtqueue*/
> +	struct vhost_vsock_virtqueue vqs[VSOCK_VQ_MAX];
> +	/* Link to global vhost_vsock_list*/
> +	struct list_head list;
> +	/* Head for pkt from host to guest */
> +	struct list_head send_pkt_list;
> +	/* Work item to send pkt */
> +	struct vhost_work send_pkt_work;
> +	/* Guest context id this vhost_vsock instance handles */
> +	u32 guest_cid;
> +};
> +
> +static u32 vhost_transport_get_local_cid(void)
> +{
> +	u32 cid = VHOST_VSOCK_DEFAULT_HOST_CID;
> +	return cid;
> +}
> +

Interesting. So all hosts in fact have the same CID?

> +static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> +{
> +	struct vhost_vsock *vsock;
> +
> +	mutex_lock(&vhost_vsock_mutex);
> +	list_for_each_entry(vsock, &vhost_vsock_list, list) {
> +		if (vsock->guest_cid == guest_cid) {
> +			mutex_unlock(&vhost_vsock_mutex);
> +			return vsock;
> +		}
> +	}
> +	mutex_unlock(&vhost_vsock_mutex);
> +
> +	return NULL;
> +}
> +
> +static void
> +vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> +			    struct vhost_virtqueue *vq)
> +{
> +	struct virtio_vsock_pkt *pkt;
> +	unsigned out, in;
> +	struct sock *sk;
> +	int head, ret;
> +
> +	mutex_lock(&vq->mutex);
> +	vhost_disable_notify(&vsock->dev, vq);
> +	for (;;) {
> +		if (list_empty(&vsock->send_pkt_list)) {
> +			vhost_enable_notify(&vsock->dev, vq);
> +			break;
> +		}
> +
> +		head = vhost_get_vq_desc(&vsock->dev, vq, vq->iov,
> +					ARRAY_SIZE(vq->iov), &out, &in,
> +					NULL, NULL);
> +		pr_debug("%s: head = %d\n", __func__, head);
> +		if (head < 0)
> +			break;
> +
> +		if (head == vq->num) {
> +			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
> +				vhost_disable_notify(&vsock->dev, vq);
> +				continue;
> +			}
> +			break;
> +		}
> +
> +		pkt = list_first_entry(&vsock->send_pkt_list,
> +				       struct virtio_vsock_pkt, list);
> +		list_del_init(&pkt->list);
> +
> +		/* FIXME: no assumption of frame layout */

Please fix. memcpy_toiovec() is not harder to use here.
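
Something along these lines, assuming memcpy_toiovec() is usable in
this context (untested sketch; it advances the iovec as it copies, so
nothing depends on how the guest split the descriptor):

	/* assumes iov_length(vq->iov, in) was already validated against
	 * sizeof(pkt->hdr) + pkt->len
	 */
	ret = memcpy_toiovec(vq->iov, (unsigned char *)&pkt->hdr,
			     sizeof(pkt->hdr));
	if (!ret && pkt->buf && pkt->len > 0)
		ret = memcpy_toiovec(vq->iov, pkt->buf, pkt->len);
	if (ret) {
		virtio_transport_free_pkt(pkt);
		vq_err(vq, "Faulted on copying pkt\n");
		break;
	}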

> +		ret = __copy_to_user(vq->iov[0].iov_base, &pkt->hdr,
> +				     sizeof(pkt->hdr));
> +		if (ret) {
> +			virtio_transport_free_pkt(pkt);
> +			vq_err(vq, "Faulted on copying pkt hdr\n");
> +			break;
> +		}
> +		if (pkt->buf && pkt->len > 0) {
> +			ret = __copy_to_user(vq->iov[1].iov_base, pkt->buf,
> +					    pkt->len);
> +			if (ret) {
> +				virtio_transport_free_pkt(pkt);
> +				vq_err(vq, "Faulted on copying pkt buf\n");
> +				break;
> +			}
> +		}
> +
> +		vhost_add_used(vq, head, pkt->len);
> +
> +		virtio_transport_dec_tx_pkt(pkt);
> +
> +		sk = sk_vsock(pkt->trans->vsk);
> +		/* Release refcnt taken in vhost_transport_send_pkt */
> +		sock_put(sk);
> +
> +		virtio_transport_free_pkt(pkt);
> +	}
> +	vhost_signal(&vsock->dev, vq);

I think you should not signal if the used ring was not updated.
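
E.g. track it with a flag (untested sketch):

	bool added = false;

	/* inside the loop, after the descriptor has been filled in: */
	vhost_add_used(vq, head, pkt->len);
	added = true;

	/* after the loop, instead of the unconditional signal: */
	if (likely(added))
		vhost_signal(&vsock->dev, vq);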

> +	mutex_unlock(&vq->mutex);
> +}
> +
> +static void vhost_transport_send_pkt_work(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq;
> +	struct vhost_vsock *vsock;
> +
> +	vsock = container_of(work, struct vhost_vsock, send_pkt_work);
> +	vq = &vsock->vqs[VSOCK_VQ_RX].vq;
> +
> +	vhost_transport_do_send_pkt(vsock, vq);
> +}
> +
> +static int
> +vhost_transport_send_pkt(struct vsock_sock *vsk,
> +			 struct virtio_vsock_pkt_info *info)
> +{
> +	u32 src_cid, src_port, dst_cid, dst_port;
> +	struct virtio_transport *trans;
> +	struct virtio_vsock_pkt *pkt;
> +	struct vhost_virtqueue *vq;
> +	struct vhost_vsock *vsock;
> +	u64 credit;
> +
> +	src_cid = vhost_transport_get_local_cid();

Interestingly, this is the only place the cid
is used. Shouldn't we validate it?

> +	src_port = vsk->local_addr.svm_port;
> +	dst_cid = vsk->remote_addr.svm_cid;
> +	dst_port = vsk->remote_addr.svm_port;
> +
> +	/* Find the vhost_vsock according to guest context id  */
> +	vsock = vhost_vsock_get(dst_cid);

Confused. There's a single socket per dst cid?

> +	if (!vsock)
> +		return -ENODEV;
> +
> +	trans = vsk->trans;
> +	vq = &vsock->vqs[VSOCK_VQ_RX].vq;
> +
> +	if (info->type == SOCK_STREAM) {
> +		credit = virtio_transport_get_credit(trans);
> +		if (info->len > credit)
> +			info->len = credit;

Is there support for non-stream sockets?
Without credits, you get all kinds of nasty
starvation issues.

> +	}
> +	if (info->len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
> +		info->len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> +	/* Do not send zero length OP_RW pkt*/
> +	if (info->len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> +		return info->len;
> +
> +	pkt = virtio_transport_alloc_pkt(vsk, info, info->len,
> +					 src_cid, src_port,
> +					 dst_cid, dst_port);

We also need a global limit on the amount of memory per
socket. Even if the remote is OK with getting 20G from us,
we might not have that much kernel memory.
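
Even something simple per socket would help, e.g. (rough sketch,
assuming the pkt allocations get charged to sk_wmem_alloc so the
accounting means something):

	if (sk_wmem_alloc_get(&vsk->sk) + info->len > vsk->sk.sk_sndbuf)
		return -ENOBUFS;	/* or block until some pkts are freed */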

> +	if (!pkt)
> +		return -ENOMEM;
> +
> +	pr_debug("%s:info->len= %d\n", __func__, info->len);
> +	/* Released in vhost_transport_do_send_pkt */
> +	sock_hold(&trans->vsk->sk);
> +	virtio_transport_inc_tx_pkt(pkt);
> +
> +	/* queue it up in vhost work */
> +	mutex_lock(&vq->mutex);
> +	list_add_tail(&pkt->list, &vsock->send_pkt_list);
> +	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
> +	mutex_unlock(&vq->mutex);
> +
> +	return info->len;
> +}
> +
> +static struct virtio_transport_pkt_ops vhost_ops = {
> +	.send_pkt = vhost_transport_send_pkt,
> +};
> +
> +static struct virtio_vsock_pkt *
> +vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq)
> +{
> +	struct virtio_vsock_pkt *pkt;
> +	int ret;
> +	int len;
> +
> +	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> +	if (!pkt)
> +		return NULL;
> +
> +	len = sizeof(pkt->hdr);
> +	if (unlikely(vq->iov[0].iov_len != len)) {
> +		vq_err(vq, "Expecting pkt->hdr = %d, got %zu bytes\n",
> +		       len, vq->iov[0].iov_len);
> +		kfree(pkt);
> +		return NULL;
> +	}
> +	ret = __copy_from_user(&pkt->hdr, vq->iov[0].iov_base, len);
> +	if (ret) {
> +		vq_err(vq, "Faulted on virtio_vsock_hdr\n");
> +		kfree(pkt);
> +		return NULL;
> +	}
> +
> +	pkt->len = pkt->hdr.len;
> +	pkt->off = 0;
> +
> +	/* No payload */
> +	if (!pkt->len)
> +		return pkt;
> +
> +	/* The pkt is too big */
> +	if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> +		kfree(pkt);
> +		return NULL;
> +	}
> +
> +	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
> +	if (!pkt->buf) {
> +		kfree(pkt);
> +		return NULL;
> +	}
> +
> +	ret = __copy_from_user(pkt->buf, vq->iov[1].iov_base, pkt->len);
> +	if (ret) {
> +		vq_err(vq, "Faulted on copying pkt buf\n");
> +		virtio_transport_free_pkt(pkt);
> +		return NULL;
> +	}
> +
> +	return pkt;
> +}
> +
> +static void vhost_vsock_handle_ctl_kick(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> +						  poll.work);
> +	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
> +						 dev);
> +
> +	pr_debug("%s vq=%p, vsock=%p\n", __func__, vq, vsock);
> +}
> +
> +static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> +						  poll.work);
> +	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
> +						 dev);
> +	struct virtio_vsock_pkt *pkt;
> +	int head, out, in;
> +	u32 len;
> +
> +	mutex_lock(&vq->mutex);
> +	vhost_disable_notify(&vsock->dev, vq);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&vsock->dev, vq, vq->iov,
> +					ARRAY_SIZE(vq->iov), &out, &in,
> +					NULL, NULL);
> +		if (head < 0)
> +			break;
> +
> +		if (head == vq->num) {
> +			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
> +				vhost_disable_notify(&vsock->dev, vq);
> +				continue;
> +			}
> +			break;
> +		}
> +
> +		pkt = vhost_vsock_alloc_pkt(vq);
> +		if (!pkt) {
> +			vq_err(vq, "Faulted on pkt\n");
> +			continue;
> +		}
> +
> +		len = pkt->len;
> +		virtio_transport_recv_pkt(pkt);
> +		vhost_add_used(vq, head, len);
> +	}
> +	vhost_signal(&vsock->dev, vq);
> +	mutex_unlock(&vq->mutex);
> +}
> +
> +static void vhost_vsock_handle_rx_kick(struct vhost_work *work)
> +{
> +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> +						poll.work);
> +	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
> +						 dev);
> +
> +	vhost_transport_do_send_pkt(vsock, vq);
> +}
> +
> +static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
> +{
> +	struct vhost_virtqueue **vqs;
> +	struct vhost_vsock *vsock;
> +	int ret;
> +
> +	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL);
> +	if (!vsock)
> +		return -ENOMEM;
> +
> +	pr_debug("%s:vsock=%p\n", __func__, vsock);
> +
> +	vqs = kmalloc(VSOCK_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	vqs[VSOCK_VQ_CTRL] = &vsock->vqs[VSOCK_VQ_CTRL].vq;
> +	vqs[VSOCK_VQ_TX] = &vsock->vqs[VSOCK_VQ_TX].vq;
> +	vqs[VSOCK_VQ_RX] = &vsock->vqs[VSOCK_VQ_RX].vq;
> +	vsock->vqs[VSOCK_VQ_CTRL].vq.handle_kick = vhost_vsock_handle_ctl_kick;
> +	vsock->vqs[VSOCK_VQ_TX].vq.handle_kick = vhost_vsock_handle_tx_kick;
> +	vsock->vqs[VSOCK_VQ_RX].vq.handle_kick = vhost_vsock_handle_rx_kick;
> +
> +	ret = vhost_dev_init(&vsock->dev, vqs, VSOCK_VQ_MAX);
> +	if (ret < 0)
> +		goto out_vqs;
> +
> +	file->private_data = vsock;
> +	INIT_LIST_HEAD(&vsock->send_pkt_list);
> +	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
> +
> +	mutex_lock(&vhost_vsock_mutex);
> +	list_add_tail(&vsock->list, &vhost_vsock_list);
> +	mutex_unlock(&vhost_vsock_mutex);
> +	return ret;
> +
> +out_vqs:
> +	kfree(vqs);
> +out:
> +	kfree(vsock);
> +	return ret;
> +}
> +
> +static void vhost_vsock_flush(struct vhost_vsock *vsock)
> +{
> +	int i;
> +
> +	for (i = 0; i < VSOCK_VQ_MAX; i++)
> +		vhost_poll_flush(&vsock->vqs[i].vq.poll);
> +	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
> +}
> +
> +static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vhost_vsock *vsock = file->private_data;
> +
> +	mutex_lock(&vhost_vsock_mutex);
> +	list_del(&vsock->list);
> +	mutex_unlock(&vhost_vsock_mutex);
> +
> +	vhost_dev_stop(&vsock->dev);
> +	vhost_vsock_flush(vsock);
> +	vhost_dev_cleanup(&vsock->dev, false);
> +	kfree(vsock->dev.vqs);
> +	kfree(vsock);
> +	return 0;
> +}
> +
> +static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u32 guest_cid)
> +{
> +	mutex_lock(&vhost_vsock_mutex);
> +	vsock->guest_cid = guest_cid;
> +	pr_debug("%s:guest_cid=%d\n", __func__, guest_cid);
> +	mutex_unlock(&vhost_vsock_mutex);
> +
> +	return 0;
> +}
> +
> +static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
> +				  unsigned long arg)
> +{
> +	struct vhost_vsock *vsock = f->private_data;
> +	void __user *argp = (void __user *)arg;
> +	u64 __user *featurep = argp;
> +	u32 __user *cidp = argp;
> +	u32 guest_cid;
> +	u64 features;
> +	int r;
> +
> +	switch (ioctl) {
> +	case VHOST_VSOCK_SET_GUEST_CID:
> +		if (get_user(guest_cid, cidp))
> +			return -EFAULT;
> +		return vhost_vsock_set_cid(vsock, guest_cid);
> +	case VHOST_GET_FEATURES:
> +		features = VHOST_VSOCK_FEATURES;
> +		if (copy_to_user(featurep, &features, sizeof(features)))
> +			return -EFAULT;
> +		return 0;
> +	case VHOST_SET_FEATURES:
> +		if (copy_from_user(&features, featurep, sizeof(features)))
> +			return -EFAULT;
> +		return 0;
> +	default:
> +		mutex_lock(&vsock->dev.mutex);
> +		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
> +		if (r == -ENOIOCTLCMD)
> +			r = vhost_vring_ioctl(&vsock->dev, ioctl, argp);
> +		else
> +			vhost_vsock_flush(vsock);
> +		mutex_unlock(&vsock->dev.mutex);
> +		return r;
> +	}
> +}
> +
> +static const struct file_operations vhost_vsock_fops = {
> +	.owner          = THIS_MODULE,
> +	.open           = vhost_vsock_dev_open,
> +	.release        = vhost_vsock_dev_release,
> +	.llseek		= noop_llseek,
> +	.unlocked_ioctl = vhost_vsock_dev_ioctl,
> +};
> +
> +static struct miscdevice vhost_vsock_misc = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "vhost-vsock",
> +	.fops = &vhost_vsock_fops,
> +};
> +
> +static int
> +vhost_transport_socket_init(struct vsock_sock *vsk, struct vsock_sock *psk)
> +{
> +	struct virtio_transport *trans;
> +	int ret;
> +
> +	ret = virtio_transport_do_socket_init(vsk, psk);
> +	if (ret)
> +		return ret;
> +
> +	trans = vsk->trans;
> +	trans->ops = &vhost_ops;
> +
> +	return ret;
> +}
> +
> +static struct vsock_transport vhost_transport = {
> +	.get_local_cid            = vhost_transport_get_local_cid,
> +
> +	.init                     = vhost_transport_socket_init,
> +	.destruct                 = virtio_transport_destruct,
> +	.release                  = virtio_transport_release,
> +	.connect                  = virtio_transport_connect,
> +	.shutdown                 = virtio_transport_shutdown,
> +
> +	.dgram_enqueue            = virtio_transport_dgram_enqueue,
> +	.dgram_dequeue            = virtio_transport_dgram_dequeue,
> +	.dgram_bind               = virtio_transport_dgram_bind,
> +	.dgram_allow              = virtio_transport_dgram_allow,
> +
> +	.stream_enqueue           = virtio_transport_stream_enqueue,
> +	.stream_dequeue           = virtio_transport_stream_dequeue,
> +	.stream_has_data          = virtio_transport_stream_has_data,
> +	.stream_has_space         = virtio_transport_stream_has_space,
> +	.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
> +	.stream_is_active         = virtio_transport_stream_is_active,
> +	.stream_allow             = virtio_transport_stream_allow,
> +
> +	.notify_poll_in           = virtio_transport_notify_poll_in,
> +	.notify_poll_out          = virtio_transport_notify_poll_out,
> +	.notify_recv_init         = virtio_transport_notify_recv_init,
> +	.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
> +	.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
> +	.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
> +	.notify_send_init         = virtio_transport_notify_send_init,
> +	.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
> +	.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
> +	.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
> +
> +	.set_buffer_size          = virtio_transport_set_buffer_size,
> +	.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
> +	.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
> +	.get_buffer_size          = virtio_transport_get_buffer_size,
> +	.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
> +	.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
> +};
> +
> +static int __init vhost_vsock_init(void)
> +{
> +	int ret;
> +
> +	ret = vsock_core_init(&vhost_transport);
> +	if (ret < 0)
> +		return ret;
> +	return misc_register(&vhost_vsock_misc);
> +};
> +
> +static void __exit vhost_vsock_exit(void)
> +{
> +	misc_deregister(&vhost_vsock_misc);
> +	vsock_core_exit();
> +};
> +
> +module_init(vhost_vsock_init);
> +module_exit(vhost_vsock_exit);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Asias He");
> +MODULE_DESCRIPTION("vhost transport for vsock ");
> diff --git a/drivers/vhost/vsock.h b/drivers/vhost/vsock.h
> new file mode 100644
> index 0000000..0ddb107
> --- /dev/null
> +++ b/drivers/vhost/vsock.h
> @@ -0,0 +1,4 @@
> +#ifndef VHOST_VSOCK_H
> +#define VHOST_VSOCK_H
> +#define VHOST_VSOCK_SET_GUEST_CID _IOW(VHOST_VIRTIO, 0x60, __u32)

No SET without GET please.
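
I.e. something like (the GET ioctl number is picked arbitrarily here):

	#define VHOST_VSOCK_SET_GUEST_CID _IOW(VHOST_VIRTIO, 0x60, __u32)
	#define VHOST_VSOCK_GET_GUEST_CID _IOR(VHOST_VIRTIO, 0x61, __u32)

plus the matching case in vhost_vsock_dev_ioctl() that put_user()s
vsock->guest_cid back to userspace.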

> +#endif
> -- 
> 1.8.1.4

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/5] Introduce VM Sockets virtio transport
  2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
                   ` (5 preceding siblings ...)
  2013-06-27 10:23 ` [RFC 0/5] Introduce VM Sockets virtio transport Michael S. Tsirkin
@ 2013-06-27 19:03 ` Sasha Levin
  2013-06-28  6:26   ` Asias He
  6 siblings, 1 reply; 21+ messages in thread
From: Sasha Levin @ 2013-06-27 19:03 UTC (permalink / raw)
  To: Asias He
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Michael S. Tsirkin,
	David S. Miller

Hi Asias,

Looks nice! Some comments inline below (I've removed anything that mst already
commented on).

On 06/27/2013 03:59 AM, Asias He wrote:
> Hello guys,
>
> In commit d021c344051af91 (VSOCK: Introduce VM Sockets), VMware added VM
> Sockets support. VM Sockets allows communication between virtual
> machines and the hypervisor. VM Sockets is able to use different
> hyervisor neutral transport to transfer data. Currently, only VMware
> VMCI transport is supported.
>
> This series introduces virtio transport for VM Sockets.
>
> Any comments are appreciated! Thanks!
>
> Code:
> =========================
> 1) kernel bits
>     git://github.com/asias/linux.git vsock
>
> 2) userspace bits:
>     git://github.com/asias/linux-kvm.git vsock
>
> Howto:
> =========================
> Make sure you have these kernel options:
>
>    CONFIG_VSOCKETS=y
>    CONFIG_VIRTIO_VSOCKETS=y
>    CONFIG_VIRTIO_VSOCKETS_COMMON=y
>    CONFIG_VHOST_VSOCK=m
>
> $ git clone git://github.com/asias/linux-kvm.git
> $ cd linux-kvm/tools/kvm
> $ co -b vsock origin/vsock
> $ make
> $ modprobe vhost_vsock
> $ ./lkvm run -d os.img -k bzImage --vsock guest_cid
>
> Test:
> =========================
> I hacked busybox's http server and wget to run over vsock. Start http
> server in host and guest, download a 512MB file in guest and host
> simultaneously for 6000 times. Manged to run the http stress test.
>
> Also, I wrote a small libvsock.so to play the LD_PRELOAD trick and
> managed to make sshd and ssh work over virito-vsock without modifying
> the source code.

Why did it require hacking in the first place? Does running with kvmtool
and just doing regular networking over virtio-net running on top of vsock
achieve the same goal?

> Draft VM Sockets Virtio Device spec:
> =========================
> Appendix K: VM Sockets Device
>
> The virtio VM sockets device is a virtio transport device for VM Sockets. VM
> Sockets allows communication between virtual machines and the hypervisor.
>
> Configuration:
>
> Subsystem Device ID 13
>
> Virtqueues:
>      0:controlq; 1:receiveq0; 2:transmitq0 ... 2N+1:receivqN; 2N+2:transmitqN

controlq is "defined but not used"; is there something in mind for it? If not,
does it make sense to keep it here? We can always re-add it at the end, just
like in virtio-net.

> Feature bits:
>      Currently, no feature bits are defined.
>
> Device configuration layout:
>
> Two configuration fields are currently defined.
>
>     struct virtio_vsock_config {
>             __u32 guest_cid;
>             __u32 max_virtqueue_pairs;
>     } __packed;
>
> The guest_cid field specifies the guest context id which likes the guest IP
> address. The max_virtqueue_pairs field specifies the maximum number of receive
> and transmit virtqueue pairs (receiveq0 ...  receiveqN and transmitq0 ...
> transmitqN respectively; N = max_virtqueue_pairs - 1 ) that can be configured.
> The driver is free to use only one virtqueue pairs, or it can use more to
> achieve better performance.

How does the driver tell the device how many vqs it's planning on actually using?
Or is it assumed that all of them are in use?

>
> Device Initialization:
> The initialization routine should discover the device's virtqueues.
>
> Device Operation:
> Packets are transmitted by placing them in the transmitq0..transmitqN, and
> buffers for incoming packets are placed in the receiveq0..receiveqN. In each
> case, the packet itself is preceded by a header:
>
>     struct virtio_vsock_hdr {
>             __u32   src_cid;
>             __u32   src_port;
>             __u32   dst_cid;
>             __u32   dst_port;
>             __u32   len;
>             __u8    type;
>             __u8    op;
>             __u8    shut;
>             __u64   fwd_cnt;
>             __u64   buf_alloc;
>     } __packed;
>
> src_cid and dst_cid: specify the source and destination context id.
> src_port and dst_port: specify the source and destination port.
> len: specifies the size of the data payload, it could be zero if no data
> payload is transferred.
> type: specifies the type of the packet, it can be SOCK_STREAM or SOCK_DGRAM.
> op: specifies the operation of the packet, it is defined as follows.
>
>     enum {
>             VIRTIO_VSOCK_OP_INVALID = 0,
>             VIRTIO_VSOCK_OP_REQUEST = 1,
>             VIRTIO_VSOCK_OP_NEGOTIATE = 2,
>             VIRTIO_VSOCK_OP_OFFER = 3,
>             VIRTIO_VSOCK_OP_ATTACH = 4,
>             VIRTIO_VSOCK_OP_RW = 5,
>             VIRTIO_VSOCK_OP_CREDIT = 6,
>             VIRTIO_VSOCK_OP_RST = 7,
>             VIRTIO_VSOCK_OP_SHUTDOWN = 8,
>     };
>
> shut: specifies the shutdown mode when the socket is being shutdown. 1 is for
> receive shutdown, 2 is for transmit shutdown, 3 is for both receive and transmit
> shutdown.
> fwd_cnt: specifies the the number of bytes the receiver has forwarded to userspace.

For the previous packet? For the entire session? Reading ahead makes it clear, but
it's worth mentioning the context here just to make it easy for implementers.

> buf_alloc: specifies the size of the receiver's recieve buffer in bytes.
						  receive

> Virtio VM socket connection creation:
> 1) Client sends VIRTIO_VSOCK_OP_REQUEST to server
> 2) Server reponses with VIRTIO_VSOCK_OP_NEGOTIATE to client
> 3) Client sends VIRTIO_VSOCK_OP_OFFER to server
> 4) Server responses with VIRTIO_VSOCK_OP_ATTACH to client
>
> Virtio VM socket credit update:
> Virtio VM socket uses credit-based flow control. The sender maintains tx_cnt
> which counts the totoal number of bytes it has sent out, peer_fwd_cnt which
		   total
> counts the totoal number of byes the receiver has forwarded, and peer_buf_alloc
	     total
> which is the size of the receiver's receive buffer. The sender can send no more
> than the credit the receiver gives to the sender: credit = peer_buf_alloc -


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/5] Introduce VM Sockets virtio transport
  2013-06-27 10:23 ` [RFC 0/5] Introduce VM Sockets virtio transport Michael S. Tsirkin
@ 2013-06-28  2:25   ` Andy King
  2013-06-28  5:50     ` Asias He
  2013-06-28  6:12   ` Asias He
  1 sibling, 1 reply; 21+ messages in thread
From: Andy King @ 2013-06-28  2:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, Dmitry Torokhov, netdev, Reilly Grant, virtualization,
	Pekka Enberg, Sasha Levin, David S. Miller

Hi Michael,

> >            __u32 guest_cid;
> 
> Given that cid is like an IP address, 32 bit seems too
> limiting. I would go for a 64 bit one or maybe even 128 bit,
> so that e.g. GUIDs can be used there.

That's likely based on what vSockets uses, which is in turn based on
what the VMCI device exposes (which is what vSockets was originally
built on), so unfortunately it's too late to extend that type.
However, that still allows just under 2^32 VMs per host (there are
three reserved values).

> >            __u32   dst_port;
> 
> Ports are 32 bit? I guess most applications can't work with >16 bit.

As with the cid, the width of the port type comes from vSockets,
which is what this plugs into.

> Also, why put cid's in all packets? They are only needed
> when you create a connection, no? Afterwards port numbers
> can be used.

The cid is present in DGRAM packets and STREAM _control_ packets
(connection handshake, signal read/write and so forth).  I don't
think the intent here is for it to be in STREAM _data_ packets,
but Asias can clarify.

> > Virtio VM socket connection creation:
> > 1) Client sends VIRTIO_VSOCK_OP_REQUEST to server
> > 2) Server reponses with VIRTIO_VSOCK_OP_NEGOTIATE to client
> > 3) Client sends VIRTIO_VSOCK_OP_OFFER to server
> > 4) Server responses with VIRTIO_VSOCK_OP_ATTACH to client
> 
> What's the reason for a 4 stage setup? Most transports
> make do with 3.

I'm guessing that's also based on the original vSockets/VMCI
implementation, where the NEGOTIATE/OFFER stages are used to
negotiate the underlying transport buffer size (in VMCI, the
queuepair that underlies a STREAM socket).  The virtio
transport can probably use 3.

Thanks!
- Andy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 4/5] VSOCK: Introduce vhost-vsock.ko
  2013-06-27 10:42   ` Michael S. Tsirkin
  2013-06-28  2:38     ` Andy King
@ 2013-06-28  2:38     ` Andy King
  2013-06-28  6:55     ` Asias He
  2 siblings, 0 replies; 21+ messages in thread
From: Andy King @ 2013-06-28  2:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Asias He, netdev, kvm, virtualization, David S. Miller,
	Dmitry Torokhov, Reilly Grant, Rusty Russell, Jason Wang,
	Stefan Hajnoczi, Gerd Hoffmann, Pekka Enberg, Sasha Levin

Hi Michael,

> > +	u32 cid = VHOST_VSOCK_DEFAULT_HOST_CID;
> > +	return cid;
> > +}
> > +
> 
> Interesting. So all hosts in fact have the same CID?

"Host" here means the thing _below_ the VM.  Any process running on
the host OS can be addressed with cid 2.  Each VM gets its own cid.
So communication is always between VM x <-> host 2.  That makes for
easy lookup on the VM's part.  (Note that we further distinguish in
the VMCI transport between the hypervisor, specifically the VM's own
VMX, which is on cid 0, and the host on cid 2.)
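
(From the guest, that just looks like the usual vSockets pattern; a
rough userspace sketch, with the port number made up:

	struct sockaddr_vm sa = {
		.svm_family = AF_VSOCK,
		.svm_cid = 2,		/* VMADDR_CID_HOST, i.e. the host */
		.svm_port = 1234,
	};
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	connect(fd, (struct sockaddr *)&sa, sizeof(sa));

and a host-side service simply bind()s/listen()s on its own cid.)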

Thanks!
- Andy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 4/5] VSOCK: Introduce vhost-vsock.ko
  2013-06-27 10:42   ` Michael S. Tsirkin
@ 2013-06-28  2:38     ` Andy King
  2013-06-28  2:38     ` Andy King
  2013-06-28  6:55     ` Asias He
  2 siblings, 0 replies; 21+ messages in thread
From: Andy King @ 2013-06-28  2:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, Dmitry Torokhov, netdev, Reilly Grant, virtualization,
	Pekka Enberg, Sasha Levin, David S. Miller

Hi Michael,

> > +	u32 cid = VHOST_VSOCK_DEFAULT_HOST_CID;
> > +	return cid;
> > +}
> > +
> 
> Interesting. So all hosts in fact have the same CID?

"Host" here means the thing _below_ the VM.  Any process running on
the host OS can be addressed with cid 2.  Each VM gets its own cid.
So communication is always between VM x <-> host 2.  That makes for
easy lookup on the VM's part.  (Note that we further distinguish in
the VMCI transport between the hypervisor, specifically the VM's own
VMX, which is on cid 0, and the host on cid 2.)

Thanks!
- Andy

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/5] Introduce VM Sockets virtio transport
  2013-06-28  2:25   ` Andy King
@ 2013-06-28  5:50     ` Asias He
  0 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-28  5:50 UTC (permalink / raw)
  To: Andy King
  Cc: kvm, Dmitry Torokhov, netdev, Michael S. Tsirkin, Reilly Grant,
	virtualization, Pekka Enberg, Sasha Levin, David S. Miller

On Thu, Jun 27, 2013 at 07:25:40PM -0700, Andy King wrote:
> Hi Michael,
> 
> > >            __u32 guest_cid;
> > 
> > Given that cid is like an IP address, 32 bit seems too
> > limiting. I would go for a 64 bit one or maybe even 128 bit,
> > so that e.g. GUIDs can be used there.
> 
> That's likely based on what vSockets uses, which is in turn based on
> what the VMCI device exposes (which is what vSockets was originally
> built on), so unfortunately it's too late to extend that type.
> However, that still allows just under 2^32 VMs per host (there are
> three reserved values).

Yes, the 32-bit cid and port are defined by vSockets; we cannot go to 64 bit
for the virtio transport.

> > >            __u32   dst_port;
> > 
> > Ports are 32 bit? I guess most applications can't work with >16 bit.
> 
> As with the cid, the width of the port type comes from vSockets,
> which is what this plugs into.

Yes.

> > Also, why put cid's in all packets? They are only needed
> > when you create a connection, no? Afterwards port numbers
> > can be used.
> 
> The cid is present in DGRAM packets and STREAM _control_ packets
> (connection handshake, signal read/write and so forth).  I don't
> think the intent here is for it to be in STREAM _data_ packets,
> but Asias can clarify.

Virtio transport stream data packets are a bit different from how the VMCI
transport handles them. In VMCI, a dedicated queue pair is created for each
stream. In virtio, all streams share the same virtqueue pairs.

On the recv path, we need the cid and port information from the packet
header to figure out which socket is responsible for the packet.

> > > Virtio VM socket connection creation:
> > > 1) Client sends VIRTIO_VSOCK_OP_REQUEST to server
> > > 2) Server reponses with VIRTIO_VSOCK_OP_NEGOTIATE to client
> > > 3) Client sends VIRTIO_VSOCK_OP_OFFER to server
> > > 4) Server responses with VIRTIO_VSOCK_OP_ATTACH to client
> > 
> > What's the reason for a 4 stage setup? Most transports
> > make do with 3.
> 
> I'm guessing that's also based on the original vSockets/VMCI
> implementation, where the NEGOTIATE/OFFER stages are used to
> negotiate the underlying transport buffer size (in VMCI, the
> queuepair that underlies a STREAM socket).  The virtio
> transport can probably use 3.

Right, I wanted to follow how the VMCI transport does the connection setup.

We can drop the VIRTIO_VSOCK_OP_ATTACH stage and move the client into the
SS_CONNECTED state once it gets the VIRTIO_VSOCK_OP_NEGOTIATE pkt.

> Thanks!
> - Andy

-- 
Asias

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/5] Introduce VM Sockets virtio transport
  2013-06-27 10:23 ` [RFC 0/5] Introduce VM Sockets virtio transport Michael S. Tsirkin
  2013-06-28  2:25   ` Andy King
@ 2013-06-28  6:12   ` Asias He
  1 sibling, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-28  6:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Sasha Levin, David S. Miller

On Thu, Jun 27, 2013 at 01:23:24PM +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 27, 2013 at 03:59:59PM +0800, Asias He wrote:
> > Hello guys,
> > 
> > In commit d021c344051af91 (VSOCK: Introduce VM Sockets), VMware added VM
> > Sockets support. VM Sockets allows communication between virtual
> > machines and the hypervisor. VM Sockets is able to use different
> > hyervisor neutral transport to transfer data. Currently, only VMware
> > VMCI transport is supported. 
> > 
> > This series introduces virtio transport for VM Sockets.
> > 
> > Any comments are appreciated! Thanks!
> > 
> > Code:
> > =========================
> > 1) kernel bits
> >    git://github.com/asias/linux.git vsock
> > 
> > 2) userspace bits:
> >    git://github.com/asias/linux-kvm.git vsock
> > 
> > Howto:
> > =========================
> > Make sure you have these kernel options:
> > 
> >   CONFIG_VSOCKETS=y
> >   CONFIG_VIRTIO_VSOCKETS=y
> >   CONFIG_VIRTIO_VSOCKETS_COMMON=y
> >   CONFIG_VHOST_VSOCK=m
> > 
> > $ git clone git://github.com/asias/linux-kvm.git
> > $ cd linux-kvm/tools/kvm
> > $ co -b vsock origin/vsock
> > $ make
> > $ modprobe vhost_vsock
> > $ ./lkvm run -d os.img -k bzImage --vsock guest_cid
> > 
> > Test:
> > =========================
> > I hacked busybox's http server and wget to run over vsock. Start http
> > server in host and guest, download a 512MB file in guest and host
> > simultaneously for 6000 times. Manged to run the http stress test.
> > 
> > Also, I wrote a small libvsock.so to play the LD_PRELOAD trick and
> > managed to make sshd and ssh work over virito-vsock without modifying
> > the source code.
> > 
> > Draft VM Sockets Virtio Device spec:
> > =========================
> > Appendix K: VM Sockets Device
> > 
> > The virtio VM sockets device is a virtio transport device for VM Sockets. VM
> > Sockets allows communication between virtual machines and the hypervisor.
> > 
> > Configuration:
> > 
> > Subsystem Device ID 13
> > 
> > Virtqueues:
> >     0:controlq; 1:receiveq0; 2:transmitq0 ... 2N+1:receivqN; 2N+2:transmitqN
> > 
> > Feature bits:
> >     Currently, no feature bits are defined.
> > 
> > Device configuration layout:
> > 
> > Two configuration fields are currently defined.
> > 
> >    struct virtio_vsock_config {
> 
> which fields are RW,RO,WO?

All are RO.

> >            __u32 guest_cid;
> 
> Given that cid is like an IP address, 32 bit seems too
> limiting. I would go for a 64 bit one or maybe even 128 bit,
> so that e.g. GUIDs can be used there.
> 
> 
> >            __u32 max_virtqueue_pairs;
> 
> I'd make this little endian.

okay.

> >    } __packed;
> 
> 
> > 
> > The guest_cid field specifies the guest context id which likes the guest IP
> > address. The max_virtqueue_pairs field specifies the maximum number of receive
> > and transmit virtqueue pairs (receiveq0 ...  receiveqN and transmitq0 ...
> > transmitqN respectively; N = max_virtqueue_pairs - 1 ) that can be configured.
> > The driver is free to use only one virtqueue pairs, or it can use more to
> > achieve better performance.
> 
> Don't we need a field for driver to specify the # of VQs?

To make it simple, I want to make the driver use all of the virtqueue
pairs we offered in max_virtqueue_pairs.

> I note packets have no sequence numbers.
> This means that a given stream should only use
> a single VQ in each direction, correct?
> Maybe make this explicit.

Right, let's make it explicit.

> > 
> > Device Initialization:
> > The initialization routine should discover the device's virtqueues.
> > 
> > Device Operation:
> > Packets are transmitted by placing them in the transmitq0..transmitqN, and
> > buffers for incoming packets are placed in the receiveq0..receiveqN. In each
> > case, the packet itself is preceded by a header:
> > 
> >    struct virtio_vsock_hdr {
> 
> Let's make header explicitly little endian and avoid the
> heartburn we have with many other transports.

Sounds good to me.

> >            __u32   src_cid;
> >            __u32   src_port;
> >            __u32   dst_cid;
> >            __u32   dst_port;
> 
> Ports are 32 bit? I guess most applications can't work with >16 bit.
> 
> Also, why put cid's in all packets? They are only needed
> when you create a connection, no? Afterwards port numbers
> can be used.
> 
> >            __u32   len;
> >            __u8    type;
> >            __u8    op;
> >            __u8    shut;
> 
> Please add padding to align all field naturally.

ok.

> >            __u64   fwd_cnt;
> >            __u64   buf_alloc;
> 
> Is a 64 bit counter really needed? 64 bit math
> has portability limitations and performance overhead on many
> architectures.

Good point, but 32-bit tx_cnt and fwd_cnt counters overflow very easily
since they contain the total number of bytes ever transferred. If we want
to use 32-bit counters, we need to figure out how to handle overflow. (64
bit can overflow as well.)

buf_alloc can be 32 bit.
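
One way to keep them at 32 bit would be to rely on unsigned wraparound,
the way TCP sequence numbers do, since only the differences matter.
Rough sketch:

	u32 tx_cnt, peer_fwd_cnt, peer_buf_alloc;

	/* Bytes still in flight; correct across counter wraparound as
	 * long as the real difference fits in 32 bits.
	 */
	u32 in_flight = tx_cnt - peer_fwd_cnt;

	/* Cannot underflow if the sender never exceeds its credit. */
	u32 credit = peer_buf_alloc - in_flight;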

> >    } __packed;
> 
> Packing produces terrible code in many compilers.
> Please avoid packed structures on data path, instead,
> pad structures explicitly to align all fields naturally.

We already have packed structures in virtio-scsi:

include/linux/virtio_scsi.h

> 
> > 
> > src_cid and dst_cid: specify the source and destination context id.
> > src_port and dst_port: specify the source and destination port.
> > len: specifies the size of the data payload, it could be zero if no data
> > payload is transferred.
> > type: specifies the type of the packet, it can be SOCK_STREAM or SOCK_DGRAM.
> > op: specifies the operation of the packet, it is defined as follows.
> > 
> >    enum {
> >            VIRTIO_VSOCK_OP_INVALID = 0,
> >            VIRTIO_VSOCK_OP_REQUEST = 1,
> >            VIRTIO_VSOCK_OP_NEGOTIATE = 2,
> >            VIRTIO_VSOCK_OP_OFFER = 3,
> >            VIRTIO_VSOCK_OP_ATTACH = 4,
> >            VIRTIO_VSOCK_OP_RW = 5,
> >            VIRTIO_VSOCK_OP_CREDIT = 6,
> >            VIRTIO_VSOCK_OP_RST = 7,
> >            VIRTIO_VSOCK_OP_SHUTDOWN = 8,
> >    };
> > 
> > shut: specifies the shutdown mode when the socket is being shutdown. 1 is for
> > receive shutdown, 2 is for transmit shutdown, 3 is for both receive and transmit
> > shutdown.
> 
> It's only with VIRTIO_VSOCK_OP_SHUTDOWN - how about a generic
> flags field that is interpreted depending on op?

I can turn the shut field into a generic field interpreted depending on op.

> > fwd_cnt: specifies the the number of bytes the receiver has forwarded to userspace.
> > buf_alloc: specifies the size of the receiver's recieve buffer in bytes.
> > 
> > Virtio VM socket connection creation:
> > 1) Client sends VIRTIO_VSOCK_OP_REQUEST to server
> > 2) Server reponses with VIRTIO_VSOCK_OP_NEGOTIATE to client
> > 3) Client sends VIRTIO_VSOCK_OP_OFFER to server
> > 4) Server responses with VIRTIO_VSOCK_OP_ATTACH to client
> 
> What's the reason for a 4 stage setup? Most transports
> make do with 3.
> Also, at what stage can each side get/transmit packets?
> What happens in case of errors at each stage?
> Don't we want to distinguish between errors?
> (E.g. wrong cid, no app listening on port, etc)?

RST is used to handle these kinds of errors.

> This also appears to be vulnerable to a variant of
> a syn flood attack (guest attacking host).
> I think we need a cookie hash in there to prevent this.
> 
> 
> Can you describe connection teardown please?
> I see there are RST and SHUTDOWN messages.
> What rules do they obey?

SHUTDOWN follows the shutdown(2) system call. I can write more details
about each packet type.

> 
> > 
> > Virtio VM socket credit update:
> > Virtio VM socket uses credit-based flow control. The sender maintains tx_cnt
> > which counts the totoal number of bytes it has sent out, peer_fwd_cnt which
> > counts the totoal number of byes the receiver has forwarded, and peer_buf_alloc
> > which is the size of the receiver's receive buffer. The sender can send no more
> > than the credit the receiver gives to the sender: credit = peer_buf_alloc -
> > (tx_cnt - peer_fwd_cnt). The receiver can send VIRTIO_VSOCK_OP_CREDIT packet to
> > tell sender its current fwd_cnt and buf_alloc value explicitly. However, as an
> > optimization, the fwd_cnt and buf_alloc is always included in the packet header
> > virtio_vsock_hdr.
> > 
> > The guest driver should make the receive virtqueue as fully populated as
> > possible: if it runs out, the performance will suffer.
> > 
> > The controlq is used to control device. Currently, no control operation is
> > defined.
> > 
> > Asias He (5):
> >   VSOCK: Introduce vsock_find_unbound_socket and
> >     vsock_bind_dgram_generic
> >   VSOCK: Introduce virtio-vsock-common.ko
> >   VSOCK: Introduce virtio-vsock.ko
> >   VSOCK: Introduce vhost-vsock.ko
> >   VSOCK: Add Makefile and Kconfig
> > 
> >  drivers/vhost/Kconfig                   |   4 +
> >  drivers/vhost/Kconfig.vsock             |   7 +
> >  drivers/vhost/Makefile                  |   5 +
> >  drivers/vhost/vsock.c                   | 534 +++++++++++++++++
> >  drivers/vhost/vsock.h                   |   4 +
> >  include/linux/virtio_vsock.h            | 200 +++++++
> >  include/uapi/linux/virtio_ids.h         |   1 +
> >  include/uapi/linux/virtio_vsock.h       |  70 +++
> >  net/vmw_vsock/Kconfig                   |  18 +
> >  net/vmw_vsock/Makefile                  |   4 +
> >  net/vmw_vsock/af_vsock.c                |  70 +++
> >  net/vmw_vsock/af_vsock.h                |   2 +
> >  net/vmw_vsock/virtio_transport.c        | 424 ++++++++++++++
> >  net/vmw_vsock/virtio_transport_common.c | 992 ++++++++++++++++++++++++++++++++
> >  14 files changed, 2335 insertions(+)
> >  create mode 100644 drivers/vhost/Kconfig.vsock
> >  create mode 100644 drivers/vhost/vsock.c
> >  create mode 100644 drivers/vhost/vsock.h
> >  create mode 100644 include/linux/virtio_vsock.h
> >  create mode 100644 include/uapi/linux/virtio_vsock.h
> >  create mode 100644 net/vmw_vsock/virtio_transport.c
> >  create mode 100644 net/vmw_vsock/virtio_transport_common.c
> > 
> > -- 
> > 1.8.1.4

-- 
Asias

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 0/5] Introduce VM Sockets virtio transport
  2013-06-27 19:03 ` Sasha Levin
@ 2013-06-28  6:26   ` Asias He
  0 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-28  6:26 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Michael S. Tsirkin,
	David S. Miller

On Thu, Jun 27, 2013 at 03:03:01PM -0400, Sasha Levin wrote:
> Hi Asias,
> 
> Looks nice! Some comments inline below (I've removed anything that mst already
> commented on).

Thanks.

> On 06/27/2013 03:59 AM, Asias He wrote:
> >Hello guys,
> >
> >In commit d021c344051af91 (VSOCK: Introduce VM Sockets), VMware added VM
> >Sockets support. VM Sockets allows communication between virtual
> >machines and the hypervisor. VM Sockets is able to use different
> >hyervisor neutral transport to transfer data. Currently, only VMware
> >VMCI transport is supported.
> >
> >This series introduces virtio transport for VM Sockets.
> >
> >Any comments are appreciated! Thanks!
> >
> >Code:
> >=========================
> >1) kernel bits
> >    git://github.com/asias/linux.git vsock
> >
> >2) userspace bits:
> >    git://github.com/asias/linux-kvm.git vsock
> >
> >Howto:
> >=========================
> >Make sure you have these kernel options:
> >
> >   CONFIG_VSOCKETS=y
> >   CONFIG_VIRTIO_VSOCKETS=y
> >   CONFIG_VIRTIO_VSOCKETS_COMMON=y
> >   CONFIG_VHOST_VSOCK=m
> >
> >$ git clone git://github.com/asias/linux-kvm.git
> >$ cd linux-kvm/tools/kvm
> >$ co -b vsock origin/vsock
> >$ make
> >$ modprobe vhost_vsock
> >$ ./lkvm run -d os.img -k bzImage --vsock guest_cid
> >
> >Test:
> >=========================
> >I hacked busybox's http server and wget to run over vsock. Start http
> >server in host and guest, download a 512MB file in guest and host
> >simultaneously for 6000 times. Manged to run the http stress test.
> >
> >Also, I wrote a small libvsock.so to play the LD_PRELOAD trick and
> >managed to make sshd and ssh work over virito-vsock without modifying
> >the source code.
> 
> Why did it require hacking in the first place? Does running with kvmtool
> and just doing regular networking over virtio-net running on top of vsock
> achieves the same goal?

VM Sockets introduces a new address family, so we need to modify an
application's source code to make it use the new address family.

With vsock there is no need for virtio-net at all, and we cannot run
virtio-net on top of vsock.
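
For example, a vsock-aware client ends up doing something like this
(minimal userspace sketch; it assumes AF_VSOCK and struct sockaddr_vm are
visible via <linux/vm_sockets.h>, and the LD_PRELOAD library just performs
this translation behind the application's back):

	#include <unistd.h>
	#include <sys/socket.h>
	#include <linux/vm_sockets.h>

	static int vsock_connect(unsigned int cid, unsigned int port)
	{
		struct sockaddr_vm addr = {
			.svm_family = AF_VSOCK,
			.svm_cid = cid,
			.svm_port = port,
		};
		int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

		if (fd < 0)
			return -1;
		if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}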

> >Draft VM Sockets Virtio Device spec:
> >=========================
> >Appendix K: VM Sockets Device
> >
> >The virtio VM sockets device is a virtio transport device for VM Sockets. VM
> >Sockets allows communication between virtual machines and the hypervisor.
> >
> >Configuration:
> >
> >Subsystem Device ID 13
> >
> >Virtqueues:
> >     0:controlq; 1:receiveq0; 2:transmitq0 ... 2N+1:receivqN; 2N+2:transmitqN
> 
> controlq is "defined but not used", is there something in mind for it? if not,
> does it make sense keeping it here? we can always re-add it to the end just
> like in virtio-net.

In virtio-net, each queue pair contains a controlq, which I think is not
necessary for virtio-vsock. e.g.

receiveq0 transmitq0 controlq0

In virtio-scsi, we have a ctrlq, an eventq and N requestqs. I want a
similar layout here, to reserve a ctrl queue at the front.
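
With that layout, the queue indexes for pair n are simply (sketch):

	/* 0:controlq; receiveqn = 2n+1; transmitqn = 2n+2 */
	static inline unsigned int vsock_rx_vq(unsigned int n)
	{
		return 2 * n + 1;
	}

	static inline unsigned int vsock_tx_vq(unsigned int n)
	{
		return 2 * n + 2;
	}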

> >Feature bits:
> >     Currently, no feature bits are defined.
> >
> >Device configuration layout:
> >
> >Two configuration fields are currently defined.
> >
> >    struct virtio_vsock_config {
> >            __u32 guest_cid;
> >            __u32 max_virtqueue_pairs;
> >    } __packed;
> >
> >The guest_cid field specifies the guest context id which likes the guest IP
> >address. The max_virtqueue_pairs field specifies the maximum number of receive
> >and transmit virtqueue pairs (receiveq0 ...  receiveqN and transmitq0 ...
> >transmitqN respectively; N = max_virtqueue_pairs - 1 ) that can be configured.
> >The driver is free to use only one virtqueue pairs, or it can use more to
> >achieve better performance.
> 
> How does the driver tell the device how many vqs it's planning on actually using?
> or is it assumed that all of them are in use?

I want the driver to use all of them.
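
I.e. the driver would size everything from the config space; a sketch,
assuming the usual virtio config get callback:

	struct virtio_vsock_config cfg;
	u32 nvqs;

	vdev->config->get(vdev,
			  offsetof(struct virtio_vsock_config,
				   max_virtqueue_pairs),
			  &cfg.max_virtqueue_pairs,
			  sizeof(cfg.max_virtqueue_pairs));

	/* controlq + max_virtqueue_pairs rx/tx pairs */
	nvqs = 2 * cfg.max_virtqueue_pairs + 1;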

> >
> >Device Initialization:
> >The initialization routine should discover the device's virtqueues.
> >
> >Device Operation:
> >Packets are transmitted by placing them in the transmitq0..transmitqN, and
> >buffers for incoming packets are placed in the receiveq0..receiveqN. In each
> >case, the packet itself is preceded by a header:
> >
> >    struct virtio_vsock_hdr {
> >            __u32   src_cid;
> >            __u32   src_port;
> >            __u32   dst_cid;
> >            __u32   dst_port;
> >            __u32   len;
> >            __u8    type;
> >            __u8    op;
> >            __u8    shut;
> >            __u64   fwd_cnt;
> >            __u64   buf_alloc;
> >    } __packed;
> >
> >src_cid and dst_cid: specify the source and destination context id.
> >src_port and dst_port: specify the source and destination port.
> >len: specifies the size of the data payload, it could be zero if no data
> >payload is transferred.
> >type: specifies the type of the packet, it can be SOCK_STREAM or SOCK_DGRAM.
> >op: specifies the operation of the packet, it is defined as follows.
> >
> >    enum {
> >            VIRTIO_VSOCK_OP_INVALID = 0,
> >            VIRTIO_VSOCK_OP_REQUEST = 1,
> >            VIRTIO_VSOCK_OP_NEGOTIATE = 2,
> >            VIRTIO_VSOCK_OP_OFFER = 3,
> >            VIRTIO_VSOCK_OP_ATTACH = 4,
> >            VIRTIO_VSOCK_OP_RW = 5,
> >            VIRTIO_VSOCK_OP_CREDIT = 6,
> >            VIRTIO_VSOCK_OP_RST = 7,
> >            VIRTIO_VSOCK_OP_SHUTDOWN = 8,
> >    };
> >
> >shut: specifies the shutdown mode when the socket is being shutdown. 1 is for
> >receive shutdown, 2 is for transmit shutdown, 3 is for both receive and transmit
> >shutdown.
> >fwd_cnt: specifies the the number of bytes the receiver has forwarded to userspace.
> 
> For the previous packet? For the entire session? Reading ahead makes it clear but
> it's worth mentioning here the context just to make it easy for implementers.

Thanks. I will make this clearer.

> >buf_alloc: specifies the size of the receiver's recieve buffer in bytes.
> 						  receive
> 
> >Virtio VM socket connection creation:
> >1) Client sends VIRTIO_VSOCK_OP_REQUEST to server
> >2) Server reponses with VIRTIO_VSOCK_OP_NEGOTIATE to client
> >3) Client sends VIRTIO_VSOCK_OP_OFFER to server
> >4) Server responses with VIRTIO_VSOCK_OP_ATTACH to client
> >
> >Virtio VM socket credit update:
> >Virtio VM socket uses credit-based flow control. The sender maintains tx_cnt
> >which counts the totoal number of bytes it has sent out, peer_fwd_cnt which
> 		   total
> >counts the totoal number of byes the receiver has forwarded, and peer_buf_alloc
> 	     total
> >which is the size of the receiver's receive buffer. The sender can send no more
> >than the credit the receiver gives to the sender: credit = peer_buf_alloc -
> 
> 
> Thanks,
> Sasha
> 

-- 
Asias

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko
  2013-06-27 10:34   ` Michael S. Tsirkin
@ 2013-06-28  6:28     ` Asias He
  0 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-28  6:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Sasha Levin, David S. Miller

On Thu, Jun 27, 2013 at 01:34:30PM +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 27, 2013 at 04:00:01PM +0800, Asias He wrote:
> > This module contains the common code and header files for the following
> > virtio-vsock and virtio-vhost kernel modules.
> > 
> > Signed-off-by: Asias He <asias@redhat.com>
> > ---
> >  include/linux/virtio_vsock.h            | 200 +++++++
> >  include/uapi/linux/virtio_ids.h         |   1 +
> >  include/uapi/linux/virtio_vsock.h       |  70 +++
> >  net/vmw_vsock/virtio_transport_common.c | 992 ++++++++++++++++++++++++++++++++
> >  4 files changed, 1263 insertions(+)
> >  create mode 100644 include/linux/virtio_vsock.h
> >  create mode 100644 include/uapi/linux/virtio_vsock.h
> >  create mode 100644 net/vmw_vsock/virtio_transport_common.c
> > 
> > diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> > new file mode 100644
> > index 0000000..cd8ed95
> > --- /dev/null
> > +++ b/include/linux/virtio_vsock.h
> > @@ -0,0 +1,200 @@
> > +/*
> > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > + * anyone can use the definitions to implement compatible drivers/servers:
> > + *
> > + *
> > + * Redistribution and use in source and binary forms, with or without
> > + * modification, are permitted provided that the following conditions
> > + * are met:
> > + * 1. Redistributions of source code must retain the above copyright
> > + *    notice, this list of conditions and the following disclaimer.
> > + * 2. Redistributions in binary form must reproduce the above copyright
> > + *    notice, this list of conditions and the following disclaimer in the
> > + *    documentation and/or other materials provided with the distribution.
> > + * 3. Neither the name of IBM nor the names of its contributors
> > + *    may be used to endorse or promote products derived from this software
> > + *    without specific prior written permission.
> > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > + * SUCH DAMAGE.
> > + *
> > + * Copyright (C) Red Hat, Inc., 2013
> > + * Copyright (C) Asias He <asias@redhat.com>, 2013
> > + */
> > +
> > +#ifndef _LINUX_VIRTIO_VSOCK_H
> > +#define _LINUX_VIRTIO_VSOCK_H
> > +
> > +#include <uapi/linux/virtio_vsock.h>
> > +#include <linux/socket.h>
> > +#include <net/sock.h>
> > +
> > +#define VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE	128
> > +#define VIRTIO_VSOCK_DEFAULT_BUF_SIZE		(1024 * 256)
> > +#define VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE	(1024 * 256)
> > +#define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
> > +#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
> > +
> > +struct vsock_transport_recv_notify_data;
> > +struct vsock_transport_send_notify_data;
> > +struct sockaddr_vm;
> > +struct vsock_sock;
> > +
> > +enum {
> > +	VSOCK_VQ_CTRL	= 0,
> > +	VSOCK_VQ_RX	= 1, /* for host to guest data */
> > +	VSOCK_VQ_TX	= 2, /* for guest to host data */
> > +	VSOCK_VQ_MAX	= 3,
> > +};
> > +
> > +/* virtio transport socket state */
> > +struct virtio_transport {
> > +	struct virtio_transport_pkt_ops	*ops;
> > +	struct vsock_sock *vsk;
> > +
> > +	u64 buf_size;
> > +	u64 buf_size_min;
> > +	u64 buf_size_max;
> > +
> > +	struct mutex tx_lock;
> > +	struct mutex rx_lock;
> > +
> > +	struct list_head rx_queue;
> > +	u64 rx_bytes;
> > +
> > +	/* Protected by trans->tx_lock */
> > +	u64 tx_cnt;
> > +	u64 buf_alloc;
> > +	u64 peer_fwd_cnt;
> > +	u64 peer_buf_alloc;
> > +	/* Protected by trans->rx_lock */
> > +	u64 fwd_cnt;
> > +};
> > +
> > +struct virtio_vsock_pkt {
> > +	struct virtio_vsock_hdr	hdr;
> > +	struct virtio_transport	*trans;
> > +	struct work_struct work;
> > +	struct list_head list;
> > +	void *buf;
> > +	u32 len;
> > +	u32 off;
> > +};
> > +
> > +struct virtio_vsock_pkt_info {
> > +	struct sockaddr_vm *src;
> > +	struct sockaddr_vm *dst;
> > +	struct iovec *iov;
> > +	u32 len;
> > +	u8 type;
> > +	u8 op;
> > +	u8 shut;
> > +};
> > +
> > +struct virtio_transport_pkt_ops {
> > +	int (*send_pkt)(struct vsock_sock *vsk,
> > +			struct virtio_vsock_pkt_info *info);
> > +};
> > +
> > +void virtio_vsock_dumppkt(const char *func,
> > +			  const struct virtio_vsock_pkt *pkt);
> > +
> > +struct sock *
> > +virtio_transport_get_pending(struct sock *listener,
> > +			     struct virtio_vsock_pkt *pkt);
> > +struct virtio_vsock_pkt *
> > +virtio_transport_alloc_pkt(struct vsock_sock *vsk,
> > +			   struct virtio_vsock_pkt_info *info,
> > +			   size_t len,
> > +			   u32 src_cid,
> > +			   u32 src_port,
> > +			   u32 dst_cid,
> > +			   u32 dst_port);
> > +ssize_t
> > +virtio_transport_stream_dequeue(struct vsock_sock *vsk,
> > +				struct iovec *iov,
> > +				size_t len,
> > +				int type);
> > +int
> > +virtio_transport_dgram_dequeue(struct kiocb *kiocb,
> > +			       struct vsock_sock *vsk,
> > +			       struct msghdr *msg,
> > +			       size_t len, int flags);
> > +
> > +s64 virtio_transport_stream_has_data(struct vsock_sock *vsk);
> > +s64 virtio_transport_stream_has_space(struct vsock_sock *vsk);
> > +
> > +int virtio_transport_do_socket_init(struct vsock_sock *vsk,
> > +				 struct vsock_sock *psk);
> > +u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk);
> > +u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk);
> > +u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk);
> > +void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val);
> > +void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val);
> > +void virtio_transport_set_max_buffer_size(struct vsock_sock *vs, u64 val);
> > +int
> > +virtio_transport_notify_poll_in(struct vsock_sock *vsk,
> > +				size_t target,
> > +				bool *data_ready_now);
> > +int
> > +virtio_transport_notify_poll_out(struct vsock_sock *vsk,
> > +				 size_t target,
> > +				 bool *space_available_now);
> > +
> > +int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
> > +	size_t target, struct vsock_transport_recv_notify_data *data);
> > +int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
> > +	size_t target, struct vsock_transport_recv_notify_data *data);
> > +int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
> > +	size_t target, struct vsock_transport_recv_notify_data *data);
> > +int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
> > +	size_t target, ssize_t copied, bool data_read,
> > +	struct vsock_transport_recv_notify_data *data);
> > +int virtio_transport_notify_send_init(struct vsock_sock *vsk,
> > +	struct vsock_transport_send_notify_data *data);
> > +int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
> > +	struct vsock_transport_send_notify_data *data);
> > +int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
> > +	struct vsock_transport_send_notify_data *data);
> > +int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
> > +	ssize_t written, struct vsock_transport_send_notify_data *data);
> > +
> > +u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
> > +bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
> > +bool virtio_transport_stream_allow(u32 cid, u32 port);
> > +int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > +				struct sockaddr_vm *addr);
> > +bool virtio_transport_dgram_allow(u32 cid, u32 port);
> > +
> > +int virtio_transport_connect(struct vsock_sock *vsk);
> > +
> > +int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
> > +
> > +void virtio_transport_release(struct vsock_sock *vsk);
> > +
> > +ssize_t
> > +virtio_transport_stream_enqueue(struct vsock_sock *vsk,
> > +				struct iovec *iov,
> > +				size_t len);
> > +int
> > +virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> > +			       struct sockaddr_vm *remote_addr,
> > +			       struct iovec *iov,
> > +			       size_t len);
> > +
> > +void virtio_transport_destruct(struct vsock_sock *vsk);
> > +
> > +void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt);
> > +void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
> > +void virtio_transport_inc_tx_pkt(struct virtio_vsock_pkt *pkt);
> > +void virtio_transport_dec_tx_pkt(struct virtio_vsock_pkt *pkt);
> > +u64 virtio_transport_get_credit(struct virtio_transport *trans);
> > +#endif /* _LINUX_VIRTIO_VSOCK_H */
> > diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> > index 284fc3a..8a27609 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -39,5 +39,6 @@
> >  #define VIRTIO_ID_9P		9 /* 9p virtio console */
> >  #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
> >  #define VIRTIO_ID_CAIF	       12 /* Virtio caif */
> > +#define VIRTIO_ID_VSOCK        13 /* virtio vsock transport */
> >  
> >  #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> > new file mode 100644
> > index 0000000..0a58ac3
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_vsock.h
> > @@ -0,0 +1,70 @@
> > +/*
> > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > + * anyone can use the definitions to implement compatible drivers/servers:
> > + *
> > + *
> > + * Redistribution and use in source and binary forms, with or without
> > + * modification, are permitted provided that the following conditions
> > + * are met:
> > + * 1. Redistributions of source code must retain the above copyright
> > + *    notice, this list of conditions and the following disclaimer.
> > + * 2. Redistributions in binary form must reproduce the above copyright
> > + *    notice, this list of conditions and the following disclaimer in the
> > + *    documentation and/or other materials provided with the distribution.
> > + * 3. Neither the name of IBM nor the names of its contributors
> > + *    may be used to endorse or promote products derived from this software
> > + *    without specific prior written permission.
> > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > + * SUCH DAMAGE.
> > + *
> > + * Copyright (C) Red Hat, Inc., 2013
> > + * Copyright (C) Asias He <asias@redhat.com>, 2013
> > + */
> > +
> > +#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
> > +#define _UAPI_LINUX_VIRTIO_VOSCK_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/virtio_ids.h>
> > +#include <linux/virtio_config.h>
> > +
> > +struct virtio_vsock_config {
> > +	__u32 guest_cid;
> > +	__u32 max_virtqueue_pairs;
> > +} __packed;
> > +
> > +struct virtio_vsock_hdr {
> > +	__u32	src_cid;
> > +	__u32	src_port;
> > +	__u32	dst_cid;
> > +	__u32	dst_port;
> > +	__u32	len;
> > +	__u8	type;
> > +	__u8	op;
> > +	__u8	shut;
> > +	__u64	fwd_cnt;
> > +	__u64	buf_alloc;
> > +} __packed;
> > +
> > +enum {
> > +	VIRTIO_VSOCK_OP_INVALID = 0,
> > +	VIRTIO_VSOCK_OP_REQUEST = 1,
> > +	VIRTIO_VSOCK_OP_NEGOTIATE = 2,
> > +	VIRTIO_VSOCK_OP_OFFER = 3,
> > +	VIRTIO_VSOCK_OP_ATTACH = 4,
> > +	VIRTIO_VSOCK_OP_RW = 5,
> > +	VIRTIO_VSOCK_OP_CREDIT = 6,
> > +	VIRTIO_VSOCK_OP_RST = 7,
> > +	VIRTIO_VSOCK_OP_SHUTDOWN = 8,
> > +};
> > +
> > +#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
> > diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> > new file mode 100644
> > index 0000000..0482eb1
> > --- /dev/null
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -0,0 +1,992 @@
> > +/*
> > + * common code for virtio vsock
> > + *
> > + * Copyright (C) 2013 Red Hat, Inc.
> > + * Author: Asias He <asias@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.
> > + */
> > +#include <linux/module.h>
> > +#include <linux/ctype.h>
> > +#include <linux/list.h>
> > +#include <linux/virtio.h>
> > +#include <linux/virtio_ids.h>
> > +#include <linux/virtio_config.h>
> > +#include <linux/virtio_vsock.h>
> > +
> > +#include <net/sock.h>
> > +#include "af_vsock.h"
> > +
> > +#define SS_LISTEN 255
> > +
> > +void virtio_vsock_dumppkt(const char *func,  const struct virtio_vsock_pkt *pkt)
> > +{
> > +	pr_debug("%s: pkt=%p, op=%d, len=%d, %d:%d---%d:%d, len=%d\n",
> > +		 func, pkt, pkt->hdr.op, pkt->hdr.len,
> > +		 pkt->hdr.src_cid, pkt->hdr.src_port,
> > +		 pkt->hdr.dst_cid, pkt->hdr.dst_port, pkt->len);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_vsock_dumppkt);
> > +
> > +struct virtio_vsock_pkt *
> > +virtio_transport_alloc_pkt(struct vsock_sock *vsk,
> > +			   struct virtio_vsock_pkt_info *info,
> > +			   size_t len,
> > +			   u32 src_cid,
> > +			   u32 src_port,
> > +			   u32 dst_cid,
> > +			   u32 dst_port)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt *pkt;
> > +	int err;
> > +
> > +	BUG_ON(!trans);
> > +
> > +	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> > +	if (!pkt)
> > +		return NULL;
> > +
> > +	pkt->hdr.type		= info->type;
> > +	pkt->hdr.op		= info->op;
> > +	pkt->hdr.src_cid	= src_cid;
> > +	pkt->hdr.src_port	= src_port;
> > +	pkt->hdr.dst_cid	= dst_cid;
> > +	pkt->hdr.dst_port	= dst_port;
> > +	pkt->hdr.len		= len;
> > +	pkt->hdr.shut		= info->shut;
> > +	pkt->len		= len;
> > +	pkt->trans		= trans;
> > +
> > +	if (info->iov && len > 0) {
> > +		pkt->buf = kmalloc(len, GFP_KERNEL);
> > +		if (!pkt->buf)
> > +			goto out_pkt;
> > +		err = memcpy_fromiovec(pkt->buf, info->iov, len);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> > +	return pkt;
> > +
> > +out:
> > +	kfree(pkt->buf);
> > +out_pkt:
> > +	kfree(pkt);
> > +	return NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_alloc_pkt);
> > +
> > +struct sock *
> > +virtio_transport_get_pending(struct sock *listener,
> > +			     struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct vsock_sock *vlistener;
> > +	struct vsock_sock *vpending;
> > +	struct sockaddr_vm src;
> > +	struct sockaddr_vm dst;
> > +	struct sock *pending;
> > +
> > +	vsock_addr_init(&src, pkt->hdr.src_cid, pkt->hdr.src_port);
> > +	vsock_addr_init(&dst, pkt->hdr.dst_cid, pkt->hdr.dst_port);
> > +
> > +	vlistener = vsock_sk(listener);
> > +	list_for_each_entry(vpending, &vlistener->pending_links,
> > +			    pending_links) {
> > +		if (vsock_addr_equals_addr(&src, &vpending->remote_addr) &&
> > +		    vsock_addr_equals_addr(&dst, &vpending->local_addr)) {
> > +			pending = sk_vsock(vpending);
> > +			sock_hold(pending);
> > +			return pending;
> > +		}
> > +	}
> > +
> > +	return NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_get_pending);
> > +
> > +static void virtio_transport_inc_rx_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +	pkt->trans->rx_bytes += pkt->len;
> > +}
> > +
> > +static void virtio_transport_dec_rx_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +	pkt->trans->rx_bytes -= pkt->len;
> > +	pkt->trans->fwd_cnt += pkt->len;
> > +}
> > +
> > +void virtio_transport_inc_tx_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +	mutex_lock(&pkt->trans->tx_lock);
> > +	pkt->hdr.fwd_cnt = pkt->trans->fwd_cnt;
> > +	pkt->hdr.buf_alloc = pkt->trans->buf_alloc;
> > +	pkt->trans->tx_cnt += pkt->len;
> > +	mutex_unlock(&pkt->trans->tx_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_inc_tx_pkt);
> > +
> > +void virtio_transport_dec_tx_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dec_tx_pkt);
> > +
> > +u64 virtio_transport_get_credit(struct virtio_transport *trans)
> > +{
> > +	u64 credit;
> > +
> > +	mutex_lock(&trans->tx_lock);
> > +	credit = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
> > +	mutex_unlock(&trans->tx_lock);
> 
> So two callers can call virtio_transport_get_credit and
> both get a credit. Later credit gets negative.
> 
> You must have the lock until you increment tx_cnt I think.

Right, will fix it.

> > +
> > +	pr_debug("credit=%lld, buf_alloc=%lld, peer_buf_alloc=%lld,"
> > +		 "tx_cnt=%lld, fwd_cnt=%lld, peer_fwd_cnt=%lld\n",
> > +		 credit, trans->buf_alloc, trans->peer_buf_alloc,
> > +		 trans->tx_cnt, trans->fwd_cnt, trans->peer_fwd_cnt);
> > +
> > +	return credit;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_get_credit);
> > +
> > +static int virtio_transport_send_credit(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_CREDIT,
> > +		.type = SOCK_STREAM,
> > +	};
> > +
> > +	pr_debug("%s: sk=%p send_credit\n", __func__, vsk);
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +
> > +static ssize_t
> > +virtio_transport_do_dequeue(struct vsock_sock *vsk,
> > +			    struct iovec *iov,
> > +			    size_t len,
> > +			    int type)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt *pkt;
> > +	size_t bytes, total = 0;
> > +	int err = -EFAULT;
> > +
> > +	mutex_lock(&trans->rx_lock);
> > +	while (total < len && trans->rx_bytes > 0  &&
> > +			!list_empty(&trans->rx_queue)) {
> > +		pkt = list_first_entry(&trans->rx_queue,
> > +				       struct virtio_vsock_pkt, list);
> > +
> > +		if (pkt->hdr.type != type)
> > +			continue;
> > +
> > +		bytes = len - total;
> > +		if (bytes > pkt->len - pkt->off)
> > +			bytes = pkt->len - pkt->off;
> > +
> > +		err = memcpy_toiovec(iov, pkt->buf + pkt->off, bytes);
> > +		if (err)
> > +			goto out;
> > +		total += bytes;
> > +		pkt->off += bytes;
> > +		if (pkt->off == pkt->len) {
> > +			virtio_transport_dec_rx_pkt(pkt);
> > +			list_del(&pkt->list);
> > +			virtio_transport_free_pkt(pkt);
> > +		}
> > +	}
> > +	mutex_unlock(&trans->rx_lock);
> > +
> > +	/* Send a credit pkt to peer */
> > +	if (type == SOCK_STREAM)
> > +		virtio_transport_send_credit(vsk);
> > +
> > +	return total;
> > +
> > +out:
> > +	mutex_unlock(&trans->rx_lock);
> > +	if (total)
> > +		err = total;
> > +	return err;
> > +}
> > +
> > +ssize_t
> > +virtio_transport_stream_dequeue(struct vsock_sock *vsk,
> > +				struct iovec *iov,
> > +				size_t len, int flags)
> > +{
> > +	if (flags & MSG_PEEK)
> > +		return -EOPNOTSUPP;
> > +
> > +	return virtio_transport_do_dequeue(vsk, iov, len, SOCK_STREAM);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_stream_dequeue);
> > +
> > +static void
> > +virtio_transport_recv_dgram(struct sock *sk,
> > +			    struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct sk_buff *skb;
> > +	struct vsock_sock *vsk;
> > +	size_t size;
> > +
> > +	vsk = vsock_sk(sk);
> > +
> > +	pkt->len = pkt->hdr.len;
> > +	pkt->off = 0;
> > +
> > +	size = sizeof(*pkt) + pkt->len;
> > +	/* Attach the packet to the socket's receive queue as an sk_buff. */
> > +	skb = alloc_skb(size, GFP_ATOMIC);
> > +	if (!skb)
> > +		goto out;
> > +
> > +	/* sk_receive_skb() will do a sock_put(), so hold here. */
> > +	sock_hold(sk);
> > +	skb_put(skb, size);
> > +	memcpy(skb->data, pkt, sizeof(*pkt));
> > +	memcpy(skb->data + sizeof(*pkt), pkt->buf, pkt->len);
> > +
> > +	sk_receive_skb(sk, skb, 0);
> > +out:
> > +	virtio_transport_free_pkt(pkt);
> > +}
> > +
> > +int
> > +virtio_transport_dgram_dequeue(struct kiocb *kiocb,
> > +			       struct vsock_sock *vsk,
> > +			       struct msghdr *msg,
> > +			       size_t len, int flags)
> > +{
> > +	struct virtio_vsock_pkt *pkt;
> > +	struct sk_buff *skb;
> > +	size_t payload_len;
> > +	int noblock;
> > +	int err;
> > +
> > +	noblock = flags & MSG_DONTWAIT;
> > +
> > +	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> > +		return -EOPNOTSUPP;
> > +
> > +	msg->msg_namelen = 0;
> > +
> > +	/* Retrieve the head sk_buff from the socket's receive queue. */
> > +	err = 0;
> > +	skb = skb_recv_datagram(&vsk->sk, flags, noblock, &err);
> > +	if (err)
> > +		return err;
> > +	if (!skb)
> > +		return -EAGAIN;
> > +
> > +	pkt = (struct virtio_vsock_pkt *)skb->data;
> > +	if (!pkt)
> > +		goto out;
> > +
> > +	/* FIXME: check payload_len */
> > +	payload_len = pkt->len;
> > +
> > +	/* Place the datagram payload in the user's iovec. */
> > +	err = skb_copy_datagram_iovec(skb, sizeof(*pkt),
> > +				      msg->msg_iov, payload_len);
> > +	if (err)
> > +		goto out;
> > +
> > +	if (msg->msg_name) {
> > +		struct sockaddr_vm *vm_addr;
> > +
> > +		/* Provide the address of the sender. */
> > +		vm_addr = (struct sockaddr_vm *)msg->msg_name;
> > +		vsock_addr_init(vm_addr, pkt->hdr.src_cid, pkt->hdr.src_port);
> > +		msg->msg_namelen = sizeof(*vm_addr);
> > +	}
> > +	err = payload_len;
> > +
> > +out:
> > +	skb_free_datagram(&vsk->sk, skb);
> > +	return err;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_dequeue);
> > +
> > +s64 virtio_transport_stream_has_data(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	size_t bytes = 0;
> > +
> > +	mutex_lock(&trans->rx_lock);
> > +	bytes = trans->rx_bytes;
> > +	mutex_unlock(&trans->rx_lock);
> > +
> > +	return bytes;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_stream_has_data);
> > +
> > +static s64 __virtio_transport_stream_has_space(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	size_t bytes = 0;
> > +
> > +	bytes = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
> > +	if (bytes < 0)
> > +		bytes = 0;
> > +
> > +	return bytes;
> > +}
> > +
> > +s64 virtio_transport_stream_has_space(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	size_t bytes = 0;
> > +
> > +	mutex_lock(&trans->tx_lock);
> > +	bytes = trans->peer_buf_alloc - (trans->tx_cnt - trans->peer_fwd_cnt);
> > +	if (bytes < 0)
> > +		bytes = 0;
> > +	mutex_unlock(&trans->tx_lock);
> > +	pr_debug("%s: bytes=%ld\n", __func__, bytes);
> > +
> > +	return bytes;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_stream_has_space);
> > +
> > +int virtio_transport_do_socket_init(struct vsock_sock *vsk,
> > +				    struct vsock_sock *psk)
> > +{
> > +	struct virtio_transport *trans;
> > +
> > +	trans = kzalloc(sizeof(*trans), GFP_KERNEL);
> > +	if (!trans)
> > +		return -ENOMEM;
> > +
> > +	vsk->trans = trans;
> > +	trans->vsk = vsk;
> > +	if (psk) {
> > +		struct virtio_transport *ptrans = psk->trans;
> > +		trans->buf_size	= ptrans->buf_size;
> > +		trans->buf_size_min = ptrans->buf_size_min;
> > +		trans->buf_size_max = ptrans->buf_size_max;
> > +	} else {
> > +		trans->buf_size = VIRTIO_VSOCK_DEFAULT_BUF_SIZE;
> > +		trans->buf_size_min = VIRTIO_VSOCK_DEFAULT_MIN_BUF_SIZE;
> > +		trans->buf_size_max = VIRTIO_VSOCK_DEFAULT_MAX_BUF_SIZE;
> > +	}
> > +
> > +	trans->buf_alloc = trans->buf_size;
> > +	pr_debug("%s: trans->buf_alloc=%lld\n", __func__, trans->buf_alloc);
> > +
> > +	mutex_init(&trans->rx_lock);
> > +	mutex_init(&trans->tx_lock);
> > +	INIT_LIST_HEAD(&trans->rx_queue);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_do_socket_init);
> > +
> > +u64 virtio_transport_get_buffer_size(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	return trans->buf_size;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_get_buffer_size);
> > +
> > +u64 virtio_transport_get_min_buffer_size(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	return trans->buf_size_min;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_get_min_buffer_size);
> > +
> > +u64 virtio_transport_get_max_buffer_size(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	return trans->buf_size_max;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_get_max_buffer_size);
> > +
> > +void virtio_transport_set_buffer_size(struct vsock_sock *vsk, u64 val)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	if (val < trans->buf_size_min)
> > +		trans->buf_size_min = val;
> > +	if (val > trans->buf_size_max)
> > +		trans->buf_size_max = val;
> > +	trans->buf_size = val;
> > +	trans->buf_alloc = val;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_set_buffer_size);
> > +
> > +void virtio_transport_set_min_buffer_size(struct vsock_sock *vsk, u64 val)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	if (val > trans->buf_size)
> > +		trans->buf_size = val;
> > +	trans->buf_size_min = val;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_set_min_buffer_size);
> > +
> > +void virtio_transport_set_max_buffer_size(struct vsock_sock *vsk, u64 val)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	if (val < trans->buf_size)
> > +		trans->buf_size = val;
> > +	trans->buf_size_max = val;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_set_max_buffer_size);
> > +
> > +int
> > +virtio_transport_notify_poll_in(struct vsock_sock *vsk,
> > +				size_t target,
> > +				bool *data_ready_now)
> > +{
> > +	if (vsock_stream_has_data(vsk))
> > +		*data_ready_now = true;
> > +	else
> > +		*data_ready_now = false;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_in);
> > +
> > +int
> > +virtio_transport_notify_poll_out(struct vsock_sock *vsk,
> > +				 size_t target,
> > +				 bool *space_avail_now)
> > +{
> > +	s64 free_space;
> > +
> > +	free_space = vsock_stream_has_space(vsk);
> > +	if (free_space > 0)
> > +		*space_avail_now = true;
> > +	else if (free_space == 0)
> > +		*space_avail_now = false;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_poll_out);
> > +
> > +int virtio_transport_notify_recv_init(struct vsock_sock *vsk,
> > +	size_t target, struct vsock_transport_recv_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_init);
> > +
> > +int virtio_transport_notify_recv_pre_block(struct vsock_sock *vsk,
> > +	size_t target, struct vsock_transport_recv_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_block);
> > +
> > +int virtio_transport_notify_recv_pre_dequeue(struct vsock_sock *vsk,
> > +	size_t target, struct vsock_transport_recv_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_pre_dequeue);
> > +
> > +int virtio_transport_notify_recv_post_dequeue(struct vsock_sock *vsk,
> > +	size_t target, ssize_t copied, bool data_read,
> > +	struct vsock_transport_recv_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_recv_post_dequeue);
> > +
> > +int virtio_transport_notify_send_init(struct vsock_sock *vsk,
> > +	struct vsock_transport_send_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_init);
> > +
> > +int virtio_transport_notify_send_pre_block(struct vsock_sock *vsk,
> > +	struct vsock_transport_send_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_block);
> > +
> > +int virtio_transport_notify_send_pre_enqueue(struct vsock_sock *vsk,
> > +	struct vsock_transport_send_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_pre_enqueue);
> > +
> > +int virtio_transport_notify_send_post_enqueue(struct vsock_sock *vsk,
> > +	ssize_t written, struct vsock_transport_send_notify_data *data)
> > +{
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_notify_send_post_enqueue);
> > +
> > +u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	return trans->buf_size;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_stream_rcvhiwat);
> > +
> > +bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
> > +{
> > +	return true;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
> > +
> > +bool virtio_transport_stream_allow(u32 cid, u32 port)
> > +{
> > +	return true;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
> > +
> > +int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> > +				struct sockaddr_vm *addr)
> > +{
> > +	return vsock_bind_dgram_generic(vsk, addr);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
> > +
> > +bool virtio_transport_dgram_allow(u32 cid, u32 port)
> > +{
> > +	return true;
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_allow);
> > +
> > +int virtio_transport_connect(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_REQUEST,
> > +		.type = SOCK_STREAM,
> > +	};
> > +
> > +	pr_debug("%s: vsk=%p send_request\n", __func__, vsk);
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_connect);
> > +
> > +int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_SHUTDOWN,
> > +		.type = SOCK_STREAM,
> > +		.shut = mode,
> > +	};
> > +
> > +	pr_debug("%s: vsk=%p: send_shutdown\n", __func__, vsk);
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_shutdown);
> > +
> > +void virtio_transport_release(struct vsock_sock *vsk)
> > +{
> > +	pr_debug("%s: vsk=%p\n", __func__, vsk);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_release);
> > +
> > +int
> > +virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
> > +			       struct sockaddr_vm *remote_addr,
> > +			       struct iovec *iov,
> > +			       size_t len)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_RW,
> > +		.type = SOCK_DGRAM,
> > +		.iov = iov,
> > +		.len = len,
> > +	};
> > +
> > +	vsk->remote_addr = *remote_addr;
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_dgram_enqueue);
> > +
> > +ssize_t
> > +virtio_transport_stream_enqueue(struct vsock_sock *vsk,
> > +				struct iovec *iov,
> > +				size_t len)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_RW,
> > +		.type = SOCK_STREAM,
> > +		.iov = iov,
> > +		.len = len,
> > +	};
> > +
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_stream_enqueue);
> > +
> > +void virtio_transport_destruct(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +
> > +	pr_debug("%s: vsk=%p\n", __func__, vsk);
> > +	kfree(trans);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_destruct);
> > +
> > +static int virtio_transport_send_attach(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_ATTACH,
> > +		.type = SOCK_STREAM,
> > +	};
> > +
> > +	pr_debug("%s: vsk=%p send_attach\n", __func__, vsk);
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +
> > +static int virtio_transport_send_offer(struct vsock_sock *vsk)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_OFFER,
> > +		.type = SOCK_STREAM,
> > +	};
> > +
> > +	pr_debug("%s: sk=%p send_offer\n", __func__, vsk);
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +
> > +static int virtio_transport_send_reset(struct vsock_sock *vsk,
> > +				       struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_RST,
> > +		.type = SOCK_STREAM,
> > +	};
> > +
> > +	pr_debug("%s\n", __func__);
> > +
> > +	/* Send RST only if the original pkt is not a RST pkt */
> > +	if (pkt->hdr.op == VIRTIO_VSOCK_OP_RST)
> > +		return 0;
> > +
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +
> > +static int
> > +virtio_transport_recv_connecting(struct sock *sk,
> > +				 struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct vsock_sock *vsk = vsock_sk(sk);
> > +	int err;
> > +	int skerr;
> > +
> > +	pr_debug("%s: vsk=%p\n", __func__, vsk);
> > +	switch (pkt->hdr.op) {
> > +	case VIRTIO_VSOCK_OP_ATTACH:
> > +		pr_debug("%s: got attach\n", __func__);
> > +		sk->sk_state = SS_CONNECTED;
> > +		sk->sk_socket->state = SS_CONNECTED;
> > +		vsock_insert_connected(vsk);
> > +		sk->sk_state_change(sk);
> > +		break;
> > +	case VIRTIO_VSOCK_OP_NEGOTIATE:
> > +		pr_debug("%s: got negotiate and send_offer\n", __func__);
> > +		err = virtio_transport_send_offer(vsk);
> > +		if (err < 0) {
> > +			skerr = -err;
> > +			goto destroy;
> > +		}
> > +		break;
> > +	case VIRTIO_VSOCK_OP_INVALID:
> > +		pr_debug("%s: got invalid\n", __func__);
> > +		break;
> > +	case VIRTIO_VSOCK_OP_RST:
> > +		pr_debug("%s: got rst\n", __func__);
> > +		skerr = ECONNRESET;
> > +		err = 0;
> > +		goto destroy;
> > +	default:
> > +		pr_debug("%s: got def\n", __func__);
> > +		skerr = EPROTO;
> > +		err = -EINVAL;
> > +		goto destroy;
> > +	}
> > +	return 0;
> > +
> > +destroy:
> > +	virtio_transport_send_reset(vsk, pkt);
> > +	sk->sk_state = SS_UNCONNECTED;
> > +	sk->sk_err = skerr;
> > +	sk->sk_error_report(sk);
> > +	return err;
> > +}
> > +
> > +static int
> > +virtio_transport_recv_connected(struct sock *sk,
> > +				struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct vsock_sock *vsk = vsock_sk(sk);
> > +	struct virtio_transport *trans = vsk->trans;
> > +	int err = 0;
> > +
> > +	switch (pkt->hdr.op) {
> > +	case VIRTIO_VSOCK_OP_RW:
> > +		pkt->len = pkt->hdr.len;
> > +		pkt->off = 0;
> > +		pkt->trans = trans;
> > +
> > +		mutex_lock(&trans->rx_lock);
> > +		virtio_transport_inc_rx_pkt(pkt);
> > +		list_add_tail(&pkt->list, &trans->rx_queue);
> > +		mutex_unlock(&trans->rx_lock);
> > +
> > +		sk->sk_data_ready(sk, pkt->len);
> > +		return err;
> > +	case VIRTIO_VSOCK_OP_CREDIT:
> > +		sk->sk_write_space(sk);
> > +		break;
> > +	case VIRTIO_VSOCK_OP_SHUTDOWN:
> > +		pr_debug("%s: got shutdown\n", __func__);
> > +		if (pkt->hdr.shut) {
> > +			vsk->peer_shutdown |= pkt->hdr.shut;
> > +			sk->sk_state_change(sk);
> > +		}
> > +		break;
> > +	case VIRTIO_VSOCK_OP_RST:
> > +		pr_debug("%s: got rst\n", __func__);
> > +		sock_set_flag(sk, SOCK_DONE);
> > +		vsk->peer_shutdown = SHUTDOWN_MASK;
> > +		if (vsock_stream_has_data(vsk) <= 0)
> > +			sk->sk_state = SS_DISCONNECTING;
> > +		sk->sk_state_change(sk);
> > +		break;
> > +	default:
> > +		err = -EINVAL;
> > +		break;
> > +	}
> > +
> > +	virtio_transport_free_pkt(pkt);
> > +	return err;
> > +}
> > +
> > +static int
> > +virtio_transport_send_negotiate(struct vsock_sock *vsk,
> > +				struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct virtio_transport *trans = vsk->trans;
> > +	struct virtio_vsock_pkt_info info = {
> > +		.op = VIRTIO_VSOCK_OP_NEGOTIATE,
> > +		.type = SOCK_STREAM,
> > +	};
> > +
> > +	pr_debug("%s: send_negotiate\n", __func__);
> > +
> > +	return trans->ops->send_pkt(vsk, &info);
> > +}
> > +
> > +/* Handle server socket */
> > +static int
> > +virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct vsock_sock *vsk = vsock_sk(sk);
> > +	struct vsock_sock *vpending;
> > +	struct sock *pending;
> > +	int err;
> > +
> > +	pending = virtio_transport_get_pending(sk, pkt);
> > +	if (pending) {
> > +		pr_debug("virtio_transport_recv_listen: get pending\n");
> > +		vpending = vsock_sk(pending);
> > +		lock_sock(pending);
> > +		switch (pending->sk_state) {
> > +		case SS_CONNECTING:
> > +			if (pkt->hdr.op != VIRTIO_VSOCK_OP_OFFER) {
> > +				pr_debug("%s: != OP_OFFER op=%d\n",
> > +					 __func__, pkt->hdr.op);
> > +				virtio_transport_send_reset(vpending, pkt);
> > +				pending->sk_err = EPROTO;
> > +				pending->sk_state = SS_UNCONNECTED;
> > +				sock_put(pending);
> > +			} else {
> > +				pending->sk_state = SS_CONNECTED;
> > +				vsock_insert_connected(vpending);
> > +
> > +				vsock_remove_pending(sk, pending);
> > +				vsock_enqueue_accept(sk, pending);
> > +
> > +				virtio_transport_send_attach(vpending);
> > +				sk->sk_state_change(sk);
> > +			}
> > +			err = 0;
> > +			break;
> > +		default:
> > +			pr_debug("%s: sk->sk_ack_backlog=%d\n", __func__,
> > +				 sk->sk_ack_backlog);
> > +			virtio_transport_send_reset(vpending, pkt);
> > +			err = -EINVAL;
> > +			break;
> > +		}
> > +		if (err < 0)
> > +			vsock_remove_pending(sk, pending);
> > +		release_sock(pending);
> > +
> > +		/* Release refcnt obtained in virtio_transport_get_pending */
> > +		sock_put(pending);
> > +
> > +		return err;
> > +	}
> > +
> > +	if (pkt->hdr.op != VIRTIO_VSOCK_OP_REQUEST) {
> > +		virtio_transport_send_reset(vsk, pkt);
> > +		pr_debug("%s:op != OP_REQUEST op = %d\n",
> > +			 __func__, pkt->hdr.op);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
> > +		pr_debug("%s: sk->sk_ack_backlog=%d\n", __func__,
> > +			 sk->sk_ack_backlog);
> > +		virtio_transport_send_reset(vsk, pkt);
> > +		return -ECONNREFUSED;
> > +	}
> > +
> > +	/* So no pending socket are responsible for this pkt, create one */
> > +	pending = __vsock_create(sock_net(sk), NULL, sk, GFP_KERNEL,
> > +				 sk->sk_type);
> > +	if (!pending) {
> > +		virtio_transport_send_reset(vsk, pkt);
> > +		return -ENOMEM;
> > +	}
> > +	pr_debug("virtio_transport_recv_listen: create pending\n");
> > +
> > +	vpending = vsock_sk(pending);
> > +	vsock_addr_init(&vpending->local_addr, pkt->hdr.dst_cid,
> > +			pkt->hdr.dst_port);
> > +	vsock_addr_init(&vpending->remote_addr, pkt->hdr.src_cid,
> > +			pkt->hdr.src_port);
> > +
> > +	vsock_add_pending(sk, pending);
> > +
> > +	err = virtio_transport_send_negotiate(vpending, pkt);
> > +	if (err < 0) {
> > +		virtio_transport_send_reset(vsk, pkt);
> > +		sock_put(pending);
> > +		return err;
> > +	}
> > +
> > +	sk->sk_ack_backlog++;
> > +
> > +	pending->sk_state = SS_CONNECTING;
> > +
> > +	/* Clean up in case no further message is received for this socket */
> > +	vpending->listener = sk;
> > +	sock_hold(sk);
> > +	sock_hold(pending);
> > +	INIT_DELAYED_WORK(&vpending->dwork, vsock_pending_work);
> > +	schedule_delayed_work(&vpending->dwork, HZ);
> > +
> > +	return 0;
> > +}
> > +
> > +void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +	struct virtio_transport *trans;
> > +	struct sockaddr_vm src, dst;
> > +	struct vsock_sock *vsk;
> > +	struct sock *sk;
> > +
> > +	vsock_addr_init(&src, pkt->hdr.src_cid, pkt->hdr.src_port);
> > +	vsock_addr_init(&dst, pkt->hdr.dst_cid, pkt->hdr.dst_port);
> > +
> > +	virtio_vsock_dumppkt(__func__, pkt);
> > +
> > +	if (pkt->hdr.type == SOCK_DGRAM) {
> > +		sk = vsock_find_unbound_socket(&dst);
> > +		if (!sk)
> > +			goto free_pkt;
> > +		return virtio_transport_recv_dgram(sk, pkt);
> > +	}
> > +
> > +	/* The socket must be in connected or bound table
> > +	 * otherwise send reset back
> > +	 */
> > +	sk = vsock_find_connected_socket(&src, &dst);
> > +	if (!sk) {
> > +		sk = vsock_find_bound_socket(&dst);
> > +		if (!sk) {
> > +			pr_debug("%s: can not find bound_socket\n", __func__);
> > +			virtio_vsock_dumppkt(__func__, pkt);
> > +			/* Ignore this pkt instead of sending reset back */
> > +			goto free_pkt;
> > +		}
> > +	}
> > +
> > +	vsk = vsock_sk(sk);
> > +	trans = vsk->trans;
> > +	BUG_ON(!trans);
> > +
> > +	mutex_lock(&trans->tx_lock);
> > +	trans->peer_buf_alloc = pkt->hdr.buf_alloc;
> > +	trans->peer_fwd_cnt = pkt->hdr.fwd_cnt;
> > +	if (__virtio_transport_stream_has_space(vsk))
> > +		sk->sk_write_space(sk);
> > +	mutex_unlock(&trans->tx_lock);
> > +
> > +	lock_sock(sk);
> > +	switch (sk->sk_state) {
> > +	case SS_LISTEN:
> > +		virtio_transport_recv_listen(sk, pkt);
> > +		virtio_transport_free_pkt(pkt);
> > +		break;
> > +	case SS_CONNECTING:
> > +		virtio_transport_recv_connecting(sk, pkt);
> > +		virtio_transport_free_pkt(pkt);
> > +		break;
> > +	case SS_CONNECTED:
> > +		virtio_transport_recv_connected(sk, pkt);
> > +		break;
> > +	default:
> > +		break;
> > +	}
> > +	release_sock(sk);
> > +
> > +	/* Release refcnt obtained when we fetched this socket out of the
> > +	 * bound or connected list.
> > +	 */
> > +	sock_put(sk);
> > +	return;
> > +
> > +free_pkt:
> > +	virtio_transport_free_pkt(pkt);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
> > +
> > +void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +	kfree(pkt->buf);
> > +	kfree(pkt);
> > +}
> > +EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
> > +
> > +static int __init virtio_vsock_common_init(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +static void __exit virtio_vsock_common_exit(void)
> > +{
> > +}
> > +
> > +module_init(virtio_vsock_common_init);
> > +module_exit(virtio_vsock_common_exit);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR("Asias He");
> > +MODULE_DESCRIPTION("common code for virtio vsock");
> > -- 
> > 1.8.1.4

-- 
Asias

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC 4/5] VSOCK: Introduce vhost-vsock.ko
  2013-06-27 10:42   ` Michael S. Tsirkin
  2013-06-28  2:38     ` Andy King
  2013-06-28  2:38     ` Andy King
@ 2013-06-28  6:55     ` Asias He
  2 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-28  6:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andy King, kvm, Dmitry Torokhov, netdev, Reilly Grant,
	virtualization, Pekka Enberg, Sasha Levin, David S. Miller

On Thu, Jun 27, 2013 at 01:42:46PM +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 27, 2013 at 04:00:03PM +0800, Asias He wrote:
> > VM sockets vhost transport implementation. This module runs in host
> > kernel.
> > 
> > Signed-off-by: Asias He <asias@redhat.com>
> 
> Has any thought been given to how this affects migration?
> I don't see any API for an application to
> move to a different host and reconnect to a running
> vsock in guest.
> 
> I think we could merge without this, there are more
> pressing issues, but it's probably a requirement
> if you want this to replace e.g. serial in many
> scenarious.

I do not plan to support migration for the initial merge either.

Reconnection is one issue that needs to be addressed. Another issue is that
if the destination host is already running vhost-vsock, the port might
already be in use. We probably need namespace support.

> > ---
> >  drivers/vhost/vsock.c | 534 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  drivers/vhost/vsock.h |   4 +
> >  2 files changed, 538 insertions(+)
> >  create mode 100644 drivers/vhost/vsock.c
> >  create mode 100644 drivers/vhost/vsock.h
> > 
> > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> > new file mode 100644
> > index 0000000..cb54090
> > --- /dev/null
> > +++ b/drivers/vhost/vsock.c
> > @@ -0,0 +1,534 @@
> > +/*
> > + * vhost transport for vsock
> > + *
> > + * Copyright (C) 2013 Red Hat, Inc.
> > + * Author: Asias He <asias@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.
> > + */
> > +#include <linux/miscdevice.h>
> > +#include <linux/module.h>
> > +#include <linux/mutex.h>
> > +#include <net/sock.h>
> > +#include <linux/virtio_vsock.h>
> > +#include <linux/vhost.h>
> > +
> > +#include "../../../net/vmw_vsock/af_vsock.h"
> 
> Send patch to move this to include/linux ?

Okay. Will cook a patch.

> > +#include "vhost.h"
> > +#include "vsock.h"
> > +
> > +#define VHOST_VSOCK_DEFAULT_HOST_CID	2;
> 
> Sure you want that ; there? This can result in strange code, e.g.
> 
> 	int a = VHOST_VSOCK_DEFAULT_HOST_CID + 1;
> 	set's a to 2.

Fixed.

> > +
> > +static int vhost_transport_socket_init(struct vsock_sock *vsk,
> > +				       struct vsock_sock *psk);
> > +
> > +enum {
> > +	VHOST_VSOCK_FEATURES = VHOST_FEATURES,
> > +};
> > +
> > +/* Used to track all the vhost_vsock instacne on the system. */
> 
> typo

Fixed.

> > +static LIST_HEAD(vhost_vsock_list);
> > +static DEFINE_MUTEX(vhost_vsock_mutex);
> > +
> > +struct vhost_vsock_virtqueue {
> > +	struct vhost_virtqueue vq;
> > +};
> > +
> > +struct vhost_vsock {
> > +	/* Vhost device */
> > +	struct vhost_dev dev;
> > +	/* Vhost vsock virtqueue*/
> > +	struct vhost_vsock_virtqueue vqs[VSOCK_VQ_MAX];
> > +	/* Link to global vhost_vsock_list*/
> > +	struct list_head list;
> > +	/* Head for pkt from host to guest */
> > +	struct list_head send_pkt_list;
> > +	/* Work item to send pkt */
> > +	struct vhost_work send_pkt_work;
> > +	/* Guest contex id this vhost_vsock instance handles */
> > +	u32 guest_cid;
> > +};
> > +
> > +static u32 vhost_transport_get_local_cid(void)
> > +{
> > +	u32 cid = VHOST_VSOCK_DEFAULT_HOST_CID;
> > +	return cid;
> > +}
> > +
> 
> Interesting. So all hosts in fact have the same CID?

Yes. Andy commented on this already.

> > +static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> > +{
> > +	struct vhost_vsock *vsock;
> > +
> > +	mutex_lock(&vhost_vsock_mutex);
> > +	list_for_each_entry(vsock, &vhost_vsock_list, list) {
> > +		if (vsock->guest_cid == guest_cid) {
> > +			mutex_unlock(&vhost_vsock_mutex);
> > +			return vsock;
> > +		}
> > +	}
> > +	mutex_unlock(&vhost_vsock_mutex);
> > +
> > +	return NULL;
> > +}
> > +
> > +static void
> > +vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> > +			    struct vhost_virtqueue *vq)
> > +{
> > +	struct virtio_vsock_pkt *pkt;
> > +	unsigned out, in;
> > +	struct sock *sk;
> > +	int head, ret;
> > +
> > +	mutex_lock(&vq->mutex);
> > +	vhost_disable_notify(&vsock->dev, vq);
> > +	for (;;) {
> > +		if (list_empty(&vsock->send_pkt_list)) {
> > +			vhost_enable_notify(&vsock->dev, vq);
> > +			break;
> > +		}
> > +
> > +		head = vhost_get_vq_desc(&vsock->dev, vq, vq->iov,
> > +					ARRAY_SIZE(vq->iov), &out, &in,
> > +					NULL, NULL);
> > +		pr_debug("%s: head = %d\n", __func__, head);
> > +		if (head < 0)
> > +			break;
> > +
> > +		if (head == vq->num) {
> > +			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
> > +				vhost_disable_notify(&vsock->dev, vq);
> > +				continue;
> > +			}
> > +			break;
> > +		}
> > +
> > +		pkt = list_first_entry(&vsock->send_pkt_list,
> > +				       struct virtio_vsock_pkt, list);
> > +		list_del_init(&pkt->list);
> > +
> > +		/* FIXME: no assumption of frame layout */
> 
> Pls fix. memcpy_from_iovec is not harder.

Do we have this helper?
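
If so, something along these lines (sketch, assuming memcpy_toiovecend()
from net/core/iovec.c is what you have in mind) would avoid hard-coding
the hdr into iov[0] and the data into iov[1]:

	ret = memcpy_toiovecend(vq->iov, (unsigned char *)&pkt->hdr,
				0, sizeof(pkt->hdr));
	if (!ret && pkt->buf && pkt->len > 0)
		ret = memcpy_toiovecend(vq->iov, pkt->buf,
					sizeof(pkt->hdr), pkt->len);
	if (ret) {
		virtio_transport_free_pkt(pkt);
		vq_err(vq, "Faulted on copying pkt\n");
		break;
	}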

> > +		ret = __copy_to_user(vq->iov[0].iov_base, &pkt->hdr,
> > +				     sizeof(pkt->hdr));
> > +		if (ret) {
> > +			virtio_transport_free_pkt(pkt);
> > +			vq_err(vq, "Faulted on copying pkt hdr\n");
> > +			break;
> > +		}
> > +		if (pkt->buf && pkt->len > 0) {
> > +			ret = __copy_to_user(vq->iov[1].iov_base, pkt->buf,
> > +					    pkt->len);
> > +			if (ret) {
> > +				virtio_transport_free_pkt(pkt);
> > +				vq_err(vq, "Faulted on copying pkt buf\n");
> > +				break;
> > +			}
> > +		}
> > +
> > +		vhost_add_used(vq, head, pkt->len);
> > +
> > +		virtio_transport_dec_tx_pkt(pkt);
> > +
> > +		sk = sk_vsock(pkt->trans->vsk);
> > +		/* Release refcnt taken in vhost_transport_send_pkt */
> > +		sock_put(sk);
> > +
> > +		virtio_transport_free_pkt(pkt);
> > +	}
> > +	vhost_signal(&vsock->dev, vq);
> 
> I think you should not signal if used was not updated.

Right, it is very easy to add the optimization here.
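
Something along these lines (untested sketch against the loop above):

    bool added = false;

    /* inside the descriptor loop, after the buffer is filled */
    vhost_add_used(vq, head, pkt->len);
    added = true;

    /* after the loop: only kick the guest if used was updated */
    if (added)
        vhost_signal(&vsock->dev, vq);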

> > +	mutex_unlock(&vq->mutex);
> > +}
> > +
> > +static void vhost_transport_send_pkt_work(struct vhost_work *work)
> > +{
> > +	struct vhost_virtqueue *vq;
> > +	struct vhost_vsock *vsock;
> > +
> > +	vsock = container_of(work, struct vhost_vsock, send_pkt_work);
> > +	vq = &vsock->vqs[VSOCK_VQ_RX].vq;
> > +
> > +	vhost_transport_do_send_pkt(vsock, vq);
> > +}
> > +
> > +static int
> > +vhost_transport_send_pkt(struct vsock_sock *vsk,
> > +			 struct virtio_vsock_pkt_info *info)
> > +{
> > +	u32 src_cid, src_port, dst_cid, dst_port;
> > +	struct virtio_transport *trans;
> > +	struct virtio_vsock_pkt *pkt;
> > +	struct vhost_virtqueue *vq;
> > +	struct vhost_vsock *vsock;
> > +	u64 credit;
> > +
> > +	src_cid = vhost_transport_get_local_cid();
> 
> interestingly this is the only place cid
> is used. Shouldn't we validate it?

The local cid is a constant; how can we validate it?

> > +	src_port = vsk->local_addr.svm_port;
> > +	dst_cid = vsk->remote_addr.svm_cid;
> > +	dst_port = vsk->remote_addr.svm_port;
> > +
> > +	/* Find the vhost_vsock according to guest context id  */
> > +	vsock = vhost_vsock_get(dst_cid);
> 
> Confused. There's a single socket per dst cid?

No. Each guest has a cid and its own struct vhost_vsock instance;
dst_cid tells us which struct vhost_vsock instance to use for this
packet.

> > +	if (!vsock)
> > +		return -ENODEV;
> > +
> > +	trans = vsk->trans;
> > +	vq = &vsock->vqs[VSOCK_VQ_RX].vq;
> > +
> > +	if (info->type == SOCK_STREAM) {
> > +		credit = virtio_transport_get_credit(trans);
> > +		if (info->len > credit)
> > +			info->len = credit;
> 
> Is there support for non stream sockets?
> Without credits, you get all kind of nasty
> starvation issues.

We support SOCK_STREAM and SOCK_DGRAM. The credit mechanism is only
used for SOCK_STREAM right now; I can extend it to SOCK_DGRAM as well.

> > +	}
> > +	if (info->len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
> > +		info->len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;
> > +	/* Do not send zero length OP_RW pkt*/
> > +	if (info->len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> > +		return info->len;
> > +
> > +	pkt = virtio_transport_alloc_pkt(vsk, info, info->len,
> > +					 src_cid, src_port,
> > +					 dst_cid, dst_port);
> 
> We also need global limit on amount of memory per
> socket. Even if remote is OK with getting 20G from us,
> we might not have so much kernel memory.

Yes, we need a global limit.
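
Roughly what I have in mind is a module-wide counter, e.g. (names made
up, untested sketch):

    /* Hypothetical global cap on memory queued for vsock packets */
    #define VHOST_VSOCK_TOTAL_BUF_MAX   (64 * 1024 * 1024)

    static atomic64_t vhost_vsock_total_buf = ATOMIC64_INIT(0);

    static bool vhost_vsock_charge(u32 len)
    {
            if (atomic64_add_return(len, &vhost_vsock_total_buf) >
                VHOST_VSOCK_TOTAL_BUF_MAX) {
                    atomic64_sub(len, &vhost_vsock_total_buf);
                    return false;
            }
            return true;
    }

    static void vhost_vsock_uncharge(u32 len)
    {
            atomic64_sub(len, &vhost_vsock_total_buf);
    }

vhost_transport_send_pkt() would charge before allocating the pkt and
fail (or wait) when the cap is hit; the pkt free path would uncharge.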

> > +	if (!pkt)
> > +		return -ENOMEM;
> > +
> > +	pr_debug("%s:info->len= %d\n", __func__, info->len);
> > +	/* Released in vhost_transport_do_send_pkt */
> > +	sock_hold(&trans->vsk->sk);
> > +	virtio_transport_inc_tx_pkt(pkt);
> > +
> > +	/* queue it up in vhost work */
> > +	mutex_lock(&vq->mutex);
> > +	list_add_tail(&pkt->list, &vsock->send_pkt_list);
> > +	vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
> > +	mutex_unlock(&vq->mutex);
> > +
> > +	return info->len;
> > +}
> > +
> > +static struct virtio_transport_pkt_ops vhost_ops = {
> > +	.send_pkt = vhost_transport_send_pkt,
> > +};
> > +
> > +static struct virtio_vsock_pkt *
> > +vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq)
> > +{
> > +	struct virtio_vsock_pkt *pkt;
> > +	int ret;
> > +	int len;
> > +
> > +	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> > +	if (!pkt)
> > +		return NULL;
> > +
> > +	len = sizeof(pkt->hdr);
> > +	if (unlikely(vq->iov[0].iov_len != len)) {
> > +		vq_err(vq, "Expecting pkt->hdr = %d, got %zu bytes\n",
> > +		       len, vq->iov[0].iov_len);
> > +		kfree(pkt);
> > +		return NULL;
> > +	}
> > +	ret = __copy_from_user(&pkt->hdr, vq->iov[0].iov_base, len);
> > +	if (ret) {
> > +		vq_err(vq, "Faulted on virtio_vsock_hdr\n");
> > +		kfree(pkt);
> > +		return NULL;
> > +	}
> > +
> > +	pkt->len = pkt->hdr.len;
> > +	pkt->off = 0;
> > +
> > +	/* No payload */
> > +	if (!pkt->len)
> > +		return pkt;
> > +
> > +	/* The pkt is too big */
> > +	if (pkt->len > VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) {
> > +		kfree(pkt);
> > +		return NULL;
> > +	}
> > +
> > +	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
> > +	if (!pkt->buf) {
> > +		kfree(pkt);
> > +		return NULL;
> > +	}
> > +
> > +	ret = __copy_from_user(pkt->buf, vq->iov[1].iov_base, pkt->len);
> > +	if (ret) {
> > +		vq_err(vq, "Faulted on virtio_vsock_hdr\n");
> > +		virtio_transport_free_pkt(pkt);
> > +	}
> > +
> > +	return pkt;
> > +}
> > +
> > +static void vhost_vsock_handle_ctl_kick(struct vhost_work *work)
> > +{
> > +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> > +						  poll.work);
> > +	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
> > +						 dev);
> > +
> > +	pr_debug("%s vq=%p, vsock=%p\n", __func__, vq, vsock);
> > +}
> > +
> > +static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
> > +{
> > +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> > +						  poll.work);
> > +	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
> > +						 dev);
> > +	struct virtio_vsock_pkt *pkt;
> > +	int head, out, in;
> > +	u32 len;
> > +
> > +	mutex_lock(&vq->mutex);
> > +	vhost_disable_notify(&vsock->dev, vq);
> > +	for (;;) {
> > +		head = vhost_get_vq_desc(&vsock->dev, vq, vq->iov,
> > +					ARRAY_SIZE(vq->iov), &out, &in,
> > +					NULL, NULL);
> > +		if (head < 0)
> > +			break;
> > +
> > +		if (head == vq->num) {
> > +			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
> > +				vhost_disable_notify(&vsock->dev, vq);
> > +				continue;
> > +			}
> > +			break;
> > +		}
> > +
> > +		pkt = vhost_vsock_alloc_pkt(vq);
> > +		if (!pkt) {
> > +			vq_err(vq, "Faulted on pkt\n");
> > +			continue;
> > +		}
> > +
> > +		len = pkt->len;
> > +		virtio_transport_recv_pkt(pkt);
> > +		vhost_add_used(vq, head, len);
> > +	}
> > +	vhost_signal(&vsock->dev, vq);
> > +	mutex_unlock(&vq->mutex);
> > +}
> > +
> > +static void vhost_vsock_handle_rx_kick(struct vhost_work *work)
> > +{
> > +	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
> > +						poll.work);
> > +	struct vhost_vsock *vsock = container_of(vq->dev, struct vhost_vsock,
> > +						 dev);
> > +
> > +	vhost_transport_do_send_pkt(vsock, vq);
> > +}
> > +
> > +static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
> > +{
> > +	struct vhost_virtqueue **vqs;
> > +	struct vhost_vsock *vsock;
> > +	int ret;
> > +
> > +	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL);
> > +	if (!vsock)
> > +		return -ENOMEM;
> > +
> > +	pr_debug("%s:vsock=%p\n", __func__, vsock);
> > +
> > +	vqs = kmalloc(VSOCK_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
> > +	if (!vqs) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	vqs[VSOCK_VQ_CTRL] = &vsock->vqs[VSOCK_VQ_CTRL].vq;
> > +	vqs[VSOCK_VQ_TX] = &vsock->vqs[VSOCK_VQ_TX].vq;
> > +	vqs[VSOCK_VQ_RX] = &vsock->vqs[VSOCK_VQ_RX].vq;
> > +	vsock->vqs[VSOCK_VQ_CTRL].vq.handle_kick = vhost_vsock_handle_ctl_kick;
> > +	vsock->vqs[VSOCK_VQ_TX].vq.handle_kick = vhost_vsock_handle_tx_kick;
> > +	vsock->vqs[VSOCK_VQ_RX].vq.handle_kick = vhost_vsock_handle_rx_kick;
> > +
> > +	ret = vhost_dev_init(&vsock->dev, vqs, VSOCK_VQ_MAX);
> > +	if (ret < 0)
> > +		goto out_vqs;
> > +
> > +	file->private_data = vsock;
> > +	INIT_LIST_HEAD(&vsock->send_pkt_list);
> > +	vhost_work_init(&vsock->send_pkt_work, vhost_transport_send_pkt_work);
> > +
> > +	mutex_lock(&vhost_vsock_mutex);
> > +	list_add_tail(&vsock->list, &vhost_vsock_list);
> > +	mutex_unlock(&vhost_vsock_mutex);
> > +	return ret;
> > +
> > +out_vqs:
> > +	kfree(vqs);
> > +out:
> > +	kfree(vsock);
> > +	return ret;
> > +}
> > +
> > +static void vhost_vsock_flush(struct vhost_vsock *vsock)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < VSOCK_VQ_MAX; i++)
> > +		vhost_poll_flush(&vsock->vqs[i].vq.poll);
> > +	vhost_work_flush(&vsock->dev, &vsock->send_pkt_work);
> > +}
> > +
> > +static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
> > +{
> > +	struct vhost_vsock *vsock = file->private_data;
> > +
> > +	mutex_lock(&vhost_vsock_mutex);
> > +	list_del(&vsock->list);
> > +	mutex_unlock(&vhost_vsock_mutex);
> > +
> > +	vhost_dev_stop(&vsock->dev);
> > +	vhost_vsock_flush(vsock);
> > +	vhost_dev_cleanup(&vsock->dev, false);
> > +	kfree(vsock->dev.vqs);
> > +	kfree(vsock);
> > +	return 0;
> > +}
> > +
> > +static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u32 guest_cid)
> > +{
> > +	mutex_lock(&vhost_vsock_mutex);
> > +	vsock->guest_cid = guest_cid;
> > +	pr_debug("%s:guest_cid=%d\n", __func__, guest_cid);
> > +	mutex_unlock(&vhost_vsock_mutex);
> > +
> > +	return 0;
> > +}
> > +
> > +static long vhost_vsock_dev_ioctl(struct file *f, unsigned int ioctl,
> > +				  unsigned long arg)
> > +{
> > +	struct vhost_vsock *vsock = f->private_data;
> > +	void __user *argp = (void __user *)arg;
> > +	u64 __user *featurep = argp;
> > +	u32 __user *cidp = argp;
> > +	u32 guest_cid;
> > +	u64 features;
> > +	int r;
> > +
> > +	switch (ioctl) {
> > +	case VHOST_VSOCK_SET_GUEST_CID:
> > +		if (get_user(guest_cid, cidp))
> > +			return -EFAULT;
> > +		return vhost_vsock_set_cid(vsock, guest_cid);
> > +	case VHOST_GET_FEATURES:
> > +		features = VHOST_VSOCK_FEATURES;
> > +		if (copy_to_user(featurep, &features, sizeof(features)))
> > +			return -EFAULT;
> > +		return 0;
> > +	case VHOST_SET_FEATURES:
> > +		if (copy_from_user(&features, featurep, sizeof(features)))
> > +			return -EFAULT;
> > +		return 0;
> > +	default:
> > +		mutex_lock(&vsock->dev.mutex);
> > +		r = vhost_dev_ioctl(&vsock->dev, ioctl, argp);
> > +		if (r == -ENOIOCTLCMD)
> > +			r = vhost_vring_ioctl(&vsock->dev, ioctl, argp);
> > +		else
> > +			vhost_vsock_flush(vsock);
> > +		mutex_unlock(&vsock->dev.mutex);
> > +		return r;
> > +	}
> > +}
> > +
> > +static const struct file_operations vhost_vsock_fops = {
> > +	.owner          = THIS_MODULE,
> > +	.open           = vhost_vsock_dev_open,
> > +	.release        = vhost_vsock_dev_release,
> > +	.llseek		= noop_llseek,
> > +	.unlocked_ioctl = vhost_vsock_dev_ioctl,
> > +};
> > +
> > +static struct miscdevice vhost_vsock_misc = {
> > +	.minor = MISC_DYNAMIC_MINOR,
> > +	.name = "vhost-vsock",
> > +	.fops = &vhost_vsock_fops,
> > +};
> > +
> > +static int
> > +vhost_transport_socket_init(struct vsock_sock *vsk, struct vsock_sock *psk)
> > +{
> > +	struct virtio_transport *trans;
> > +	int ret;
> > +
> > +	ret = virtio_transport_do_socket_init(vsk, psk);
> > +	if (ret)
> > +		return ret;
> > +
> > +	trans = vsk->trans;
> > +	trans->ops = &vhost_ops;
> > +
> > +	return ret;
> > +}
> > +
> > +static struct vsock_transport vhost_transport = {
> > +	.get_local_cid            = vhost_transport_get_local_cid,
> > +
> > +	.init                     = vhost_transport_socket_init,
> > +	.destruct                 = virtio_transport_destruct,
> > +	.release                  = virtio_transport_release,
> > +	.connect                  = virtio_transport_connect,
> > +	.shutdown                 = virtio_transport_shutdown,
> > +
> > +	.dgram_enqueue            = virtio_transport_dgram_enqueue,
> > +	.dgram_dequeue            = virtio_transport_dgram_dequeue,
> > +	.dgram_bind               = virtio_transport_dgram_bind,
> > +	.dgram_allow              = virtio_transport_dgram_allow,
> > +
> > +	.stream_enqueue           = virtio_transport_stream_enqueue,
> > +	.stream_dequeue           = virtio_transport_stream_dequeue,
> > +	.stream_has_data          = virtio_transport_stream_has_data,
> > +	.stream_has_space         = virtio_transport_stream_has_space,
> > +	.stream_rcvhiwat          = virtio_transport_stream_rcvhiwat,
> > +	.stream_is_active         = virtio_transport_stream_is_active,
> > +	.stream_allow             = virtio_transport_stream_allow,
> > +
> > +	.notify_poll_in           = virtio_transport_notify_poll_in,
> > +	.notify_poll_out          = virtio_transport_notify_poll_out,
> > +	.notify_recv_init         = virtio_transport_notify_recv_init,
> > +	.notify_recv_pre_block    = virtio_transport_notify_recv_pre_block,
> > +	.notify_recv_pre_dequeue  = virtio_transport_notify_recv_pre_dequeue,
> > +	.notify_recv_post_dequeue = virtio_transport_notify_recv_post_dequeue,
> > +	.notify_send_init         = virtio_transport_notify_send_init,
> > +	.notify_send_pre_block    = virtio_transport_notify_send_pre_block,
> > +	.notify_send_pre_enqueue  = virtio_transport_notify_send_pre_enqueue,
> > +	.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
> > +
> > +	.set_buffer_size          = virtio_transport_set_buffer_size,
> > +	.set_min_buffer_size      = virtio_transport_set_min_buffer_size,
> > +	.set_max_buffer_size      = virtio_transport_set_max_buffer_size,
> > +	.get_buffer_size          = virtio_transport_get_buffer_size,
> > +	.get_min_buffer_size      = virtio_transport_get_min_buffer_size,
> > +	.get_max_buffer_size      = virtio_transport_get_max_buffer_size,
> > +};
> > +
> > +static int __init vhost_vsock_init(void)
> > +{
> > +	int ret;
> > +
> > +	ret = vsock_core_init(&vhost_transport);
> > +	if (ret < 0)
> > +		return ret;
> > +	return misc_register(&vhost_vsock_misc);
> > +};
> > +
> > +static void __exit vhost_vsock_exit(void)
> > +{
> > +	misc_deregister(&vhost_vsock_misc);
> > +	vsock_core_exit();
> > +};
> > +
> > +module_init(vhost_vsock_init);
> > +module_exit(vhost_vsock_exit);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR("Asias He");
> > +MODULE_DESCRIPTION("vhost transport for vsock ");
> > diff --git a/drivers/vhost/vsock.h b/drivers/vhost/vsock.h
> > new file mode 100644
> > index 0000000..0ddb107
> > --- /dev/null
> > +++ b/drivers/vhost/vsock.h
> > @@ -0,0 +1,4 @@
> > +#ifndef VHOST_VSOCK_H
> > +#define VHOST_VSOCK_H
> > +#define VHOST_VSOCK_SET_GUEST_CID _IOW(VHOST_VIRTIO, 0x60, __u32)
> 
> No SET without GET please.

But the GET is useless here. We know the guest cid in userspace already.

> > +#endif
> > -- 
> > 1.8.1.4

-- 
Asias

* Re: [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko
  2013-06-27  8:00 ` [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko Asias He
  2013-06-27 10:34   ` Michael S. Tsirkin
@ 2013-06-29  4:32   ` David Miller
  2013-06-29 23:45     ` Asias He
  2013-06-29  4:32   ` David Miller
  2 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2013-06-29  4:32 UTC (permalink / raw)
  To: asias
  Cc: netdev, kvm, virtualization, acking, dtor, grantr, rusty, mst,
	jasowang, stefanha, kraxel, penberg, sasha.levin

From: Asias He <asias@redhat.com>
Date: Thu, 27 Jun 2013 16:00:01 +0800

> +static void
> +virtio_transport_recv_dgram(struct sock *sk,
> +			    struct virtio_vsock_pkt *pkt)
 ...
> +	memcpy(skb->data, pkt, sizeof(*pkt));
> +	memcpy(skb->data + sizeof(*pkt), pkt->buf, pkt->len);

Are you sure this is right?

Shouldn't you be using "sizeof(struct virtio_vsock_hdr)" instead of
"sizeof(*pkt)".  'pkt' is "struct virtio_vsock_pkt" and has all kinds
of meta-data you probably don't mean to include in the SKB.

* Re: [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko
  2013-06-29  4:32   ` David Miller
@ 2013-06-29 23:45     ` Asias He
  0 siblings, 0 replies; 21+ messages in thread
From: Asias He @ 2013-06-29 23:45 UTC (permalink / raw)
  To: David Miller
  Cc: acking, kvm, dtor, netdev, grantr, virtualization, penberg, mst,
	sasha.levin

Hi David,

On Fri, Jun 28, 2013 at 09:32:25PM -0700, David Miller wrote:
> From: Asias He <asias@redhat.com>
> Date: Thu, 27 Jun 2013 16:00:01 +0800
> 
> > +static void
> > +virtio_transport_recv_dgram(struct sock *sk,
> > +			    struct virtio_vsock_pkt *pkt)
>  ...
> > +	memcpy(skb->data, pkt, sizeof(*pkt));
> > +	memcpy(skb->data + sizeof(*pkt), pkt->buf, pkt->len);
> 
> Are you sure this is right?
> 
> Shouldn't you be using "sizeof(struct virtio_vsock_hdr)" instead of
> "sizeof(*pkt)".  'pkt' is "struct virtio_vsock_pkt" and has all kinds
> of meta-data you probably don't mean to include in the SKB.

Right, virtio_vsock_hdr is enough. Will fix this.
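
Something like this, I think (and the skb allocation size would need
the same change):

    /* copy only the wire header, not the in-kernel packet metadata */
    memcpy(skb->data, &pkt->hdr, sizeof(pkt->hdr));
    memcpy(skb->data + sizeof(pkt->hdr), pkt->buf, pkt->len);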

Thanks for looking at this. 

-- 
Asias

Thread overview: 21+ messages
2013-06-27  7:59 [RFC 0/5] Introduce VM Sockets virtio transport Asias He
2013-06-27  8:00 ` [RFC 1/5] VSOCK: Introduce vsock_find_unbound_socket and vsock_bind_dgram_generic Asias He
2013-06-27  8:00 ` [RFC 2/5] VSOCK: Introduce virtio-vsock-common.ko Asias He
2013-06-27 10:34   ` Michael S. Tsirkin
2013-06-28  6:28     ` Asias He
2013-06-29  4:32   ` David Miller
2013-06-29 23:45     ` Asias He
2013-06-29  4:32   ` David Miller
2013-06-27  8:00 ` [RFC 3/5] VSOCK: Introduce virtio-vsock.ko Asias He
2013-06-27  8:00 ` [RFC 4/5] VSOCK: Introduce vhost-vsock.ko Asias He
2013-06-27 10:42   ` Michael S. Tsirkin
2013-06-28  2:38     ` Andy King
2013-06-28  2:38     ` Andy King
2013-06-28  6:55     ` Asias He
2013-06-27  8:00 ` [RFC 5/5] VSOCK: Add Makefile and Kconfig Asias He
2013-06-27 10:23 ` [RFC 0/5] Introduce VM Sockets virtio transport Michael S. Tsirkin
2013-06-28  2:25   ` Andy King
2013-06-28  5:50     ` Asias He
2013-06-28  6:12   ` Asias He
2013-06-27 19:03 ` Sasha Levin
2013-06-28  6:26   ` Asias He
